# load packages
library(tidyverse)
library(tidymodels)
library(NHANES) #data set
library(knitr)
library(patchwork)
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))Multinomial Logistic Regression
Topics
Conditions for logistic regression AE
Multinomial logistic regression
Interpret model coefficients
Inference for a coefficient \(\beta_{jk}\)
Computational setup
Generalized Linear Models
Generalized Linear Models (GLMs)
In practice, there are many different types of outcome variables:
- Continuous: Price of house
- Binary: Win or Lose
- Nominal: Democrat, Republican or Third Party candidate
- Ordered: Movie rating (1 - 5 stars)
- and others…
Predicting each of these outcomes requires a generalized linear model, a broader class of models that generalize the multiple linear regression model
Recommended reading for more details about GLMs: Generalized Linear Models: A Unifying Theory.
Binary outcome (Logistic)
Given \(P(y_i=1|x_i)= \hat{p}_i\hspace{5mm} \text{ and } \hspace{5mm}P(y_i=0|x_i) = 1-\hat{p}_i\)
\[ \log\Big(\frac{\hat{p}_i}{1-\hat{p}_i}\Big) = \hat{\beta}_0 + \hat{\beta}_1 x_{i} + \dots \hat{\beta}_px_p \]
We can calculate \(\hat{p}_i\) by solving the logit equation:
\[ \hat{p}_i = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1 x_{i} + \dots +\hat{\beta}_px_p\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1 x_{i} + \dots +\hat{\beta}_px_p\}} \]
Binary outcome (Logistic)
Suppose we consider \(y=0\) the baseline category such that
\[ P(y_i=1|x_i) = \hat{p}_{i1} \hspace{2mm} \text{ and } \hspace{2mm} P(y_i=0|x_i) = \hat{p}_{i0} \]
Then the logistic regression model is
\[ \log\bigg(\frac{\hat{p}_{i1}}{1- \hat{p}_{i1}}\bigg) = \log\bigg(\frac{\hat{p}_{i1}}{\hat{p}_{i0}}\bigg) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots \hat{\beta}_px_{ip} \]
Slope, \(\hat{\beta}_j\): When \(x\) increases by one unit, the odds of \(y=1\) versus the baseline \(y=0\) are expected to multiply by a factor of \(\exp\{\hat{\beta}_j\}\)
Intercept, \(\hat{\beta}_0\): When \(x=0\), the predicted odds of \(y=1\) versus the baseline \(y=0\) are \(\exp\{\hat{\beta}_0\}\)
Multinomial outcome variable
Suppose the outcome variable \(y\) is categorical and can take values \(1, 2, \ldots, K\) such that \((K > 2)\)
Multinomial Distribution:
\[ P(y=1) = p_1, P(y=2) = p_2, \ldots, P(y=K) = p_K \]
such that \(\sum\limits_{k=1}^{K} p_k = 1\)
Multinomial Logistic Regression
If we have an predictor variables \(x_1, \ldots, x_p\), then we want to fit a model such that \(P(y = k) = p_k\) is a function of \(x_1, \ldots,x_p\)
Choose a baseline category. Let’s choose \(y=1\). Then,
\[ \log\bigg(\frac{p_{ik}}{p_{i1}}\bigg) = \beta_{0k} + \beta_{1k} x_{i1} + \dots + \beta_{pk}x_{ip} \]
In the multinomial logistic model, we have a separate equation for each category of the outcome relative to the baseline category
- If the outcome has \(K\) possible categories, there will be \(K-1\) equations as part of the multinomial logistic model
Multinomial Logistic Regression
Suppose we have a outcome variable \(y\) that can take three possible outcomes that are coded as “A”, “B”, “C”
Let “A” be the baseline category. Then
\[ \begin{aligned} \log\bigg(\frac{p_{iB}}{p_{iA}}\bigg) &= \beta_{0B} + \beta_{1B}x_{i1} + \dots \beta_{pB}x_{ip} \\[10pt] \log\bigg(\frac{p_{iC}}{p_{iA}}\bigg) &= \beta_{0C} + \beta_{1C} x_{i1} + \dots +\beta_{pC}x_{ip} \end{aligned} \]
Data
NHANES Data
National Health andNutritionExamination Survey is conducted by the National Center for Health Statistics (NCHS)
The goal is to “assess the health and nutritional status of adults and children in the United States”
This survey includes an interview and a physical examination
NHANES Data
We will use the data from the
NHANESR packageContains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years
The data in this package is modified for educational purposes and should not be used for research
Original data can be obtained from the NCHS website for research purposes
Type
?NHANESin console to see list of variables and definitions
Variables
Goal: Use a person’s age and whether they do regular physical activity to predict their self-reported health rating.
Outcome:
HealthGen: Self-reported rating of participant’s health in general. Excellent, Vgood, Good, Fair, or Poor.
Predictors:
Age: Age at time of screening (in years). Participants 80 or older were recorded as 80.PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities.
The data
nhanes_adult <- NHANES |>
filter(Age >= 18) |>
select(HealthGen, Age, PhysActive) |>
filter(!(is.na(HealthGen))) |>
mutate(obs_num = 1:n())glimpse(nhanes_adult)Rows: 6,710
Columns: 4
$ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood, …
$ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56,…
$ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, …
$ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
Exploratory data analysis

Exploratory data analysis

Fitting a multinomial logistic regression model
Model in R
Use the multinom function from the nnet R package
library(nnet)
health_fit <- multinom(HealthGen ~ Age + PhysActive,
data = nhanes_adult)Use code chunk option #| results: hide to suppress convergence output
Model result
health_fitCall:
multinom(formula = HealthGen ~ Age + PhysActive, data = nhanes_adult)
Coefficients:
(Intercept) Age PhysActiveYes
Vgood 1.2053460 0.0009101848 -0.3209047
Good 1.9476261 -0.0023686122 -1.0014925
Fair 0.9145492 0.0030462534 -1.6454297
Poor -1.5211414 0.0221905681 -2.6556343
Residual Deviance: 17588.88
AIC: 17612.88
Tidy model output
tidy(health_fit)# A tibble: 12 × 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Vgood (Intercept) 1.21 0.145 8.33 8.42e-17
2 Vgood Age 0.000910 0.00246 0.369 7.12e- 1
3 Vgood PhysActiveYes -0.321 0.0929 -3.45 5.51e- 4
4 Good (Intercept) 1.95 0.141 13.8 1.39e-43
5 Good Age -0.00237 0.00242 -0.977 3.29e- 1
6 Good PhysActiveYes -1.00 0.0901 -11.1 1.00e-28
7 Fair (Intercept) 0.915 0.164 5.57 2.61e- 8
8 Fair Age 0.00305 0.00288 1.06 2.90e- 1
9 Fair PhysActiveYes -1.65 0.107 -15.3 5.69e-53
10 Poor (Intercept) -1.52 0.290 -5.24 1.62e- 7
11 Poor Age 0.0222 0.00491 4.52 6.11e- 6
12 Poor PhysActiveYes -2.66 0.236 -11.3 1.75e-29
Tidy model output, with CI
tidy(health_fit, conf.int = TRUE)# A tibble: 12 × 8
y.level term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Vgood (Intercept) 1.21e+0 0.145 8.33 8.42e-17 0.922 1.49
2 Vgood Age 9.10e-4 0.00246 0.369 7.12e- 1 -0.00392 0.00574
3 Vgood PhysActiveY… -3.21e-1 0.0929 -3.45 5.51e- 4 -0.503 -0.139
4 Good (Intercept) 1.95e+0 0.141 13.8 1.39e-43 1.67 2.22
5 Good Age -2.37e-3 0.00242 -0.977 3.29e- 1 -0.00712 0.00238
6 Good PhysActiveY… -1.00e+0 0.0901 -11.1 1.00e-28 -1.18 -0.825
7 Fair (Intercept) 9.15e-1 0.164 5.57 2.61e- 8 0.592 1.24
8 Fair Age 3.05e-3 0.00288 1.06 2.90e- 1 -0.00260 0.00869
9 Fair PhysActiveY… -1.65e+0 0.107 -15.3 5.69e-53 -1.86 -1.43
10 Poor (Intercept) -1.52e+0 0.290 -5.24 1.62e- 7 -2.09 -0.952
11 Poor Age 2.22e-2 0.00491 4.52 6.11e- 6 0.0126 0.0318
12 Poor PhysActiveY… -2.66e+0 0.236 -11.3 1.75e-29 -3.12 -2.19
Neatly display model output
| y.level | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|
| Vgood | (Intercept) | 1.205 | 0.145 | 8.325 | 0.000 | 0.922 | 1.489 |
| Vgood | Age | 0.001 | 0.002 | 0.369 | 0.712 | -0.004 | 0.006 |
| Vgood | PhysActiveYes | -0.321 | 0.093 | -3.454 | 0.001 | -0.503 | -0.139 |
| Good | (Intercept) | 1.948 | 0.141 | 13.844 | 0.000 | 1.672 | 2.223 |
| Good | Age | -0.002 | 0.002 | -0.977 | 0.329 | -0.007 | 0.002 |
| Good | PhysActiveYes | -1.001 | 0.090 | -11.120 | 0.000 | -1.178 | -0.825 |
| Fair | (Intercept) | 0.915 | 0.164 | 5.566 | 0.000 | 0.592 | 1.237 |
| Fair | Age | 0.003 | 0.003 | 1.058 | 0.290 | -0.003 | 0.009 |
| Fair | PhysActiveYes | -1.645 | 0.107 | -15.319 | 0.000 | -1.856 | -1.435 |
| Poor | (Intercept) | -1.521 | 0.290 | -5.238 | 0.000 | -2.090 | -0.952 |
| Poor | Age | 0.022 | 0.005 | 4.522 | 0.000 | 0.013 | 0.032 |
| Poor | PhysActiveYes | -2.656 | 0.236 | -11.275 | 0.000 | -3.117 | -2.194 |
Fair vs. Excellent Health
The baseline category for the model is Excellent.
. . .
The model equation for the log-odds a person rates themselves as having “Fair” health vs. “Excellent” is
\[ \log\Big(\frac{\hat{p}_{Fair}}{\hat{p}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive} \]
Interpretations
\[ \log\Big(\frac{\hat{p}_{Fair}}{\hat{p}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive} \]
For each additional year in age, the odds a person rates themselves as having fair health versus excellent health are expected to multiply by 1.003 (exp(0.003)), holding physical activity constant.
. . .
The odds a person who does physical activity will rate themselves as having fair health versus excellent health are expected to be 0.193 (exp(-1.645)) times the odds for a person who doesn’t do physical activity, holding age constant.
Interpretations
\[ \log\Big(\frac{\hat{p}_{Fair}}{\hat{p}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive} \]
The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)).
. . .
Need to mean-center age for the intercept to have a meaningful interpretation!
Practice
| y.level | term | estimate |
|---|---|---|
| Vgood | (Intercept) | 1.205 |
| Vgood | Age | 0.001 |
| Vgood | PhysActiveYes | -0.321 |
| Good | (Intercept) | 1.948 |
| Good | Age | -0.002 |
| Good | PhysActiveYes | -1.001 |
| Fair | (Intercept) | 0.915 |
| Fair | Age | 0.003 |
| Fair | PhysActiveYes | -1.645 |
| Poor | (Intercept) | -1.521 |
| Poor | Age | 0.022 |
| Poor | PhysActiveYes | -2.656 |
Write the equation for Very good (
Vgood) versus Excellent.Interpret
Agein the context of the data.Interpret
PhysActiveYesin the context of the data.
Recap
Introduce multinomial logistic regression
Interpret model coefficients