# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(knitr)
library(kableExtra) # for table embellishments
library(Stat2Data) # for empirical logit
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Logistic Regression: Model comparison + inference
Announcements
Next project milestone: Analysis and draft in April 14 lab
Team Feedback (email from TEAMMATES) due Tuesday, April 8 at 11:59pm (check email)
HW 04 due Tuesday, April 8 - released after class
Statistics experience due April 15
Topics
Comparing logistic regression models using
Drop-in-deviance test
AIC
BIC
Inference for a single model coefficient
Computational setup
Data: Risk of coronary heart disease
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.
high_risk:
- 1: High risk of having heart disease in next 10 years
- 0: Not high risk of having heart disease in next 10 years

age: Age at exam time (in years)

education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College

currentSmoker: 0 = nonsmoker, 1 = smoker

totChol: Total cholesterol (in mg/dL)
Review: Drop-in-deviance test
Which model do we choose?
Reduced model (without education):

| term | estimate |
|---|---|
| (Intercept) | -6.673 |
| age | 0.082 |
| totChol | 0.002 |
| currentSmoker1 | 0.443 |
Full model (with education):

| term | estimate |
|---|---|
| (Intercept) | -6.456 |
| age | 0.080 |
| totChol | 0.002 |
| currentSmoker1 | 0.445 |
| education2 | -0.270 |
| education3 | -0.232 |
| education4 | -0.035 |
We will use the Drop-in-deviance test to compare the two models
Drop-in-deviance test
Hypotheses
\[ \begin{aligned} H_0&: \beta_{ed2} = \beta_{ed3} = \beta_{ed4} = 0 \\ H_a&: \text{ at least one }\beta_j \text{ is not } 0 \end{aligned} \]
. . .
Test in R
## reduced and full models
high_risk_fit_reduced <- glm(high_risk ~ age + totChol + currentSmoker,
data = heart_disease, family = "binomial")
high_risk_fit_full <- glm(high_risk ~ age + totChol + currentSmoker +
education,
data = heart_disease, family = "binomial")
## drop-in-deviance test
anova(high_risk_fit_reduced, high_risk_fit_full, test = "Chisq") |>
tidy() |>
  kable(digits = 3)

Drop-in-deviance test
Test output:
| term | df.residual | residual.deviance | df | deviance | p.value |
|---|---|---|---|---|---|
| high_risk ~ age + totChol + currentSmoker | 4082 | 3224.812 | NA | NA | NA |
| high_risk ~ age + totChol + currentSmoker + education | 4079 | 3217.600 | 3 | 7.212 | 0.065 |
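The p-value in this output can also be computed directly: the drop-in-deviance statistic is compared to a \(\chi^2\) distribution with degrees of freedom equal to the number of coefficients being tested (here, 3). A minimal sketch, using the deviance and df from the anova() output above:

```r
# p-value for the drop-in-deviance test by hand:
# P(chi-square with 3 df > 7.212), using values from the anova() output
pchisq(7.212, df = 3, lower.tail = FALSE)
```

This matches the p-value of 0.065 reported in the table.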
Model comparison using AIC and BIC
AIC & BIC
Estimators of prediction error and relative quality of models:
Akaike’s Information Criterion (AIC)¹: \[AIC = -2\log L + 2(p+1)\]
Schwarz’s Bayesian Information Criterion (BIC)²: \[ BIC = -2\log L + \log(n)\times(p+1)\]
AIC & BIC
\[ \begin{aligned} & AIC = \color{blue}{-2\log L} \color{black}{+ 2(p+1)} \\ & BIC = \color{blue}{-2\log L} + \color{black}{\log(n)\times(p+1)} \end{aligned} \]
. . .
First term: Decreases as \(p\) increases, because adding predictors can only increase the maximized likelihood \(L\)
AIC & BIC
\[ \begin{aligned} & AIC = -2\log L + \color{blue}{2(p+1)} \\ & BIC = -2\log L + \color{blue}{\log(n)\times(p+1)} \end{aligned} \]
Second term: Increases as p increases
Using AIC & BIC
\[ \begin{aligned} & AIC = -2\log L + \color{red}{2(p+1)} \\ & BIC = -2 \log L + \color{red}{\log(n)\times(p+1)} \end{aligned} \]
Choose model with the smaller value of AIC or BIC
If \(n \geq 8\), the penalty for BIC is larger than that of AIC, so BIC tends to favor more parsimonious models (i.e. models with fewer terms)
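A quick check of this threshold: the per-parameter penalty is 2 for AIC and \(\log(n)\) for BIC, and \(\log(n)\) exceeds 2 once \(n \geq 8\) (since \(e^2 \approx 7.39\)).

```r
# BIC's per-parameter penalty log(n) versus AIC's penalty of 2
log(7)    # just under 2
log(8)    # just over 2
```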
AIC from the glance() function
Let’s look at the AIC for the model that includes age, totChol, and currentSmoker
glance(high_risk_fit_reduced)$AIC

[1] 3232.812
. . .
Calculating AIC
-2 * glance(high_risk_fit_reduced)$logLik + 2 * (3 + 1)

[1] 3232.812
Comparing the models using AIC
Let’s compare the full and reduced models using AIC.
glance(high_risk_fit_reduced)$AIC

[1] 3232.812

glance(high_risk_fit_full)$AIC

[1] 3231.6
Based on AIC, which model would you choose?
Comparing the models using BIC
Let’s compare the full and reduced models using BIC
glance(high_risk_fit_reduced)$BIC

[1] 3258.074

glance(high_risk_fit_full)$BIC

[1] 3275.807
Based on BIC, which model would you choose?
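Like AIC, BIC can be computed by hand from the log likelihood. A sketch for the reduced model, assuming \(n = 4086\) observations (the residual df from the anova() output, 4082, plus the 4 estimated parameters):

```r
# BIC by hand for the reduced model: -2 log L + log(n) * (p + 1)
neg2_loglik <- 3224.812   # -2 log L (residual deviance) for the reduced model
n <- 4086                 # observations used to fit the model
p <- 3                    # age, totChol, currentSmoker1
neg2_loglik + log(n) * (p + 1)
```

This matches the glance() value of 3258.074 above.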
Application exercise
Estimating coefficients
Statistical model
The form of the statistical model for logistic regression is
\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \]
where \(\pi\) is the probability \(Y = 1\).
. . .
Notice there is no error term when writing the statistical model for logistic regression. Why?
- Recall that the statistical model is the “data-generating” model
- Each individual observed \(Y\) is generated from a Bernoulli distribution, \(Bernoulli(\pi)\) (similarly, we can think of the sum of \(n\) independent observed \(Y\)’s as generated from a \(Binomial(n, \pi)\) distribution)
- Therefore, the randomness is not produced by an error term but rather in the distribution used to generate \(Y\)
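This data-generating view can be simulated directly. A minimal sketch with hypothetical coefficients (not the fitted Framingham model): the randomness enters through rbinom(), not an additive error term.

```r
set.seed(210)
x <- c(40, 45, 50, 55, 60)                   # hypothetical ages
log_odds <- -6.7 + 0.08 * x                  # hypothetical linear predictor
pi <- exp(log_odds) / (1 + exp(log_odds))    # P(Y = 1) for each observation
y <- rbinom(length(x), size = 1, prob = pi)  # each Y_i ~ Bernoulli(pi_i)
```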
Estimating coefficients
Recall the log likelihood function
\[ \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \]
where
\(\hat{\pi}_i = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p\}}\)
. . .
The coefficients \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) are estimated using maximum likelihood estimation
Basic idea: Find the values of \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) that give the observed data the maximum probability of occurring
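A sketch of what maximization means, using toy data and a hypothetical single-predictor model \(\log(\pi/(1-\pi)) = b_0 + b_1 x\): coefficients that track the data give a higher log likelihood than coefficients that ignore it.

```r
y <- c(0, 0, 1, 1, 1)
x <- c(1, 2, 3, 4, 5)
log_lik <- function(b0, b1) {
  pi_hat <- plogis(b0 + b1 * x)  # inverse logit
  sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))
}
log_lik(-5, 2)   # candidate that tracks the data
log_lik(0, 0)    # ignores the data: pi_hat = 0.5 for everyone
```

glm() searches for the values of \((b_0, b_1)\) that maximize this function, using the iterative algorithm shown next.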
Iterative model fitting in R
high_risk_fit_reduced <- glm(high_risk ~ age + totChol + currentSmoker,
                             data = heart_disease, family = "binomial",
                             # print each iteration
                             control = glm.control(trace = TRUE))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... on entry
Deviance = 3268.418 Iterations - 1
NULL
Deviance = 3225.648 Iterations - 2
[1] -4.4798060788 0.0496514919 0.0008626177 0.2457937103
Deviance = 3224.813 Iterations - 3
[1] -6.394448772 0.078826360 0.001808306 0.417970457
Deviance = 3224.812 Iterations - 4
[1] -6.665531647 0.082366229 0.001992166 0.442315330
Deviance = 3224.812 Iterations - 5
[1] -6.673085394 0.082464718 0.001997547 0.442980128
# stop printing for future models
untrace(glm.fit)

Inference for coefficients
Inference for coefficients
There are two approaches for testing coefficients in logistic regression
Drop-in-deviance test. Use to test…
- a single coefficient
- a categorical predictor with 3+ levels
- a group of predictor variables
(Wald) hypothesis test. Use to test…
- a single coefficient
Hypothesis test for \(\beta_j\)
Hypotheses: \(H_0: \beta_j = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_j \neq 0\), given the other variables in the model
. . .
(Wald) Test Statistic: \[z = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}\]
. . .
P-value: \(P(|Z| > |z|)\), where \(Z \sim N(0, 1)\), the Standard Normal distribution
Confidence interval for \(\beta_j\)
We can calculate the C% confidence interval for \(\beta_j\) as the following:
\[ \hat{\beta}_j \pm z^* \times SE(\hat{\beta}_j) \]
where \(z^*\) is calculated from the \(N(0,1)\) distribution
. . .
This is an interval for the change in the log-odds for every one unit increase in \(x_j\)
Interpretation in terms of the odds
Exponentiating the interval endpoints gives a confidence interval for the multiplicative change in the odds for every one unit increase in \(x_j\):
\[ \exp\{\hat{\beta}_j \pm z^* \times SE(\hat{\beta}_j)\} \]
. . .
Interpretation: We are \(C\%\) confident that for every one unit increase in \(x_j\), the odds multiply by a factor of \(\exp\{\hat{\beta}_j - z^*\times SE(\hat{\beta}_j)\}\) to \(\exp\{\hat{\beta}_j + z^* \times SE(\hat{\beta}_j)\}\), holding all else constant.
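A sketch of this calculation for the age coefficient from the fitted model, using the Wald interval \(\hat{\beta} \pm z^* \times SE(\hat{\beta})\); note that tidy() reports a profile-likelihood interval, so its endpoints differ slightly.

```r
# 95% Wald CI for the age coefficient, exponentiated to the odds scale
# (estimate and SE taken from the fitted model summary)
beta_age <- 0.08246
se_age   <- 0.00575
z_star   <- qnorm(0.975)
exp(beta_age + c(-1, 1) * z_star * se_age)
```

Interpretation: for each additional year of age, the odds of being high risk multiply by a factor between roughly 1.074 and 1.098, holding all else constant.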
Coefficient for age
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.67309 | 0.37815 | -17.64688 | 0.00000 | -7.42283 | -5.94008 |
| age | 0.08246 | 0.00575 | 14.34353 | 0.00000 | 0.07127 | 0.09382 |
| totChol | 0.00200 | 0.00103 | 1.94024 | 0.05235 | -0.00003 | 0.00401 |
| currentSmoker1 | 0.44298 | 0.09359 | 4.73305 | 0.00000 | 0.25998 | 0.62699 |
. . .
Hypotheses:
\[ H_0: \beta_{age} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{age} \neq 0 \], given total cholesterol and smoking status are in the model
Coefficient for age
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.67309 | 0.37815 | -17.64688 | 0.00000 | -7.42283 | -5.94008 |
| age | 0.08246 | 0.00575 | 14.34353 | 0.00000 | 0.07127 | 0.09382 |
| totChol | 0.00200 | 0.00103 | 1.94024 | 0.05235 | -0.00003 | 0.00401 |
| currentSmoker1 | 0.44298 | 0.09359 | 4.73305 | 0.00000 | 0.25998 | 0.62699 |
Test statistic:
\[z = \frac{0.08246 - 0}{0.00575} = 14.34 \]
Coefficient for age
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.67309 | 0.37815 | -17.64688 | 0.00000 | -7.42283 | -5.94008 |
| age | 0.08246 | 0.00575 | 14.34353 | 0.00000 | 0.07127 | 0.09382 |
| totChol | 0.00200 | 0.00103 | 1.94024 | 0.05235 | -0.00003 | 0.00401 |
| currentSmoker1 | 0.44298 | 0.09359 | 4.73305 | 0.00000 | 0.25998 | 0.62699 |
P-value:
\[ P(|Z| > |14.34|) \approx 0 \]
. . .
2 * pnorm(14.34, lower.tail = FALSE)

[1] 1.230554e-46
Coefficient for age
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.67309 | 0.37815 | -17.64688 | 0.00000 | -7.42283 | -5.94008 |
| age | 0.08246 | 0.00575 | 14.34353 | 0.00000 | 0.07127 | 0.09382 |
| totChol | 0.00200 | 0.00103 | 1.94024 | 0.05235 | -0.00003 | 0.00401 |
| currentSmoker1 | 0.44298 | 0.09359 | 4.73305 | 0.00000 | 0.25998 | 0.62699 |
Conclusion:
The p-value is very small, so we reject \(H_0\). The data provide sufficient evidence that age is a statistically significant predictor of whether someone is at high risk of having heart disease, after accounting for total cholesterol and smoking status.
CI for age
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.673 | 0.378 | -17.647 | 0.000 | -7.423 | -5.940 |
| age | 0.082 | 0.006 | 14.344 | 0.000 | 0.071 | 0.094 |
| totChol | 0.002 | 0.001 | 1.940 | 0.052 | 0.000 | 0.004 |
| currentSmoker1 | 0.443 | 0.094 | 4.733 | 0.000 | 0.260 | 0.627 |
Interpret the 95% confidence interval for age in terms of the odds of being high risk for heart disease.
Recap
Comparing logistic regression models using
Drop-in-deviance test
AIC
BIC
Inference for a single model coefficient
Footnotes
1. Akaike, Hirotugu. “A new look at the statistical model identification.” IEEE Transactions on Automatic Control 19.6 (1974): 716-723.
2. Schwarz, Gideon. “Estimating the dimension of a model.” The Annals of Statistics (1978): 461-464.