Apr 01, 2025
Next project milestone: Analysis and draft due in the April 14 lab
Team Feedback (email from TEAMMATES) due Tuesday, April 8 at 11:59pm (check email)
HW 04 due Tuesday, April 8 - released after class
Statistics experience due April 15
Comparing logistic regression models using
Drop-in-deviance test
AIC
BIC
Inference for a single model coefficient
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.
high_risk: 1 = high risk of heart disease, 0 = not high risk (the response)
age: Age at exam time (in years)
education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College
currentSmoker: 0 = nonsmoker, 1 = smoker
totChol: Total cholesterol (in mg/dL)
Reduced model:

| term | estimate |
|---|---|
| (Intercept) | -6.673 |
| age | 0.082 |
| totChol | 0.002 |
| currentSmoker1 | 0.443 |
Full model (adds education):

| term | estimate |
|---|---|
| (Intercept) | -6.456 |
| age | 0.080 |
| totChol | 0.002 |
| currentSmoker1 | 0.445 |
| education2 | -0.270 |
| education3 | -0.232 |
| education4 | -0.035 |
We will use the Drop-in-deviance test to compare the two models
Hypotheses
\[ \begin{aligned} H_0&: \beta_{ed2} = \beta_{ed3} = \beta_{ed4} = 0 \\ H_a&: \text{ at least one }\beta_j \text{ is not } 0 \end{aligned} \]
Test in R
## reduced and full models
high_risk_fit_reduced <- glm(high_risk ~ age + totChol + currentSmoker,
data = heart_disease, family = "binomial")
high_risk_fit_full <- glm(high_risk ~ age + totChol + currentSmoker +
education,
data = heart_disease, family = "binomial")
## drop-in-deviance test
anova(high_risk_fit_reduced, high_risk_fit_full, test = "Chisq") |>
tidy() |>
kable(digits = 3)
| term | df.residual | residual.deviance | df | deviance | p.value |
|---|---|---|---|---|---|
| high_risk ~ age + totChol + currentSmoker | 4082 | 3224.812 | NA | NA | NA |
| high_risk ~ age + totChol + currentSmoker + education | 4079 | 3217.600 | 3 | 7.212 | 0.065 |
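The statistic and p-value in this table can also be sketched by hand: the test statistic is the difference in residual deviances, compared to a chi-square distribution with degrees of freedom equal to the difference in residual degrees of freedom.

```r
# Drop-in-deviance test by hand, using values from the table above
dev_reduced <- 3224.812   # residual deviance, reduced model
dev_full    <- 3217.600   # residual deviance, full model
test_stat   <- dev_reduced - dev_full        # 7.212
df_diff     <- 4082 - 4079                   # 3 (one df per education term)
p_value     <- pchisq(test_stat, df = df_diff, lower.tail = FALSE)
round(p_value, 3)                            # matches the 0.065 above
```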
Estimators of prediction error and relative quality of models:
Akaike’s Information Criterion (AIC): \[AIC = -2\log L + 2(p+1)\]
Schwarz’s Bayesian Information Criterion (BIC): \[ BIC = -2\log L + \log(n)\times(p+1)\]
\[ \begin{aligned} & AIC = \color{blue}{-2\log L} \color{black}{+ 2(p+1)} \\ & BIC = \color{blue}{-2\log L} + \color{black}{\log(n)\times(p+1)} \end{aligned} \]
First Term: Decreases as p increases
\[ \begin{aligned} & AIC = -2\log L + \color{blue}{2(p+1)} \\ & BIC = -2\log L + \color{blue}{\log(n)\times(p+1)} \end{aligned} \]
Second term: Increases as p increases
\[ \begin{aligned} & AIC = -2\log L + \color{red}{2(p+1)} \\ & BIC = -2 \log L + \color{red}{\log(n)\times(p+1)} \end{aligned} \]
Choose model with the smaller value of AIC or BIC
If \(n \geq 8\), the penalty for BIC is larger than that of AIC, so BIC tends to favor more parsimonious models (i.e. models with fewer terms)
glance() function

Let’s look at the AIC for the model that includes age, totChol, and currentSmoker
Let’s compare the full and reduced models using AIC.
Based on AIC, which model would you choose?
Let’s compare the full and reduced models using BIC.
Based on BIC, which model would you choose?
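For ungrouped binary data the residual deviance equals \(-2\log L\), so both criteria can be sketched directly from the deviance table earlier. This sketch assumes \(n = 4086\) observations (the reduced model’s 4082 residual df plus its 4 estimated coefficients):

```r
# AIC = -2 log L + 2(p+1); BIC = -2 log L + log(n)*(p+1)
# For binary logistic regression, residual deviance = -2 log L
n <- 4086                                # assumed: 4082 residual df + 4 coefficients
aic_reduced <- 3224.812 + 2 * 4          # intercept + 3 predictor terms
aic_full    <- 3217.600 + 2 * 7          # intercept + 6 predictor terms
bic_reduced <- 3224.812 + log(n) * 4
bic_full    <- 3217.600 + log(n) * 7
c(aic_reduced, aic_full)
c(bic_reduced, bic_full)
```

In this sketch AIC is slightly smaller for the full model, while BIC’s larger penalty favors the reduced model, illustrating how the two criteria can disagree.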
The form of the statistical model for logistic regression is
\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \]
where \(\pi\) is the probability \(Y = 1\).
Notice there is no error term when writing the statistical model for logistic regression. Why?
Recall the log likelihood function
\[ \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \]
where
\(\hat{\pi} = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p\}}\)
The coefficients \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) are estimated using maximum likelihood estimation
Basic idea: Find the values of \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) that give the observed data the maximum probability of occurring
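To make the idea concrete, here is a toy sketch (on simulated data, not heart_disease) that minimizes the negative log likelihood from the formula above with optim() and checks that the result matches glm():

```r
# Maximum likelihood estimation for logistic regression "by hand"
set.seed(210)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-1 + 0.8 * x))

# negative log likelihood, built from the log likelihood formula above
neg_log_lik <- function(beta) {
  pi_hat <- plogis(beta[1] + beta[2] * x)
  -sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))
}

mle_coefs <- optim(c(0, 0), neg_log_lik, method = "BFGS")$par
glm_coefs <- coef(glm(y ~ x, family = "binomial"))
# the two sets of estimates agree (up to numerical tolerance)
```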
high_risk_fit_reduced <- glm(high_risk ~ age + totChol + currentSmoker,
                             data = heart_disease, family = "binomial",
                             # print each iteration
                             control = glm.control(trace = TRUE))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... on entry
Deviance = 3268.418 Iterations - 1
NULL
Deviance = 3225.648 Iterations - 2
[1] -4.4798060788 0.0496514919 0.0008626177 0.2457937103
Deviance = 3224.813 Iterations - 3
[1] -6.394448772 0.078826360 0.001808306 0.417970457
Deviance = 3224.812 Iterations - 4
[1] -6.665531647 0.082366229 0.001992166 0.442315330
Deviance = 3224.812 Iterations - 5
[1] -6.673085394 0.082464718 0.001997547 0.442980128
There are two approaches for testing coefficients in logistic regression
Drop-in-deviance test. Use to test whether a group of coefficients (e.g., all the terms for a multi-level categorical predictor) are simultaneously equal to 0
(Wald) hypothesis test. Use to test a single coefficient
Hypotheses: \(H_0: \beta_j = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_j \neq 0\), given the other variables in the model
(Wald) Test Statistic: \[z = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}\]
P-value: \(P(|Z| > |z|)\), where \(Z \sim N(0, 1)\), the Standard Normal distribution
We can calculate the C% confidence interval for \(\beta_j\) as the following:
\[ \hat{\beta}_j \pm z^* \times SE(\hat{\beta}_j) \]
where \(z^*\) is calculated from the \(N(0,1)\) distribution
Note
This is an interval for the change in the log-odds for every one unit increase in \(x_j\)
The change in odds for every one unit increase in \(x_j\).
\[ \exp\{\hat{\beta}_j \pm z^* \times SE(\hat{\beta}_j)\} \]
Interpretation: We are \(C\%\) confident that for every one unit increase in \(x_j\), the odds multiply by a factor of \(\exp\{\hat{\beta}_j - z^*\times SE(\hat{\beta}_j)\}\) to \(\exp\{\hat{\beta}_j + z^* \times SE(\hat{\beta}_j)\}\), holding all else constant.
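As a sketch, a 95% interval on the odds scale can be computed directly from an estimate and standard error (here using the values for age from the model output, 0.08246 and 0.00575). Note that tidy()'s reported interval uses profile likelihood, so it can differ slightly from this Wald-style interval:

```r
# Wald 95% CI for the age coefficient, then on the odds scale
z_star <- qnorm(0.975)                 # ~1.96
est <- 0.08246; se <- 0.00575          # from the model output
ci_log_odds <- est + c(-1, 1) * z_star * se
exp(ci_log_odds)                       # multiplicative change in odds
```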
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.67309 | 0.37815 | -17.64688 | 0.00000 | -7.42283 | -5.94008 |
| age | 0.08246 | 0.00575 | 14.34353 | 0.00000 | 0.07127 | 0.09382 |
| totChol | 0.00200 | 0.00103 | 1.94024 | 0.05235 | -0.00003 | 0.00401 |
| currentSmoker1 | 0.44298 | 0.09359 | 4.73305 | 0.00000 | 0.25998 | 0.62699 |
Hypotheses:
\[ H_0: \beta_{age} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{age} \neq 0 \], given total cholesterol and smoking status are in the model
Test statistic:
\[z = \frac{0.08246 - 0}{0.00575} = 14.34 \]
P-value:
\[ P(|Z| > |14.34|) \approx 0 \]
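The test statistic and p-value above can be reproduced in a couple of lines, using the estimate and standard error from the table:

```r
# Wald test for age: z statistic and two-sided p-value
z <- (0.08246 - 0) / 0.00575
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
round(z, 2)     # 14.34
# p_value is effectively 0 (far below any conventional threshold)
```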
Conclusion:
The p-value is very small, so we reject \(H_0\). The data provide sufficient evidence that age is a statistically significant predictor of whether someone is at high risk of heart disease, after accounting for total cholesterol and smoking status.
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.673 | 0.378 | -17.647 | 0.000 | -7.423 | -5.940 |
| age | 0.082 | 0.006 | 14.344 | 0.000 | 0.071 | 0.094 |
| totChol | 0.002 | 0.001 | 1.940 | 0.052 | 0.000 | 0.004 |
| currentSmoker1 | 0.443 | 0.094 | 4.733 | 0.000 | 0.260 | 0.627 |
Interpret the 95% confidence interval for age in terms of the odds of being high risk for heart disease.
Comparing logistic regression models using
Drop-in-deviance test
AIC
BIC
Inference for a single model coefficient