Logistic Regression: Model comparison + inference

Author

Prof. Maria Tackett

Published

Apr 01, 2025

Announcements

  • Next project milestone: Analysis and draft in April 14 lab

  • Team Feedback (email from TEAMMATES) due Tuesday, April 8 at 11:59pm (check email)

  • HW 04 due Tuesday, April 8 - released after class

  • Statistics experience due April 15

Topics

  • Comparing logistic regression models using

    • Drop-in-deviance test

    • AIC

    • BIC

  • Inference for a single model coefficient

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(knitr)
library(kableExtra)  # for table embellishments
library(Stat2Data)   # for empirical logit

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Data: Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

  • high_risk:

    • 1: High risk of having heart disease in next 10 years
    • 0: Not high risk of having heart disease in next 10 years
  • age: Age at exam time (in years)

  • education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College

  • currentSmoker: 0 = nonsmoker, 1 = smoker

  • totChol: Total cholesterol (in mg/dL)

Review: Drop-in-deviance test

Which model do we choose?

Model 1

term            estimate
(Intercept)       -6.673
age                0.082
totChol            0.002
currentSmoker1     0.443

Model 2

term            estimate
(Intercept)       -6.456
age                0.080
totChol            0.002
currentSmoker1     0.445
education2        -0.270
education3        -0.232
education4        -0.035

We will use the drop-in-deviance test to compare the two models.

Drop-in-deviance test

Hypotheses

\[ \begin{aligned} H_0&: \beta_{ed2} = \beta_{ed3} = \beta_{ed4} = 0 \\ H_a&: \text{ at least one }\beta_j \text{ is not } 0 \end{aligned} \]


Test in R

## reduced and full models
high_risk_fit_reduced <- glm(high_risk ~ age + totChol + currentSmoker,
              data = heart_disease, family = "binomial")

high_risk_fit_full <- glm(high_risk ~ age + totChol + currentSmoker +
                            education,
              data = heart_disease, family = "binomial")

## drop-in-deviance test 
anova(high_risk_fit_reduced, high_risk_fit_full, test = "Chisq") |>
  tidy() |> 
  kable(digits = 3)

Drop-in-deviance test

Results

term                                                   df.residual  residual.deviance  df  deviance  p.value
high_risk ~ age + totChol + currentSmoker                     4082           3224.812  NA        NA       NA
high_risk ~ age + totChol + currentSmoker + education         4079           3217.600   3     7.212    0.065
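
Reading the test off the output above: the drop-in-deviance statistic is the difference in residual deviances, compared to a \(\chi^2\) distribution with degrees of freedom equal to the number of coefficients being tested:

\[ \begin{aligned} G &= 3224.812 - 3217.600 = 7.212 \\ df &= 4082 - 4079 = 3 \\ \text{p-value} &= P(\chi^2_3 > 7.212) \approx 0.065 \end{aligned} \]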

Model comparison using AIC and BIC

AIC & BIC

Estimators of prediction error and relative quality of models:

Akaike’s Information Criterion (AIC)¹: \[AIC = -2\log L + 2(p+1)\]

Schwarz’s Bayesian Information Criterion (BIC)²: \[ BIC = -2\log L + \log(n)\times(p+1)\]

AIC & BIC

\[ \begin{aligned} & AIC = \color{blue}{-2\log L} \color{black}{+ 2(p+1)} \\ & BIC = \color{blue}{-2\log L} + \color{black}{\log(n)\times(p+1)} \end{aligned} \]



First term: Decreases as p increases (adding terms cannot decrease the maximized likelihood)

AIC & BIC

\[ \begin{aligned} & AIC = -2\log L + \color{blue}{2(p+1)} \\ & BIC = -2\log L + \color{blue}{\log(n)\times(p+1)} \end{aligned} \]


Second term: Increases as p increases (this is the penalty for model complexity)

Using AIC & BIC

\[ \begin{aligned} & AIC = -2\log L + \color{red}{2(p+1)} \\ & BIC = -2 \log L + \color{red}{\log(n)\times(p+1)} \end{aligned} \]

  • Choose model with the smaller value of AIC or BIC

  • If \(n \geq 8\), then \(\log(n) > 2\), so the penalty for BIC is larger than that of AIC and BIC tends to favor more parsimonious models (i.e., models with fewer terms)
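
To see the size of the two penalties for the models above: the reduced model has \(p = 3\) predictors, and its residual df (4082) suggests roughly \(n = 4086\) observations, so

\[ \begin{aligned} \text{AIC penalty} &= 2(p+1) = 2 \times 4 = 8 \\ \text{BIC penalty} &= \log(n) \times (p+1) \approx 8.315 \times 4 \approx 33.3 \end{aligned} \]

Here BIC charges roughly four times as much per term as AIC.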

AIC from the glance() function

Let’s look at the AIC for the model that includes age, totChol, and currentSmoker

glance(high_risk_fit_reduced)$AIC
[1] 3232.812



Calculating AIC

-2 * glance(high_risk_fit_reduced)$logLik + 2 * (3 + 1)
[1] 3232.812

Comparing the models using AIC

Let’s compare the full and reduced models using AIC.

glance(high_risk_fit_reduced)$AIC
[1] 3232.812
glance(high_risk_fit_full)$AIC
[1] 3231.6


Based on AIC, which model would you choose?

Comparing the models using BIC

Let’s compare the full and reduced models using BIC

glance(high_risk_fit_reduced)$BIC
[1] 3258.074
glance(high_risk_fit_full)$BIC
[1] 3275.807
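
As with AIC, the BIC values above can be reproduced from the formula. A minimal sketch, shown on a built-in dataset (mtcars) so the code is self-contained; the same pattern applies to high_risk_fit_reduced and high_risk_fit_full:

```r
# Sketch: computing BIC by hand, mirroring the earlier AIC calculation.
# Uses a built-in dataset so the example runs on its own.
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

p <- length(coef(fit)) - 1  # number of predictors
n <- nobs(fit)              # number of observations

manual_bic <- -2 * as.numeric(logLik(fit)) + log(n) * (p + 1)
manual_bic
BIC(fit)  # matches the manual calculation
```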


Based on BIC, which model would you choose?

Application exercise

Estimating coefficients

Statistical model

The form of the statistical model for logistic regression is

\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \]

where \(\pi\) is the probability \(Y = 1\).


Notice there is no error term when writing the statistical model for logistic regression. Why?

  • Recall that the statistical model is the “data-generating” model
  • Each individual observed \(Y\) is generated from a Bernoulli distribution, \(Bernoulli(\pi)\) (similarly, we can think of the \(n\) observed \(Y\)’s as generated from a Binomial distribution, \(Binomial(n, \pi)\))
  • Therefore, the randomness is not produced by an error term but rather in the distribution used to generate \(Y\)
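
A quick simulation sketch of this data-generating view (all names and coefficient values here are invented for illustration):

```r
# Sketch: Y is generated from Bernoulli(pi), with pi linked to X
# through the logistic model; the randomness comes from the Bernoulli
# draws, not from an additive error term.
set.seed(42)
n <- 1000
x <- rnorm(n)
pi <- plogis(-1 + 0.5 * x)           # inverse logit of the linear predictor
y <- rbinom(n, size = 1, prob = pi)  # one Bernoulli draw per observation

table(y)  # y takes only the values 0 and 1
```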

Estimating coefficients

Recall the log likelihood function

\[ \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \]

where

\(\hat{\pi} = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p\}}\)


  • The coefficients \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) are estimated using maximum likelihood estimation

  • Basic idea: Find the values of \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) that give the observed data the maximum probability of occurring
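
One way to make this concrete: evaluate the log likelihood at the fitted coefficients and compare it to logLik(). A minimal sketch on simulated data (all names are illustrative):

```r
# Sketch: the log likelihood evaluated at the fitted coefficients
# equals the maximized value glm found, as reported by logLik().
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.5 * x))

fit <- glm(y ~ x, family = "binomial")
pi_hat <- fitted(fit)  # estimated pi_i for each observation

log_lik <- sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))
log_lik
as.numeric(logLik(fit))  # same value
```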

Iterative model fitting in R

high_risk_fit_reduced <- glm(high_risk ~ age + totChol + currentSmoker,
                             data = heart_disease, family = "binomial",
                             # print each iteration
                             control = glm.control(trace = TRUE))
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... on entry 
Deviance = 3268.418 Iterations - 1
NULL
Deviance = 3225.648 Iterations - 2
[1] -4.4798060788  0.0496514919  0.0008626177  0.2457937103
Deviance = 3224.813 Iterations - 3
[1] -6.394448772  0.078826360  0.001808306  0.417970457
Deviance = 3224.812 Iterations - 4
[1] -6.665531647  0.082366229  0.001992166  0.442315330
Deviance = 3224.812 Iterations - 5
[1] -6.673085394  0.082464718  0.001997547  0.442980128
# stop printing for future models
untrace(glm.fit)

Inference for coefficients

Inference for coefficients

There are two approaches for testing coefficients in logistic regression

  • Drop-in-deviance test. Use to test…

    • a single coefficient
    • a categorical predictor with 3+ levels
    • a group of predictor variables
  • (Wald) hypothesis test. Use to test

    • a single coefficient

Hypothesis test for \(\beta_j\)

Hypotheses: \(H_0: \beta_j = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_j \neq 0\), given the other variables in the model


(Wald) Test Statistic: \[z = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}\]


P-value: \(P(|Z| > |z|)\), where \(Z \sim N(0, 1)\), the Standard Normal distribution

Confidence interval for \(\beta_j\)

We can calculate the C% confidence interval for \(\beta_j\) as the following:

\[ \hat{\beta}_j \pm z^* \times SE(\hat{\beta}_j) \]

where \(z^*\) is calculated from the \(N(0,1)\) distribution


Note

This is an interval for the change in the log-odds for every one unit increase in \(x_j\)

Interpretation in terms of the odds

Exponentiating the interval endpoints gives an interval for the multiplicative change in the odds for every one unit increase in \(x_j\):

\[ \exp\{\hat{\beta}_j \pm z^* \times SE(\hat{\beta}_j)\} \]


Interpretation: We are \(C\%\) confident that for every one unit increase in \(x_j\), the odds multiply by a factor of \(\exp\{\hat{\beta}_j - z^*\times SE(\hat{\beta}_j)\}\) to \(\exp\{\hat{\beta}_j + z^* \times SE(\hat{\beta}_j)\}\), holding all else constant.
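
As a sketch of the calculation, plugging in the age coefficient from the model in these slides (estimate 0.08246, SE 0.00575):

```r
# Sketch: a normal-approximation 95% CI for beta_age, exponentiated to
# the odds scale. Estimate and SE are plugged in from the slides; the
# profile-likelihood interval reported by tidy() may differ slightly.
beta_hat <- 0.08246
se       <- 0.00575
z_star   <- qnorm(0.975)  # approximately 1.96

ci_log_odds <- beta_hat + c(-1, 1) * z_star * se  # CI for the log-odds change
ci_odds     <- exp(ci_log_odds)                   # CI for the odds multiplier
ci_odds
```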

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     -6.67309    0.37815  -17.64688  0.00000  -7.42283   -5.94008
age              0.08246    0.00575   14.34353  0.00000   0.07127    0.09382
totChol          0.00200    0.00103    1.94024  0.05235  -0.00003    0.00401
currentSmoker1   0.44298    0.09359    4.73305  0.00000   0.25998    0.62699


Hypotheses:

\[ H_0: \beta_{age} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{age} \neq 0 \], given total cholesterol and smoking status are in the model

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     -6.67309    0.37815  -17.64688  0.00000  -7.42283   -5.94008
age              0.08246    0.00575   14.34353  0.00000   0.07127    0.09382
totChol          0.00200    0.00103    1.94024  0.05235  -0.00003    0.00401
currentSmoker1   0.44298    0.09359    4.73305  0.00000   0.25998    0.62699

Test statistic:

\[z = \frac{0.08246 - 0}{0.00575} = 14.34 \]

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     -6.67309    0.37815  -17.64688  0.00000  -7.42283   -5.94008
age              0.08246    0.00575   14.34353  0.00000   0.07127    0.09382
totChol          0.00200    0.00103    1.94024  0.05235  -0.00003    0.00401
currentSmoker1   0.44298    0.09359    4.73305  0.00000   0.25998    0.62699

P-value:

\[ P(|Z| > 14.34) \approx 0 \]

2 * pnorm(14.34, lower.tail = FALSE)
[1] 1.230554e-46

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     -6.67309    0.37815  -17.64688  0.00000  -7.42283   -5.94008
age              0.08246    0.00575   14.34353  0.00000   0.07127    0.09382
totChol          0.00200    0.00103    1.94024  0.05235  -0.00003    0.00401
currentSmoker1   0.44298    0.09359    4.73305  0.00000   0.25998    0.62699

Conclusion:

The p-value is very small, so we reject \(H_0\). The data provide sufficient evidence that age is a statistically significant predictor of whether someone is at high risk of heart disease, after accounting for total cholesterol and smoking status.

CI for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol            0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1     0.443      0.094      4.733    0.000     0.260      0.627

Interpret the 95% confidence interval for age in terms of the odds of being high risk for heart disease.

Recap

  • Comparing logistic regression models using

    • Drop-in-deviance test

    • AIC

    • BIC

  • Inference for a single model coefficient

Footnotes

  1. Akaike, Hirotugu. “A new look at the statistical model identification.” IEEE Transactions on Automatic Control 19.6 (1974): 716-723.↩︎

  2. Schwarz, Gideon. “Estimating the dimension of a model.” The Annals of Statistics (1978): 461-464.↩︎