Logistic regression

Prof. Maria Tackett

Mar 20, 2025

Announcements

  • Exploratory data analysis due TODAY at 11:59pm

    • Next milestone: Project presentations March 31 in lab
  • Statistics experience due April 15

Topics

  • Odds and probabilities

  • Interpret the coefficients of a logistic regression model with

    • a single categorical predictor
    • a single quantitative predictor
    • multiple predictors

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Probabilities vs. odds (Ledolter 2003)

Scenario 1: Suppose the probability of a disease among a population of unvaccinated individuals is 0.00369, and the probability of the disease is 0.001 among a population of vaccinated individuals.

Scenario 2: Suppose the probability of a disease among a population of unvaccinated individuals is 0.48052 and the probability of the disease is 0.2 among a population of vaccinated individuals.

  • What is the difference in the probability of disease for these two populations?

  • What are the odds of disease in the population without a vaccine relative to the odds of disease in the population with a vaccine?
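As a rough check on the two questions above, the following base R sketch computes the difference in probabilities and the odds ratio for each scenario. The helper `odds()` and the variable names are illustrative and not part of the lecture code.

```r
# Hypothetical helper: convert a probability to odds
odds <- function(p) p / (1 - p)

# Probabilities of disease taken from the two scenarios
p_unvax <- c(scenario1 = 0.00369, scenario2 = 0.48052)
p_vax   <- c(scenario1 = 0.001,   scenario2 = 0.2)

# Difference in probabilities (unvaccinated minus vaccinated)
prob_diff <- p_unvax - p_vax

# Odds ratio: odds without a vaccine relative to odds with a vaccine
odds_ratio <- odds(p_unvax) / odds(p_vax)

round(prob_diff, 5)
round(odds_ratio, 2)
```

The differences in probability are far apart (about 0.003 versus 0.28), yet the odds ratio is about 3.7 in both scenarios.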

Logistic regression

From odds to probabilities

odds

\[\text{odds} = \frac{p}{1-p}\]

probability

\[p = \frac{\text{odds}}{1 + \text{odds}}\]
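The two formulas are inverses of one another. A minimal sketch in base R, with helper names chosen here for illustration:

```r
# Illustrative helpers (names are assumptions, not from the lecture)
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p <- c(0.1, 0.2, 0.5, 0.8)

# A probability of 0.5 corresponds to odds of 1
odds <- prob_to_odds(p)
odds

# Converting back recovers the original probabilities
odds_to_prob(odds)
```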

From odds to probabilities

  1. Logistic model: \(\log\big(\frac{p}{1-p}\big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\)

  2. Odds = \(\exp\big\{\log\big(\frac{p}{1-p}\big)\big\} = \frac{p}{1-p}\)

  3. Combining (1) and (2) with what we saw earlier

\[p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}\]

Logistic regression model

Logit form: \[\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\]

Probability form:

\[ \text{probability} = p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}} \]
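To see that the two forms describe the same model, the sketch below evaluates both for a single predictor. The coefficient and predictor values are made up for illustration. `plogis()` and `qlogis()` are the base R inverse-logit and logit functions.

```r
# Made-up coefficients and predictor value, for illustration only
beta0 <- -1
beta1 <- 0.5
x1 <- 3

# Logit form: the linear predictor is the log-odds
log_odds <- beta0 + beta1 * x1

# Probability form, written out as in the formula
p_manual <- exp(log_odds) / (1 + exp(log_odds))

# plogis() is the inverse logit; qlogis() is the logit
p_builtin <- plogis(log_odds)

all.equal(p_manual, p_builtin)
all.equal(qlogis(p_builtin), log_odds)
```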

Why is there no error term \(\epsilon\) when writing the statistical model for logistic regression?

Data: Concern about rising AI

This data comes from the 2023 Pew Research Center’s American Trends Panel. The survey aims to capture public opinion about a variety of topics including politics, religion, and technology, among others. We will use data from respondents in Wave 132 of the survey conducted July 31 - August 6, 2023 who completed the survey in 70 minutes or less.


The goal of this analysis is to understand the relationship between age, how much someone has heard about artificial intelligence (AI), and concern about the increased use of AI in daily life.


A more complete analysis on this topic can be found in the Pew Research Center article Growing public concern about the role of artificial intelligence in daily life by Alec Tyson and Emma Kikuchi.

Variables

  • ai_concern: Whether a respondent said they are “more concerned than excited” about the increased use of AI in daily life (1: yes, 0: no)

Variables

  • survey_time: Time to complete the survey (in minutes)

  • age_cat: Age category

    • 18-29
    • 30-49
    • 50-64
    • 65+
    • Refused

Odds ratios: Concern about AI vs. age

Age        Not Concerned    Concerned
18-29                487          380
30-49               1661         1470
50-64               1257         1680
65+                 1252         1858
Refused               18           23
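The odds and odds ratios can be computed directly from these counts. A minimal base R sketch follows; the data frame and column names are chosen here for illustration.

```r
# Counts copied from the table above
ai_counts <- data.frame(
  age_cat       = c("18-29", "30-49", "50-64", "65+", "Refused"),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

# Odds of being concerned within each age group
ai_counts$odds <- ai_counts$concerned / ai_counts$not_concerned

# Odds ratio relative to the 18-29 group (first row)
ai_counts$odds_ratio <- ai_counts$odds / ai_counts$odds[1]

ai_counts
```

For example, the odds for 30-49 year olds are about 1.134 times the odds for 18-29 year olds.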

Let’s fit the model

ai_concern_fit <- glm(ai_concern ~ age_cat, data = pew_data, 
                      family = "binomial")

tidy(ai_concern_fit) |> kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -0.248 0.068 -3.625 0.000
age_cat30-49 0.126 0.077 1.630 0.103
age_cat50-64 0.538 0.078 6.904 0.000
age_cat65+ 0.643 0.078 8.284 0.000
age_catRefused 0.493 0.322 1.531 0.126
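Since pew_data is not included here, the sketch below refits the same model from the aggregated counts in the age table, using the two-column (successes, failures) response form of `glm()`. It assumes the table covers exactly the respondents used in the fit above; the object names are illustrative.

```r
# Counts copied from the age table; levels kept in table order
ages <- c("18-29", "30-49", "50-64", "65+", "Refused")
ai_counts <- data.frame(
  age_cat       = factor(ages, levels = ages),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

# Grouped binomial fit: response is cbind(successes, failures)
fit_grouped <- glm(cbind(concerned, not_concerned) ~ age_cat,
                   data = ai_counts, family = binomial)

# Coefficient estimates on the log-odds scale
round(coef(fit_grouped), 3)
```

The estimates should agree with the table above to three decimals. The intercept is the log-odds for the baseline group, log(380/487).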

The model


\[\begin{aligned}\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) =& -0.248 + 0.126\times\text{age_cat30-49} + 0.538 \times \text{age_cat50-64}\\ &+ 0.643 \times \text{age_cat65+} + 0.493\times \text{age_catRefused} \end{aligned}\]

where \(\hat{p}\) is the predicted probability of being concerned about increased use of AI in daily life

Interpreting age_cat30-49: log-odds

The log-odds of being concerned about increased use of AI in daily life are expected to be 0.126 higher for 30-49 year olds compared to 18-29 year olds (the baseline group).

Warning

We would not use the interpretation in terms of log-odds in practice.

Interpreting age_cat30-49: odds

The odds of being concerned about increased use of AI in daily life for 30-49 year olds are expected to be 1.134 ( \(e^{0.126}\) ) times the odds for 18-29 year olds.

Coefficients & odds ratios

The model coefficient, 0.126, is the expected difference in the log-odds when comparing 30-49 year olds to 18-29 year olds.

Therefore, \(e^{0.126}\) = 1.134 is the odds ratio: the factor by which the odds are expected to multiply when comparing 30-49 year olds to 18-29 year olds.

\[ OR = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\} \]
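A short sketch of this conversion, using the rounded estimates from the age_cat model output (the vector name is illustrative):

```r
# Estimates copied from the age_cat model output (log-odds scale)
est <- c("age_cat30-49"   = 0.126,
         "age_cat50-64"   = 0.538,
         "age_cat65+"     = 0.643,
         "age_catRefused" = 0.493)

# Exponentiate to get odds ratios relative to the 18-29 baseline
odds_ratios <- exp(est)
round(odds_ratios, 3)
```

With the fitted model object available, `tidy(ai_concern_fit, exponentiate = TRUE)` should report the coefficients on this odds scale directly.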

Interpret in terms of percent change

You can also interpret the change in the odds as a percent change, computed as follows:

\[\% \text{ change } = (e^{\hat{\beta}_j} - 1) \times 100\]


Interpret the coefficient of age_cat30-49 (0.126) in terms of the percent change in the odds.
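One way to work this out, as a minimal sketch using the rounded coefficient:

```r
# Coefficient of age_cat30-49 from the model output
beta_hat <- 0.126

# Percent change in the odds relative to the baseline group
pct_change <- (exp(beta_hat) - 1) * 100
round(pct_change, 1)
```

That is, the odds of being concerned are expected to be about 13.4% higher for 30-49 year olds than for 18-29 year olds.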

Quantitative predictor

Now let’s look at the relationship between survey_time and ai_concern

ai_time_fit <- glm(ai_concern ~ survey_time, data = pew_data,
                   family = "binomial")
term estimate std.error statistic p.value
(Intercept) 0.069 0.037 1.853 0.064
survey_time 0.005 0.002 2.434 0.015

For each additional minute of taking the survey, the odds of being concerned about increased use of AI in daily life are expected to multiply by a factor of 1.005 ( \(e^{0.005}\)).
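The same coefficient also gives the multiplicative change for longer time differences, since the factors compound. A small sketch using the rounded estimate (the minute values are chosen for illustration):

```r
# Coefficient of survey_time from the model output
beta_time <- 0.005

# Multiplicative change in the odds for a difference of k minutes
minutes <- c(1, 10, 30)
odds_factor <- exp(beta_time * minutes)
names(odds_factor) <- paste(minutes, "min")
round(odds_factor, 3)

# A 10-minute difference is the 1-minute factor applied 10 times
all.equal(unname(odds_factor[2]), exp(beta_time)^10)
```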

Multiple predictors

Now let’s consider a model that takes into account age_cat, ai_heard, and survey_time

ai_concern_full_fit <- glm(ai_concern ~ age_cat + ai_heard + 
                             survey_time, data = pew_data, family = "binomial")

Multiple predictors

term estimate std.error statistic p.value
(Intercept) -0.405 0.077 -5.230 0.000
age_cat30-49 0.117 0.078 1.504 0.132
age_cat50-64 0.519 0.079 6.587 0.000
age_cat65+ 0.604 0.079 7.611 0.000
age_catRefused 0.557 0.325 1.716 0.086
ai_heardA little 0.371 0.043 8.654 0.000
ai_heardNothing at all -0.243 0.085 -2.876 0.004
ai_heardRefused -0.571 0.505 -1.131 0.258
survey_time -0.001 0.002 -0.369 0.712

Interpretation


Use the model on the previous slide.

  • Describe the type of respondent represented by the intercept.
  • Interpret the effect of ai_heardNothing at all in terms of the odds of being concerned by increased use of AI in daily life.
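A possible starting point for these questions, using rounded estimates from the model output. The intercept corresponds to a respondent at the baseline level of every categorical predictor (18-29 for age_cat, and whichever ai_heard level is omitted from the table) with survey_time equal to 0.

```r
# Estimates copied from the multiple-predictor model output
est_full <- c("(Intercept)"            = -0.405,
              "ai_heardNothing at all" = -0.243)

# Odds ratio for "Nothing at all" vs. the baseline ai_heard level,
# holding age_cat and survey_time constant
or_nothing <- exp(est_full[["ai_heardNothing at all"]])
round(or_nothing, 3)

# The same comparison expressed as a percent change in the odds
pct_nothing <- (or_nothing - 1) * 100
round(pct_nothing, 1)

# Predicted odds of being concerned for the intercept respondent
round(exp(est_full[["(Intercept)"]]), 3)
```

Holding age and survey time constant, the odds of being concerned for respondents who have heard nothing at all about AI are expected to be about 0.784 times the odds for the baseline ai_heard group, or roughly 21.6% lower.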

Prediction

Predicted log odds

augment(ai_concern_full_fit) |> select(.fitted) |> slice(1:5)
# A tibble: 5 × 1
  .fitted
    <dbl>
1 -0.0608
2  0.0756
3  0.473 
4  0.560 
5  0.563 

For observation 1

\[\text{predicted odds} = \hat{\text{odds}} = \frac{\hat{p}}{1-\hat{p}} = e^{-0.0608} = 0.941\]

Predicted probabilities


The predicted log-odds for observation 1 is -0.0608. What is the predicted probability this respondent is concerned about increased use of AI in daily life?
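One way to work through this conversion, as a sketch using the rounded log-odds shown above:

```r
# Predicted log-odds for observation 1
log_odds_1 <- -0.0608

# Log-odds -> odds -> probability
odds_1 <- exp(log_odds_1)
p_1 <- odds_1 / (1 + odds_1)

# plogis() performs the same conversion in one step
all.equal(p_1, plogis(log_odds_1))

round(c(odds = odds_1, probability = p_1), 3)
```

The predicted probability is about 0.485, slightly below 0.5, which is consistent with predicted odds slightly below 1.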

Predicted probabilities

We can calculate predicted probabilities using the argument type = "response" in predict.glm()

predict.glm(ai_concern_full_fit, type = "response")

Showing the predictions for the first 10 observations

        1         2         3         4         5         6         7         8 
0.4848067 0.5188941 0.6161912 0.6364755 0.6371220 0.6366698 0.6159500 0.5257991 
        9        10 
0.4898898 0.6329262 
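The relationship between the two prediction scales can be checked on a small stand-in model. Because pew_data is not included here, this sketch fits the age-only model to the aggregated counts from the age table; the object names are illustrative. In predict.glm(), the default is type = "link", which returns predicted log-odds.

```r
# Counts copied from the age table; levels kept in table order
ages <- c("18-29", "30-49", "50-64", "65+", "Refused")
ai_counts <- data.frame(
  age_cat       = factor(ages, levels = ages),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

fit_grouped <- glm(cbind(concerned, not_concerned) ~ age_cat,
                   data = ai_counts, family = binomial)

# type = "link" gives log-odds; type = "response" gives probabilities
pred_log_odds <- predict(fit_grouped, type = "link")
pred_probs    <- predict(fit_grouped, type = "response")

# Applying the inverse logit to the log-odds gives the probabilities
all.equal(plogis(pred_log_odds), pred_probs)
```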

Predicted probability for new observation


Recall the model that includes predictors age_cat, ai_heard, and survey_time.

  • What are the predicted odds for a 70-year-old respondent who has heard nothing about AI and took 60 minutes to complete the survey?
  • What is the predicted probability this respondent is not concerned about increased use of AI in daily life?
  • Would you classify this person as someone who is concerned or someone who is not? Why?

Predicted probability for new observation

new_obs <- tibble(age_cat = "65+", ai_heard = "Nothing at all",  
                  survey_time = 60)

predict.glm(ai_concern_full_fit, newdata = new_obs, 
            type = "response")
        1 
0.4780527 
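The same prediction can be approximated by hand from the rounded coefficients in the model output. The sketch below is a rough check rather than a replacement for predict.glm(). The 0.5 cutoff used for classification is an assumption made here for illustration.

```r
# Rounded estimates copied from the multiple-predictor model output
b_intercept <- -0.405
b_age_65    <-  0.604   # age_cat65+
b_nothing   <- -0.243   # ai_heardNothing at all
b_time      <- -0.001   # survey_time (per minute)

# Linear predictor (log-odds) for a 65+ respondent who has heard
# nothing about AI and took 60 minutes
log_odds_new <- b_intercept + b_age_65 + b_nothing + b_time * 60

odds_new   <- exp(log_odds_new)   # predicted odds of being concerned
p_new      <- plogis(log_odds_new) # predicted probability of concern
p_not_conc <- 1 - p_new            # predicted probability of no concern

# Classify using an assumed cutoff of 0.5
predicted_class <- ifelse(p_new > 0.5, "concerned", "not concerned")

round(c(log_odds = log_odds_new, odds = odds_new,
        p_concerned = p_new, p_not_concerned = p_not_conc), 3)
predicted_class
```

The hand calculation gives a probability of about 0.474, close to the 0.478 from predict.glm(); the small gap is plausibly due to rounding the coefficients to three decimals. Either way the probability is below 0.5, so under that cutoff the respondent would be classified as not concerned.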

Recap

  • Use the odds ratio to compare the odds of two groups

  • Interpret the coefficients of a logistic regression model with

    • a single categorical predictor
    • a single quantitative predictor
    • multiple predictors

References

Ledolter, Johannes. 2003. “The Statistical Sleuth.” Taylor & Francis.