Logistic regression

Prof. Maria Tackett

Mar 20, 2025

Announcements

Exploratory data analysis due TODAY at 11:59pm
- Next milestone: Project presentations March 31 in lab
Statistics experience due April 15

Topics

Odds and probabilities
Interpret the coefficients of a logistic regression model with
- a single categorical predictor
- a single quantitative predictor
- multiple predictors

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Probabilities vs. odds¹

Scenario 1: Suppose the probability of a disease among a population of unvaccinated individuals is 0.00369, and the probability of the disease is 0.001 among a population of vaccinated individuals.

Scenario 2: Suppose the probability of a disease among a population of unvaccinated individuals is 0.48052 and the probability of the disease is 0.2 among a population of vaccinated individuals.

What is the difference in the probability of disease for these two populations?
What are the odds of disease in the population without a vaccine relative to the odds of disease in the population with a vaccine?
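As a quick check of the two scenarios above, the following R sketch computes the difference in probabilities and the odds ratio for each. The helper `odds()` and the vector names are illustrative choices, not part of the original slides.

```r
# Helper (illustrative name): convert a probability to odds
odds <- function(p) p / (1 - p)

# Probabilities of disease from the two scenarios
p_unvax <- c(scenario1 = 0.00369, scenario2 = 0.48052)
p_vax   <- c(scenario1 = 0.001,   scenario2 = 0.2)

# Difference in probabilities (unvaccinated minus vaccinated)
diff_p <- p_unvax - p_vax

# Odds ratio: odds without the vaccine relative to odds with the vaccine
odds_ratio <- odds(p_unvax) / odds(p_vax)

round(diff_p, 5)
round(odds_ratio, 2)
```

The differences in probability are far apart (about 0.003 versus about 0.281), yet the odds ratio is about 3.7 in both scenarios.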

Logistic regression

From odds to probabilities

odds

\[\text{odds} = \frac{p}{1-p}\]

probability

\[p = \frac{\text{odds}}{1 + \text{odds}}\]
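The two formulas above are inverses of each other. A minimal sketch in R, using hypothetical helper names `prob_to_odds()` and `odds_to_prob()`, converts a few probabilities to odds and back:

```r
# Illustrative helpers for the two conversions
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p <- c(0.1, 0.5, 0.8)

o <- prob_to_odds(p)       # odds corresponding to each probability
p_back <- odds_to_prob(o)  # converting back should recover p

o
all.equal(p_back, p)
```

A probability of 0.5 corresponds to odds of 1, and a probability of 0.8 corresponds to odds of 4.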

From odds to probabilities

(1) Logistic model: \(\log\big(\frac{p}{1-p}\big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\)
(2) Odds = \(\exp\big\{\log\big(\frac{p}{1-p}\big)\big\} = \frac{p}{1-p}\)
(3) Combining (1) and (2) with what we saw earlier

\[p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}\]

Logistic regression model

Logit form: \[\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\]

Probability form:

\[ \text{probability} = p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}} \]
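In R, the probability form is available as `plogis()` and the logit form as `qlogis()`, both in the base `stats` package. The sketch below checks them against the formulas above for a single predictor; the coefficient values `beta0` and `beta1` are made up for illustration and do not come from any fitted model.

```r
# Hypothetical coefficients, chosen only to illustrate the two forms
beta0 <- -1
beta1 <- 0.5
x <- c(0, 2, 4)

# Logit form: the linear predictor is the log-odds
log_odds <- beta0 + beta1 * x

# Probability form, written out as in the formula
p_manual <- exp(log_odds) / (1 + exp(log_odds))

# The same computation with the built-in inverse logit
p_plogis <- plogis(log_odds)

all.equal(p_manual, p_plogis)
all.equal(qlogis(p_plogis), log_odds)  # qlogis() maps p back to log-odds
```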

Why is there no error term \(\epsilon\) when writing the statistical model for logistic regression?

Data: Concern about rising AI

This data comes from the 2023 Pew Research Center’s American Trends Panel. The survey aims to capture public opinion about a variety of topics including politics, religion, and technology, among others. We will use data from respondents in Wave 132 of the survey conducted July 31 - August 6, 2023 who completed the survey in 70 minutes or less.

The goal of this analysis is to understand the relationship between age, how much someone has heard about artificial intelligence (AI), and concern about the increased use of AI in daily life.

A more complete analysis on this topic can be found in the Pew Research Center article Growing public concern about the role of artificial intelligence in daily life by Alec Tyson and Emma Kikuchi.

Variables

ai_concern: Whether a respondent said they are “more concerned than excited” about the increased use of AI in daily life (1: yes, 0: no)

Variables

survey_time: Time to complete the survey (in minutes)
age_cat: Age category
- 18-29
- 30-49
- 50-64
- 65+
- Refused

Odds ratios: Concern about AI vs. age

Age	Not Concerned	Concerned
18-29	487	380
30-49	1661	1470
50-64	1257	1680
65+	1252	1858
Refused	18	23
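The odds and odds ratios can be computed directly from the counts in the table. This is a sketch that retypes the table into a data frame named `age_counts` (an illustrative name) and uses 18-29 as the comparison group.

```r
# Counts copied from the table above
age_counts <- data.frame(
  age_cat       = c("18-29", "30-49", "50-64", "65+", "Refused"),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

# Odds of being concerned within each age group
age_counts$odds <- age_counts$concerned / age_counts$not_concerned

# Odds ratio relative to the 18-29 group (first row)
age_counts$odds_ratio <- age_counts$odds / age_counts$odds[1]

# Log of the odds ratio, for comparison with model coefficients
age_counts$log_odds_ratio <- log(age_counts$odds_ratio)

age_counts
```

For example, the odds for 30-49 year-olds are about 1.134 times the odds for 18-29 year-olds.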

Let’s fit the model

ai_concern_fit <- glm(ai_concern ~ age_cat, data = pew_data, 
                      family = "binomial")

tidy(ai_concern_fit) |> kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-0.248	0.068	-3.625	0.000
age_cat30-49	0.126	0.077	1.630	0.103
age_cat50-64	0.538	0.078	6.904	0.000
age_cat65+	0.643	0.078	8.284	0.000
age_catRefused	0.493	0.322	1.531	0.126
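Since `pew_data` is not reproduced here, one way to check these estimates is to fit the same model to the aggregated counts from the age table. The sketch below assumes those counts are the full data behind the model; `age_counts` and `fit_counts` are illustrative names. With a single categorical predictor, the grouped fit should give the same coefficients as the respondent-level fit.

```r
# Counts copied from the "Concern about AI vs. age" table
age_counts <- data.frame(
  age_cat = factor(
    c("18-29", "30-49", "50-64", "65+", "Refused"),
    levels = c("18-29", "30-49", "50-64", "65+", "Refused")  # 18-29 is the baseline
  ),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

# Grouped binomial response: cbind(successes, failures)
fit_counts <- glm(cbind(concerned, not_concerned) ~ age_cat,
                  data = age_counts, family = binomial)

round(coef(fit_counts), 3)
```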

The model

term	estimate	std.error	statistic	p.value
(Intercept)	-0.248	0.068	-3.625	0.000
age_cat30-49	0.126	0.077	1.630	0.103
age_cat50-64	0.538	0.078	6.904	0.000
age_cat65+	0.643	0.078	8.284	0.000
age_catRefused	0.493	0.322	1.531	0.126

\[\begin{aligned}\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) =& -0.248 + 0.126\times\text{age_cat30-49} + 0.538 \times \text{age_cat50-64}\\ &+ 0.643 \times \text{age_cat65+} + 0.493\times \text{age_catRefused} \end{aligned}\]

where \(\hat{p}\) is the predicted probability of being concerned about increased use of AI in daily life

Interpreting `age_cat30-49`: log-odds

term	estimate	std.error	statistic	p.value
(Intercept)	-0.248	0.068	-3.625	0.000
age_cat30-49	0.126	0.077	1.630	0.103
age_cat50-64	0.538	0.078	6.904	0.000
age_cat65+	0.643	0.078	8.284	0.000
age_catRefused	0.493	0.322	1.531	0.126

The log-odds of being concerned about increased use of AI in daily life are expected to be 0.126 higher for individuals 30 - 49 years old compared to 18-29 year-olds (the baseline group).

Warning

We would not use the interpretation in terms of log-odds in practice.

Interpreting `age_cat30-49`: odds

The odds of being concerned about increased use of AI in daily life for 30 - 49 year olds are expected to be 1.134 ( \(e^{0.126}\) ) times the odds for 18-29 year olds.

Coefficients & odds ratios

The model coefficient, 0.126, is the expected difference in the log-odds when comparing 30 - 49 year olds to 18 - 29 year olds.

Therefore, \(e^{0.126}\) = 1.134 is the expected multiplicative change in the odds (the odds ratio) when comparing 30 - 49 year olds to 18-29 year olds.

\[ OR = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\} \]
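Applying this formula to the age coefficients takes a single call to `exp()`. The sketch below retypes the rounded estimates from the model output, so results are accurate only to about three decimal places; the vector names are illustrative.

```r
# Rounded estimates copied from the model output
log_odds_coefs <- c(
  "age_cat30-49"   = 0.126,
  "age_cat50-64"   = 0.538,
  "age_cat65+"     = 0.643,
  "age_catRefused" = 0.493
)

# Odds ratios relative to the 18-29 baseline group
odds_ratios <- exp(log_odds_coefs)

round(odds_ratios, 3)
```

With the fitted model object available, `tidy(ai_concern_fit, exponentiate = TRUE)` from broom should report the exponentiated estimates directly.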

Interpret in terms of percent change

You can also interpret the change in the odds in terms of a percent change. The percent change in the odds can be computed as the following

\[\% \text{ change } = (e^{\hat{\beta}_j} - 1) \times 100\]

Interpret the coefficient of age_cat30-49 (0.126) in terms of the percent change in the odds.
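A small sketch of the percent-change formula, using a hypothetical helper `pct_change_odds()` and the rounded coefficient 0.126:

```r
# Percent change in the odds implied by a logistic regression coefficient
pct_change_odds <- function(beta_hat) (exp(beta_hat) - 1) * 100

pct_30_49 <- pct_change_odds(0.126)
round(pct_30_49, 1)
```

The odds of being concerned are expected to be about 13.4% higher for 30-49 year-olds than for 18-29 year-olds.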

Quantitative predictor

Now let’s look at the relationship between survey_time and ai_concern

ai_time_fit <- glm(ai_concern ~ survey_time, data = pew_data,
                   family = "binomial")

term	estimate	std.error	statistic	p.value
(Intercept)	0.069	0.037	1.853	0.064
survey_time	0.005	0.002	2.434	0.015

For each additional minute of taking the survey, the odds of being concerned about increased AI in daily life are expected to multiply by a factor of 1.005 ( \(e^{0.005}\)).
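The same logic extends to changes of more than one minute: a k-minute increase multiplies the odds by \(e^{k \times 0.005}\). The sketch below uses the rounded coefficient from the output above, and `odds_factor()` is an illustrative helper.

```r
# Rounded coefficient of survey_time from the model output
beta_time <- 0.005

# Multiplicative change in the odds for a k-minute increase in survey time
odds_factor <- function(k, beta = beta_time) exp(k * beta)

odds_factor(1)   # one additional minute
odds_factor(10)  # ten additional minutes

# A 10-minute change is the 1-minute factor applied ten times
all.equal(odds_factor(10), odds_factor(1)^10)
```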

Multiple predictors

Now let’s consider a model that takes into account age, ai_heard, and survey_time

ai_concern_full_fit <- glm(ai_concern ~ age_cat + ai_heard + 
                             survey_time, data = pew_data, family = "binomial")

Multiple predictors

term	estimate	std.error	statistic	p.value
(Intercept)	-0.405	0.077	-5.230	0.000
age_cat30-49	0.117	0.078	1.504	0.132
age_cat50-64	0.519	0.079	6.587	0.000
age_cat65+	0.604	0.079	7.611	0.000
age_catRefused	0.557	0.325	1.716	0.086
ai_heardA little	0.371	0.043	8.654	0.000
ai_heardNothing at all	-0.243	0.085	-2.876	0.004
ai_heardRefused	-0.571	0.505	-1.131	0.258
survey_time	-0.001	0.002	-0.369	0.712

Interpretation

Use the model on the previous slide.

Describe the type of respondent represented by the intercept.
Interpret the effect of ai_heardNothing at all in terms of the odds of being concerned by increased use of AI in daily life.
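One way to approach the second question is to exponentiate the coefficient. This sketch uses the rounded estimate from the output; the baseline level of ai_heard is not named in that output, so the comparison is stated against "the baseline category".

```r
# Rounded estimate for ai_heardNothing at all from the model output
beta_nothing <- -0.243

# Odds ratio relative to the baseline category of ai_heard,
# holding age_cat and survey_time constant
or_nothing <- exp(beta_nothing)

# Equivalent percent change in the odds
pct_nothing <- (or_nothing - 1) * 100

round(or_nothing, 3)
round(pct_nothing, 1)
```

An odds ratio below 1 means the odds of being concerned are expected to be lower for respondents who have heard nothing at all about AI than for respondents in the baseline category, holding age and survey time constant.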

Prediction

Predicted log odds

augment(ai_concern_full_fit) |> select(.fitted)

# A tibble: 5 × 1
  .fitted
    <dbl>
1 -0.0608
2  0.0756
3  0.473 
4  0.560 
5  0.563

For observation 1

\[\text{predicted odds} = \hat{\text{odds}} = \frac{\hat{p}}{1-\hat{p}} = e^{-0.0608} = 0.941\]

Predicted probabilities

The predicted log-odds for observation 1 is -0.0608. What is the predicted probability this respondent is concerned about increased use of AI in daily life?
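A sketch of the conversion for observation 1, going from log-odds to odds to probability and then checking the result against `plogis()`:

```r
# Predicted log-odds for observation 1, from the augment() output
log_odds_1 <- -0.0608

# Log-odds -> odds
odds_1 <- exp(log_odds_1)

# Odds -> probability
p_1 <- odds_1 / (1 + odds_1)

round(odds_1, 3)
round(p_1, 3)

# plogis() performs both steps at once
all.equal(p_1, plogis(log_odds_1))
```

The predicted probability is about 0.485, so this respondent is predicted to be slightly more likely not concerned than concerned.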

Predicted probabilities

We can calculate predicted probabilities using the argument type = "response" in predict.glm()¹

predict.glm(ai_concern_full_fit, type = "response")

Showing the predictions for the first 10 observations

        1         2         3         4         5         6         7         8 
0.4848067 0.5188941 0.6161912 0.6364755 0.6371220 0.6366698 0.6159500 0.5257991 
        9        10 
0.4898898 0.6329262

Predicted probability for new observation

Recall the model that includes predictors age_cat, ai_heard, and survey_time.

What are the predicted odds for a 70-year-old respondent who has heard nothing about AI and took 60 minutes to complete the survey?
What is the predicted probability this respondent is not concerned about increased use of AI in daily life?
Would you classify this person as someone who is concerned or someone who is not? Why?

Predicted probability for new observation

new_obs <- tibble(age_cat = "65+", ai_heard = "Nothing at all",  
                  survey_time = 60)

predict.glm(ai_concern_full_fit, newdata = new_obs, 
            type = "response")

        1 
0.4780527
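The same prediction can be approximated by hand from the coefficient table. The sketch below uses the rounded estimates, so it will not match `predict.glm()` exactly; the names in `coefs` are illustrative.

```r
# Rounded estimates copied from the multiple-predictor model output
coefs <- c(
  intercept     = -0.405,
  age_65plus    =  0.604,  # age_cat65+
  heard_nothing = -0.243,  # ai_heardNothing at all
  survey_time   = -0.001
)

# Linear predictor for a 65+ respondent who heard nothing at all
# about AI and took 60 minutes to complete the survey
log_odds_new <- coefs[["intercept"]] + coefs[["age_65plus"]] +
  coefs[["heard_nothing"]] + coefs[["survey_time"]] * 60

odds_new        <- exp(log_odds_new)    # predicted odds of being concerned
p_concerned     <- plogis(log_odds_new) # predicted probability of being concerned
p_not_concerned <- 1 - p_concerned      # predicted probability of not being concerned

round(c(odds = odds_new, p_concerned = p_concerned,
        p_not_concerned = p_not_concerned), 3)
```

The hand calculation gives a probability of about 0.474, compared to 0.478 from `predict.glm()`. The gap comes from rounding the coefficients, mainly the survey_time estimate, which is multiplied by 60. Under a 0.5 threshold (an assumption, since the slides do not specify one), either value would classify this respondent as not concerned.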

Recap

Use the odds ratio to compare the odds of two groups
Interpret the coefficients of a logistic regression model with
- a single categorical predictor
- a single quantitative predictor
- multiple predictors

References

Ledolter, Johannes. 2003. “The Statistical Sleuth.” Taylor & Francis.
