# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))
Logistic regression
Announcements
- Exploratory data analysis due TODAY at 11:59pm
  - Next milestone: Project presentations March 31 in lab
- Statistics experience due April 15
Topics
- Odds and probabilities
- Interpret the coefficients of a logistic regression model with
  - a single categorical predictor
  - a single quantitative predictor
  - multiple predictors
Computational setup
Probabilities vs. odds
Scenario 1: Suppose the probability of a disease among a population of unvaccinated individuals is 0.00369, and the probability of the disease is 0.001 among a population of vaccinated individuals.
Scenario 2: Suppose the probability of a disease among a population of unvaccinated individuals is 0.48052 and the probability of the disease is 0.2 among a population of vaccinated individuals.
What is the difference in the probability of disease for these two populations?
What are the odds of disease in the population without a vaccine relative to the odds of disease in the population with a vaccine?
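One way to work through both scenarios is to compute the two quantities directly. The sketch below uses only the four probabilities stated in the scenarios; the helper name `odds()` and the data frame layout are introduced here for illustration.

```r
# Helper (hypothetical name): convert a probability to odds
odds <- function(p) p / (1 - p)

# Probabilities of disease as given in the two scenarios
scenarios <- data.frame(
  scenario = c(1, 2),
  p_unvacc = c(0.00369, 0.48052),
  p_vacc   = c(0.001, 0.2)
)

# Difference in probabilities (unvaccinated minus vaccinated)
scenarios$prob_diff <- scenarios$p_unvacc - scenarios$p_vacc

# Odds ratio: odds without the vaccine relative to odds with the vaccine
scenarios$odds_ratio <- odds(scenarios$p_unvacc) / odds(scenarios$p_vacc)

print(scenarios)
```

The differences in probability are far apart (about 0.003 versus about 0.28), yet the odds ratio is roughly 3.7 in both scenarios.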
Logistic regression
From odds to probabilities
odds
\[\text{odds} = \frac{p}{1-p}\]
probability
\[p = \frac{\text{odds}}{1 + \text{odds}}\]
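As a quick check that the two formulas undo each other, here is a minimal sketch in base R. The function names `prob_to_odds()` and `odds_to_prob()` are made up for this example, and the probability 0.2 is borrowed from Scenario 2.

```r
# Hypothetical helper names, introduced for illustration
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p      <- 0.2
o      <- prob_to_odds(p)   # 0.2 / 0.8 = 0.25
p_back <- odds_to_prob(o)   # 0.25 / 1.25 recovers 0.2

c(odds = o, probability = p_back)
```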
From odds to probabilities
1. Logistic model: \(\log\big(\frac{p}{1-p}\big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\)
2. Odds = \(\exp\big\{\log\big(\frac{p}{1-p}\big)\big\} = \frac{p}{1-p}\)
3. Combining (1) and (2) with what we saw earlier:
\[p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}\]
Logistic regression model
Logit form: \[\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\]
Probability form:
\[ \text{probability} = p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}} \]
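The two forms can be compared numerically. The sketch below uses arbitrary illustrative coefficients (`b0`, `b1`) and predictor values that are not taken from any model in these slides; `plogis()` and `qlogis()` are base R's inverse-logit and logit functions.

```r
# Arbitrary illustrative values (not estimated from any data)
b0 <- -1
b1 <- 0.5
x  <- c(-2, 0, 2, 4)

log_odds <- b0 + b1 * x                           # logit form
p_manual <- exp(log_odds) / (1 + exp(log_odds))   # probability form
p_plogis <- plogis(log_odds)                      # same calculation via base R

# qlogis() computes log(p / (1 - p)), so it maps back to the log-odds
all.equal(qlogis(p_manual), log_odds)
all.equal(p_manual, p_plogis)
```

Whatever value the linear predictor takes, the probability form always returns a number strictly between 0 and 1.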
Why is there no error term \(\epsilon\) when writing the statistical model for logistic regression?
Data: Concern about rising AI
This data comes from the 2023 Pew Research Center’s American Trends Panel. The survey aims to capture public opinion about a variety of topics including politics, religion, and technology, among others. We will use data from respondents in Wave 132 of the survey conducted July 31 - August 6, 2023 who completed the survey in 70 minutes or less.
The goal of this analysis is to understand the relationship between age, how much someone has heard about artificial intelligence (AI), and concern about the increased use of AI in daily life.
A more complete analysis on this topic can be found in the Pew Research Center article “Growing public concern about the role of artificial intelligence in daily life” by Alec Tyson and Emma Kikuchi.
Variables
ai_concern: Whether a respondent said they are “more concerned than excited” about the increased use of AI in daily life (1: yes, 0: no)
Variables
survey_time: Time to complete the survey (in minutes)
age_cat: Age category
- 18-29
- 30-49
- 50-64
- 65+
- Refused
Odds ratios: Concern about AI vs. age
| Age | Not Concerned | Concerned |
|---|---|---|
| 18-29 | 487 | 380 |
| 30-49 | 1661 | 1470 |
| 50-64 | 1257 | 1680 |
| 65+ | 1252 | 1858 |
| Refused | 18 | 23 |
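The odds and odds ratios implied by this table can be computed by hand. The sketch below copies the counts from the table into a data frame (the object and column names are made up for this example) and uses 18-29 as the reference group.

```r
# Counts copied from the table above
age_tab <- data.frame(
  age           = c("18-29", "30-49", "50-64", "65+", "Refused"),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

# Odds of being concerned within each age group
age_tab$odds <- age_tab$concerned / age_tab$not_concerned

# Odds ratio relative to the 18-29 group (row 1)
age_tab$odds_ratio <- age_tab$odds / age_tab$odds[1]

print(age_tab, digits = 4)
```

For example, the odds of being concerned are about 0.78 for 18-29 year-olds and about 0.89 for 30-49 year-olds, giving an odds ratio of about 1.13.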
Let’s fit the model
ai_concern_fit <- glm(ai_concern ~ age_cat, data = pew_data,
                      family = "binomial")
tidy(ai_concern_fit) |> kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.248 | 0.068 | -3.625 | 0.000 |
| age_cat30-49 | 0.126 | 0.077 | 1.630 | 0.103 |
| age_cat50-64 | 0.538 | 0.078 | 6.904 | 0.000 |
| age_cat65+ | 0.643 | 0.078 | 8.284 | 0.000 |
| age_catRefused | 0.493 | 0.322 | 1.531 | 0.126 |
The model
\[\begin{aligned}\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) =& -0.248 + 0.126\times\text{age_cat30-49} + 0.538 \times \text{age_cat50-64}\\
&+ 0.643 \times \text{age_cat65+} + 0.493\times \text{age_catRefused} \end{aligned}\]
where \(\hat{p}\) is the predicted probability of being concerned about increased use of AI in daily life.
Interpreting age_cat30-49: log-odds
The log-odds of being concerned about increased use of AI in daily life are expected to be 0.126 higher for 30-49 year-olds than for 18-29 year-olds (the baseline group).
We would not use the interpretation in terms of log-odds in practice.
Interpreting age_cat30-49: odds
The odds of being concerned about increased use of AI in daily life for 30-49 year-olds are expected to be 1.134 (\(e^{0.126}\)) times the odds for 18-29 year-olds.
Coefficients & odds ratios
The model coefficient, 0.126, is the expected difference in the log-odds when comparing 30-49 year-olds to 18-29 year-olds.
Therefore, \(e^{0.126} = 1.134\) is the expected multiplicative change in the odds when comparing 30-49 year-olds to 18-29 year-olds. This is the odds ratio.
\[ OR = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\} \]
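Applying this formula to every age coefficient gives the full set of odds ratios relative to 18-29 year-olds. The sketch below hard-codes the rounded estimates from the fitted model table, so results may differ slightly in the third decimal from values computed on the unrounded fit; the object names are made up for this example.

```r
# Estimates copied (rounded to 3 decimals) from the fitted model table
coefs <- c(
  "age_cat30-49"   = 0.126,
  "age_cat50-64"   = 0.538,
  "age_cat65+"     = 0.643,
  "age_catRefused" = 0.493
)

# Odds ratio for each group relative to the baseline (18-29)
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# The exponentiated intercept is the odds for the baseline group itself
baseline_odds <- exp(-0.248)
round(baseline_odds, 3)
```

With the fitted object available, `tidy(ai_concern_fit, exponentiate = TRUE)` should report the estimates on this exponentiated scale directly.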
Interpret in terms of percent change
You can also interpret the change in the odds as a percent change, computed as follows:
\[\% \text{ change } = (e^{\hat{\beta}_j} - 1) \times 100\]
Interpret the coefficient of age_cat30-49 (0.126) in terms of the percent change in the odds.
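A minimal sketch of the calculation, using the rounded coefficient 0.126 from the model output (the variable names are made up for this example):

```r
beta_hat   <- 0.126                      # rounded estimate for age_cat30-49
pct_change <- (exp(beta_hat) - 1) * 100  # percent change in the odds

round(pct_change, 1)
```

This works out to roughly 13.4, so the odds of being concerned are expected to be about 13.4% higher for 30-49 year-olds than for 18-29 year-olds.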
Quantitative predictor
Now let’s look at the relationship between survey_time and ai_concern
ai_time_fit <- glm(ai_concern ~ survey_time, data = pew_data,
                   family = "binomial")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.069 | 0.037 | 1.853 | 0.064 |
| survey_time | 0.005 | 0.002 | 2.434 | 0.015 |
For each additional minute of taking the survey, the odds of being concerned about increased use of AI in daily life are expected to multiply by a factor of 1.005 (\(e^{0.005}\)).
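The same logic extends to increases of more than one unit: a difference of \(k\) minutes multiplies the odds by \(e^{k\hat{\beta}}\). The sketch below uses the rounded estimate 0.005 and an arbitrary 10-minute difference chosen for illustration.

```r
beta_time <- 0.005                 # rounded estimate for survey_time
or_1_min  <- exp(beta_time)        # multiplicative change per 1 minute
or_10_min <- exp(10 * beta_time)   # multiplicative change per 10 minutes

round(c(one_minute = or_1_min, ten_minutes = or_10_min), 3)

# Multiplying the odds by or_1_min ten times gives the same factor
all.equal(or_10_min, or_1_min^10)
```

So a 10-minute difference in survey time corresponds to odds about 1.051 times as large, or roughly 5% higher.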
Multiple predictors
Now let’s consider a model that takes into account age_cat, ai_heard, and survey_time
ai_concern_full_fit <- glm(ai_concern ~ age_cat + ai_heard +
                             survey_time, data = pew_data, family = "binomial")
Multiple predictors
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.405 | 0.077 | -5.230 | 0.000 |
| age_cat30-49 | 0.117 | 0.078 | 1.504 | 0.132 |
| age_cat50-64 | 0.519 | 0.079 | 6.587 | 0.000 |
| age_cat65+ | 0.604 | 0.079 | 7.611 | 0.000 |
| age_catRefused | 0.557 | 0.325 | 1.716 | 0.086 |
| ai_heardA little | 0.371 | 0.043 | 8.654 | 0.000 |
| ai_heardNothing at all | -0.243 | 0.085 | -2.876 | 0.004 |
| ai_heardRefused | -0.571 | 0.505 | -1.131 | 0.258 |
| survey_time | -0.001 | 0.002 | -0.369 | 0.712 |
Interpretation
Use the model on the previous slide.
- Describe the type of respondent represented by the intercept.
- Interpret the effect of ai_heardNothing at all in terms of the odds of being concerned about increased use of AI in daily life.
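For the second question, a sketch of the arithmetic using the rounded estimate from the model table. The reference level of ai_heard is not named in the output; it is whichever level does not appear among the ai_heard terms, and the comparison is made holding age category and survey time constant.

```r
beta_nothing <- -0.243                  # rounded estimate for ai_heardNothing at all
odds_ratio   <- exp(beta_nothing)       # odds relative to the ai_heard reference level
pct_change   <- (odds_ratio - 1) * 100  # percent change in the odds

round(c(odds_ratio = odds_ratio, pct_change = pct_change), 3)
```

The odds of being concerned are expected to be about 0.784 times the odds for the reference group, that is, about 21.6% lower.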
Prediction
Predicted log odds
augment(ai_concern_full_fit) |> select(.fitted)
# A tibble: 5 × 1
  .fitted
    <dbl>
1 -0.0608
2  0.0756
3  0.473
4  0.560
5  0.563
For observation 1
\[\text{predicted odds} = \hat{\text{odds}} = \frac{\hat{p}}{1-\hat{p}} = e^{-0.0608} = 0.941\]
Predicted probabilities
The predicted log-odds for observation 1 is -0.0608. What is the predicted probability this respondent is concerned about increased use of AI in daily life?
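A sketch of the conversion in base R, starting from the rounded log-odds shown above (the variable names are made up for this example):

```r
log_odds <- -0.0608              # predicted log-odds for observation 1
odds     <- exp(log_odds)        # predicted odds, about 0.941
p_hat    <- odds / (1 + odds)    # predicted probability

round(p_hat, 3)

# plogis() performs the same log-odds-to-probability conversion
all.equal(p_hat, plogis(log_odds))
```

The predicted probability is about 0.485.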
Predicted probabilities
We can calculate predicted probabilities using the argument type = "response" in predict.glm().
predict.glm(ai_concern_full_fit, type = "response")
Showing the predictions for the first 10 observations:
        1         2         3         4         5         6         7         8
0.4848067 0.5188941 0.6161912 0.6364755 0.6371220 0.6366698 0.6159500 0.5257991
        9        10
0.4898898 0.6329262
Predicted probability for new observation
Recall the model that includes predictors age_cat, ai_heard, and survey_time.
- What are the predicted odds for a 70-year-old respondent who has heard nothing about AI and took 60 minutes to complete the survey?
- What is the predicted probability this respondent is not concerned about increased use of AI in daily life?
- Would you classify this person as someone who is concerned or someone who is not? Why?
Predicted probability for new observation
new_obs <- tibble(age_cat = "65+", ai_heard = "Nothing at all",
                  survey_time = 60)
predict.glm(ai_concern_full_fit, newdata = new_obs,
            type = "response")
        1
0.4780527
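The same prediction can be approximated by hand from the coefficient table. The sketch below uses the rounded estimates, so it lands near 0.474 instead of the 0.478 returned by predict.glm(), which works from the unrounded coefficients. The 0.5 cutoff used to classify the respondent is an assumption made for this example, not something fixed by the model.

```r
# Rounded estimates copied from the multiple-predictor model table
b_intercept <- -0.405
b_age_65    <-  0.604   # age_cat65+
b_nothing   <- -0.243   # ai_heardNothing at all
b_time      <- -0.001   # survey_time (per minute)

# 70-year-old (65+ category), heard nothing about AI, 60-minute survey
log_odds <- b_intercept + b_age_65 + b_nothing + b_time * 60
odds     <- exp(log_odds)

p_concerned     <- odds / (1 + odds)
p_not_concerned <- 1 - p_concerned

round(c(odds = odds, concerned = p_concerned,
        not_concerned = p_not_concerned), 3)

# Classify using an assumed threshold of 0.5
ifelse(p_concerned > 0.5, "concerned", "not concerned")
```

Under that threshold the respondent would be classified as not concerned, though the predicted probability is close to 0.5.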
Recap
- Use the odds ratio to compare the odds of two groups
- Interpret the coefficients of a logistic regression model with
  - a single categorical predictor
  - a single quantitative predictor
  - multiple predictors