Mar 20, 2025
Exploratory data analysis due TODAY at 11:59pm
Statistics experience due April 15
Odds and probabilities
Interpret the coefficients of a logistic regression model with
Scenario 1: Suppose the probability of a disease among a population of unvaccinated individuals is 0.00369, and the probability of the disease is 0.001 among a population of vaccinated individuals.
Scenario 2: Suppose the probability of a disease among a population of unvaccinated individuals is 0.48052 and the probability of the disease is 0.2 among a population of vaccinated individuals.
What is the difference in the probability of disease for these two populations?
What are the odds of disease in the population without the vaccine relative to the odds of disease in the population with the vaccine?
odds
\[\text{odds} = \frac{p}{1-p}\]
probability
\[p = \frac{\text{odds}}{1 + \text{odds}}\]
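As a rough check on the two scenarios above, the conversion from probability to odds can be sketched in R. The helper name `prob_to_odds` and the data frame layout are illustrative choices, not part of the course materials.

```r
# Hypothetical helper: convert a probability to odds, odds = p / (1 - p)
prob_to_odds <- function(p) p / (1 - p)

# Probabilities of disease from the two scenarios
scenarios <- data.frame(
  scenario = c(1, 2),
  p_unvacc = c(0.00369, 0.48052),
  p_vacc   = c(0.001, 0.2)
)

# Difference in probabilities (unvaccinated minus vaccinated)
scenarios$prob_diff <- scenarios$p_unvacc - scenarios$p_vacc

# Odds ratio: odds without the vaccine relative to odds with the vaccine
scenarios$odds_ratio <-
  prob_to_odds(scenarios$p_unvacc) / prob_to_odds(scenarios$p_vacc)

print(scenarios)
```

The differences in probability are far apart (about 0.003 versus about 0.28), while the odds ratio is about 3.7 in both scenarios.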
(1) Logistic model: \(\log\big(\frac{p}{1-p}\big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\)
(2) Odds = \(\exp\big\{\log\big(\frac{p}{1-p}\big)\big\} = \frac{p}{1-p}\)
Combining (1) and (2) with what we saw earlier
\[p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}\]
Logit form: \[\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\]
Probability form:
\[ \text{probability} = p = \frac{\exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \dots + \beta_pX_p\}} \]
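The logit and probability forms are inverses of each other. A minimal sketch in R, where `logit` and `inv_logit` are illustrative helper names; base R's `qlogis()` and `plogis()` compute the same quantities.

```r
# Logit: probability -> log-odds
logit <- function(p) log(p / (1 - p))

# Inverse logit: log-odds -> probability
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

p <- c(0.001, 0.2, 0.5, 0.9)
eta <- logit(p)

# The round trip returns the original probabilities
all.equal(inv_logit(eta), p)

# Base R equivalents from the stats package
all.equal(qlogis(p), eta)
all.equal(plogis(eta), p)
```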
Why is there no error term \(\epsilon\) when writing the statistical model for logistic regression?
This data comes from the 2023 Pew Research Center’s American Trends Panel. The survey aims to capture public opinion about a variety of topics including politics, religion, and technology, among others. We will use data from respondents in Wave 132 of the survey conducted July 31 - August 6, 2023 who completed the survey in 70 minutes or less.
The goal of this analysis is to understand the relationship between age, how much someone has heard about artificial intelligence (AI), and concern about the increased use of AI in daily life.
A more complete analysis on this topic can be found in the Pew Research Center article Growing public concern about the role of artificial intelligence in daily life by Alec Tyson and Emma Kikuchi.
ai_concern: Whether a respondent said they are “more concerned than excited” about the increased use of AI in daily life (1: yes, 0: no)
survey_time: Time to complete the survey (in minutes)
age_cat: Age category
| Age | Not Concerned | Concerned |
|---|---|---|
| 18-29 | 487 | 380 |
| 30-49 | 1661 | 1470 |
| 50-64 | 1257 | 1680 |
| 65+ | 1252 | 1858 |
| Refused | 18 | 23 |
library(broom)  # tidy()
library(knitr)  # kable()

ai_concern_fit <- glm(ai_concern ~ age_cat, data = pew_data,
                      family = "binomial")
tidy(ai_concern_fit) |> kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.248 | 0.068 | -3.625 | 0.000 |
| age_cat30-49 | 0.126 | 0.077 | 1.630 | 0.103 |
| age_cat50-64 | 0.538 | 0.078 | 6.904 | 0.000 |
| age_cat65+ | 0.643 | 0.078 | 8.284 | 0.000 |
| age_catRefused | 0.493 | 0.322 | 1.531 | 0.126 |
\[\begin{aligned}\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) =& -0.248 + 0.126\times\text{age_cat30-49} + 0.538 \times \text{age_cat50-64}\\
&+ 0.643 \times \text{age_cat65+} + 0.493\times \text{age_catRefused} \end{aligned}\]
where \(\hat{p}\) is the predicted probability of being concerned about increased use of AI in daily life
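Because age_cat is the only predictor, the fitted log-odds for each group equal the sample log-odds in that group, so the estimates can be recovered from the table of counts. The sketch below assumes the counts table shown earlier is the data used to fit the model; the object names are illustrative.

```r
# Counts from the table of age by concern
counts <- data.frame(
  age_cat       = c("18-29", "30-49", "50-64", "65+", "Refused"),
  not_concerned = c(487, 1661, 1257, 1252, 18),
  concerned     = c(380, 1470, 1680, 1858, 23)
)

# Sample log-odds of being concerned within each age group
log_odds <- log(counts$concerned / counts$not_concerned)
names(log_odds) <- counts$age_cat

# Intercept: log-odds for the baseline group (18-29)
intercept <- log_odds["18-29"]

# Other coefficients: difference in log-odds relative to the baseline
coefs <- c(intercept, log_odds[-1] - intercept)
names(coefs)[1] <- "(Intercept)"

round(coefs, 3)
```

The rounded values agree with the estimate column of the model output.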
age_cat30-49: log-odds
The log-odds of being concerned about increased use of AI in daily life are expected to be 0.126 higher for individuals 30 - 49 years old compared to 18-29 year-olds (the baseline group).
Warning
We would not use the interpretation in terms of log-odds in practice.
age_cat30-49: odds
The odds of being concerned about increased use of AI in daily life for 30 - 49 year olds are expected to be 1.134 ( \(e^{0.126}\) ) times the odds for 18-29 year olds.
The model coefficient, 0.126, is the expected difference in the log-odds when comparing 30 - 49 year olds to 18 - 29 year olds.
Therefore, \(e^{0.126}\) = 1.134 is the factor by which the odds are expected to multiply when comparing 30 - 49 year olds to 18-29 year olds.
\[ OR = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\} \]
You can also interpret the change in the odds in terms of a percent change. The percent change in the odds can be computed as the following
\[\% \text{ change } = (e^{\hat{\beta}_j} - 1) \times 100\]
Interpret the coefficient of age_cat30-49 (0.126) in terms of the percent change in the odds.
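One way to check an answer is to apply the percent-change formula directly. This is a small sketch using the rounded estimate from the table; the variable names are illustrative.

```r
# Rounded coefficient for age_cat30-49 from the model output
beta_hat <- 0.126

# Odds ratio and percent change in the odds
odds_ratio <- exp(beta_hat)
pct_change <- (odds_ratio - 1) * 100

round(c(odds_ratio = odds_ratio, pct_change = pct_change), 3)
```

The odds of being concerned are expected to be about 13.4% higher for 30 - 49 year olds than for 18-29 year olds.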
Now let’s look at the relationship between survey_time and ai_concern
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.069 | 0.037 | 1.853 | 0.064 |
| survey_time | 0.005 | 0.002 | 2.434 | 0.015 |
For each additional minute of taking the survey, the odds of being concerned about increased AI in daily life are expected to multiply by a factor of 1.005 ( \(e^{0.005}\)).
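The same idea extends to differences of more than one minute: a difference of k minutes multiplies the odds by \(e^{k \times 0.005}\). A brief sketch, using the rounded estimate from the table and some arbitrary example values of k:

```r
# Rounded coefficient for survey_time from the model output
beta_time <- 0.005

# Example differences in survey time (minutes); values chosen for illustration
minutes <- c(1, 10, 30)

# Multiplicative change in the odds for each difference
odds_multiplier <- exp(beta_time * minutes)

data.frame(minutes, odds_multiplier = round(odds_multiplier, 3))
```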
Now let’s consider a model that takes into account age, ai_heard, and survey_time
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.405 | 0.077 | -5.230 | 0.000 |
| age_cat30-49 | 0.117 | 0.078 | 1.504 | 0.132 |
| age_cat50-64 | 0.519 | 0.079 | 6.587 | 0.000 |
| age_cat65+ | 0.604 | 0.079 | 7.611 | 0.000 |
| age_catRefused | 0.557 | 0.325 | 1.716 | 0.086 |
| ai_heardA little | 0.371 | 0.043 | 8.654 | 0.000 |
| ai_heardNothing at all | -0.243 | 0.085 | -2.876 | 0.004 |
| ai_heardRefused | -0.571 | 0.505 | -1.131 | 0.258 |
| survey_time | -0.001 | 0.002 | -0.369 | 0.712 |
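To read this table on the odds scale, each estimate can be exponentiated. The sketch below types in the rounded estimates from the table, so the results are approximate. With a fitted model object, `tidy()` from broom also accepts `exponentiate = TRUE`.

```r
# Rounded estimates copied from the model output above
estimates <- c(
  "age_cat30-49"           =  0.117,
  "age_cat50-64"           =  0.519,
  "age_cat65+"             =  0.604,
  "age_catRefused"         =  0.557,
  "ai_heardA little"       =  0.371,
  "ai_heardNothing at all" = -0.243,
  "ai_heardRefused"        = -0.571,
  "survey_time"            = -0.001
)

# Odds ratios, each holding the other predictors constant
odds_ratios <- exp(estimates)

round(odds_ratios, 3)
```

Each value is the expected multiplicative change in the odds relative to the baseline category (or per additional minute for survey_time), holding the other predictors constant.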
Use the model on the previous slide.
Interpret the coefficient of ai_heardNothing at all in terms of the odds of being concerned about increased use of AI in daily life.
Predicted log-odds for the first 5 observations:
# A tibble: 5 × 1
  .fitted
    <dbl>
1 -0.0608
2  0.0756
3  0.473
4  0.560
5  0.563
For observation 1
\[\text{predicted odds} = \hat{\text{odds}} = \frac{\hat{p}}{1-\hat{p}} = e^{-0.0608} = 0.941\]
The predicted log-odds for observation 1 is -0.0608. What is the predicted probability this respondent is concerned about increased use of AI in daily life?
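A short sketch of the calculation, going from log-odds to odds to probability with the formulas from earlier in these notes (variable names are illustrative):

```r
# Predicted log-odds for observation 1
log_odds_1 <- -0.0608

# Log-odds -> odds -> probability
odds_1 <- exp(log_odds_1)
p_1 <- odds_1 / (1 + odds_1)

round(c(odds = odds_1, probability = p_1), 3)

# plogis() performs the same conversion in one step
all.equal(p_1, plogis(log_odds_1))
```

The predicted probability is about 0.485.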
We can calculate predicted probabilities using the argument type = "response" in predict.glm().
Showing the predictions for the first 10 observations
1 2 3 4 5 6 7 8
0.4848067 0.5188941 0.6161912 0.6364755 0.6371220 0.6366698 0.6159500 0.5257991
9 10
0.4898898 0.6329262
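The relationship between the two prediction types can be sketched on a small self-contained example. The Pew data are not bundled with R, so this uses the built-in mtcars data set purely for illustration; the model here is an assumption for the sketch and is unrelated to the AI concern analysis.

```r
# Illustrative logistic regression on a built-in data set (not the Pew data)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Predicted log-odds (the default scale for predict.glm)
log_odds <- predict(fit, type = "link")

# Predicted probabilities
probs <- predict(fit, type = "response")

# The probabilities are the inverse logit of the log-odds
all.equal(probs, plogis(log_odds))

head(probs, 10)
```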
Recall the model that includes predictors age_cat, ai_heard, and survey_time.
Use the odds ratio to compare the odds of two groups
Interpret the coefficients of a logistic regression model with