Mar 18, 2025
HW 03 due TODAY at 11:59pm
Exploratory data analysis due March 20
Statistics experience due April 15
Logistic regression for binary response variable
Relationship between odds and probabilities
Odds ratios
Quantitative outcome variable:
Categorical outcome variable:
| Model | Number of outcomes | Example coding |
|---|---|---|
| Logistic regression | 2 | 1: Yes, 0: No |
| Multinomial logistic regression | 3+ | 1: Democrat, 2: Republican, 3: Independent |
ESPN Analytics win probability for Duke vs. Louisville (March 15, 2025)
Students in grades 9 - 12 were surveyed about health risk behaviors including whether they usually get 7 or more hours of sleep.
Sleep7: Whether the student usually gets 7 or more hours of sleep (1: yes, 0: no)
# A tibble: 446 × 2
Age Sleep7
<int> <int>
1 16 1
2 17 0
3 18 0
4 17 1
5 15 0
6 17 0
7 17 1
8 16 1
9 16 1
10 18 0
# ℹ 436 more rows
Outcome: \(Y\) = 1: yes, 0: no
Outcome: Probability of getting 7+ hours of sleep
🛑 A linear regression model produces predictions outside of 0 and 1.
✅ A logistic regression model only produces predictions between 0 and 1.
| Method | Outcome | Model |
|---|---|---|
| Linear regression | Quantitative | \(y_i = \beta_0 + \beta_1~ x_i\) |
| Linear regression (transform Y) | Quantitative | \(\log(y_i) = \beta_0 + \beta_1~ x_i\) |
| Logistic regression | Binary | \(\log\big(\frac{p_i}{1-p_i}\big) = \beta_0 + \beta_1 ~ x_i\) |
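The contrast between the first and last rows of the table can be sketched in R. This is a minimal illustration on simulated data, not the survey data from the slides: the data frame `sim`, the assumed relationship `8 - 0.5 * age`, and the prediction ages are all made up for the example.

```r
# Simulated (hypothetical) data, not the student survey shown earlier.
set.seed(210)
sim <- data.frame(age = rep(14:19, each = 50))
# Assumed relationship: the probability of 7+ hours of sleep falls with age.
sim$sleep7 <- rbinom(nrow(sim), size = 1, prob = plogis(8 - 0.5 * sim$age))

# Linear regression treats the 0/1 outcome as if it were quantitative.
linear_fit <- lm(sleep7 ~ age, data = sim)

# Logistic regression models the log odds that sleep7 = 1.
logistic_fit <- glm(sleep7 ~ age, data = sim, family = binomial)

# Predict at ages far outside the observed range to exaggerate the contrast.
new_ages <- data.frame(age = c(5, 16, 30))
linear_pred <- predict(linear_fit, newdata = new_ages)
logistic_pred <- predict(logistic_fit, newdata = new_ages, type = "response")

range(linear_pred)    # not constrained to [0, 1]
range(logistic_pred)  # always within [0, 1]
```

Using `type = "response"` asks `predict()` for probabilities rather than log odds, which is what makes the two sets of predictions comparable.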
State whether a linear regression model or logistic regression model is more appropriate for each scenario.
Use age and political party to predict if a randomly selected person will vote in the next election.
Use budget and run time (in minutes) to predict a movie’s total revenue.
Use age and sex to calculate the probability a randomly selected adult will visit Duke Health in the next year.
This data comes from the 2023 Pew Research Center’s American Trends Panel. The survey aims to capture public opinion about a variety of topics, including politics, religion, and technology. We will use data from 11,201 respondents in Wave 132 of the survey, conducted July 31 - August 6, 2023.
The goal of this analysis is to understand the relationship between age, how much someone has heard about artificial intelligence (AI), and concern about the increased use of AI in daily life.
A more complete analysis on this topic can be found in the Pew Research Center article “Growing public concern about the role of artificial intelligence in daily life” by Alec Tyson and Emma Kikuchi.
ai_concern: Whether a respondent said they are “more concerned than excited” about the increased use of AI in daily life (1: yes, 0: no)
ai_heard: Response to the question “How much have you heard or read about AI?”
age_cat: Age category
Source: Pew Research
# Change variable names and recode categories
library(dplyr)  # provides mutate(), if_else(), and case_when()

pew_data <- pew_data |>
  mutate(
    ai_concern = if_else(CNCEXC_W132 == 2, 1, 0),
    age_cat = case_when(
      F_AGECAT == 1 ~ "18-29",
      F_AGECAT == 2 ~ "30-49",
      F_AGECAT == 3 ~ "50-64",
      F_AGECAT == 4 ~ "65+",
      TRUE ~ "Refused"
    ),
    ai_heard = case_when(
      AI_HEARD_W132 == 1 ~ "A lot",
      AI_HEARD_W132 == 2 ~ "A little",
      AI_HEARD_W132 == 3 ~ "Nothing at all",
      TRUE ~ "Refused"
    )
  )

# Make factors and relevel
pew_data <- pew_data |>
  mutate(
    ai_concern = factor(ai_concern),
    age_cat = factor(age_cat),
    ai_heard = factor(ai_heard, levels = c("A lot", "A little", "Nothing at all", "Refused"))
  )

\(Y = 1: \text{ yes (success), } 0: \text{ no (failure)}\)
\(p\): probability that \(Y=1\), i.e., \(P(Y = 1)\)
\(\frac{p}{1-p}\): odds that \(Y = 1\)
\(\log\big(\frac{p}{1-p}\big)\): log odds
Go from \(p\) to \(\log\big(\frac{p}{1-p}\big)\) using the logit transformation
Suppose there is a 70% chance it will rain tomorrow
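One way to work through the rain example is to compute each quantity defined above from \(p = 0.70\). The sketch below uses base R only; the variable names are chosen for illustration.

```r
p <- 0.70                # probability of rain tomorrow

odds <- p / (1 - p)      # odds of rain: about 2.33
log_odds <- log(odds)    # log odds (logit) of rain: about 0.847

# qlogis() is base R's logit function and plogis() is its inverse,
# so these reproduce log_odds and p respectively.
qlogis(p)
plogis(log_odds)
```

An odds of about 2.33 means rain is roughly 2.33 times as likely as no rain.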
# A tibble: 2 × 3
ai_concern n p
<fct> <int> <dbl>
1 0 5245 0.468
2 1 5956 0.532
\(P(\text{Concerned about AI}) = P(Y = 1) = p = 0.532\)
\(P(\text{Not concerned about AI}) = P(Y = 0) = 1 - p = 0.468\)
\(\text{Odds of being concerned about AI} = \frac{0.532}{0.468} = 1.137\)
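The same quantities can be computed directly from the counts in the summary table. This is a small sketch in base R; the variable names are illustrative. Working from the unrounded counts gives odds of about 1.136, slightly different from the 1.137 obtained above by dividing the rounded probabilities.

```r
# Counts taken from the summary table above
n_not_concerned <- 5245
n_concerned <- 5956

# Probability of being concerned about AI
p <- n_concerned / (n_concerned + n_not_concerned)

# Two equivalent ways to get the odds
odds_from_p <- p / (1 - p)
odds_from_counts <- n_concerned / n_not_concerned

round(p, 3)                 # 0.532
round(odds_from_counts, 3)  # 1.136
```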
odds
\[\text{odds} = \frac{p}{1-p}\]
probability
\[p = \frac{\text{odds}}{1 + \text{odds}}\]
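The formula for the probability follows from solving the odds formula for \(p\). A short derivation:

\[
\begin{aligned}
\text{odds} &= \frac{p}{1-p} \\
\text{odds}\,(1 - p) &= p \\
\text{odds} &= p\,(1 + \text{odds}) \\
p &= \frac{\text{odds}}{1 + \text{odds}}
\end{aligned}
\]

As a check with the AI example, odds of 1.137 give \(p = \frac{1.137}{1 + 1.137} \approx 0.532\), which matches the probability of being concerned about AI.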
| Age | Not Concerned | Concerned |
|---|---|---|
| 18-29 | 550 | 416 |
| 30-49 | 1898 | 1681 |
| 50-64 | 1398 | 1818 |
| 65+ | 1376 | 2013 |
| Refused | 23 | 28 |
We want to compare concern about increased use of AI in daily life between individuals who are 18-29 years old and those who are 65+ years old.
We’ll use the odds to compare the two groups
\[ \text{odds} = \frac{P(\text{success})}{P(\text{failure})} = \frac{\text{# of successes}}{\text{# of failures}} \]
Odds of being concerned with increased use of AI in daily life for 18-29 year olds: \(\frac{416}{550} = 0.756\)
Odds of being concerned with increased use of AI in daily life for those who are 65+ years old: \(\frac{2013}{1376} = 1.463\)
Because the odds are higher for the 65+ group (1.463 vs. 0.756), we see that individuals 65+ years old are more likely to be concerned about the increased use of AI in daily life than 18-29 year olds.
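The odds for every age group can be computed from the table of counts in one step. The sketch below re-enters the counts by hand in base R; the data frame name `ai_by_age` and its column names are chosen for illustration and are not part of the original analysis.

```r
# Counts copied from the age-by-concern table
ai_by_age <- data.frame(
  age_cat = c("18-29", "30-49", "50-64", "65+", "Refused"),
  not_concerned = c(550, 1898, 1398, 1376, 23),
  concerned = c(416, 1681, 1818, 2013, 28)
)

# Probability of being concerned within each age group
ai_by_age$p_concerned <- ai_by_age$concerned /
  (ai_by_age$concerned + ai_by_age$not_concerned)

# Odds = number of successes / number of failures
ai_by_age$odds_concerned <- ai_by_age$concerned / ai_by_age$not_concerned

ai_by_age
```

The `odds_concerned` column equals `p_concerned / (1 - p_concerned)` in every row, which is the equality stated in the odds formula above.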
Let’s summarize the relationship between the two groups. To do so, we’ll use the odds ratio (OR).
\[ OR = \frac{\text{odds}_1}{\text{odds}_2} \]
\[OR = \frac{\text{odds}_{18-29}}{\text{odds}_{65+}} = \frac{0.756}{1.463} = \mathbf{0.517}\]
The odds an 18-29 year old is concerned about increased use of AI in daily life are 0.517 times the odds a 65+ year old is concerned.
It’s often more natural to interpret an odds ratio that is greater than 1. To get one, take the reciprocal of the odds ratio and reverse the order of the two groups in the comparison.
The odds a 65+ year old is concerned about increased use of AI in daily life are 1.934 (1/0.517) times the odds an 18-29 year old is concerned.
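Both directions of the comparison can be reproduced from the four counts involved. This is a minimal sketch in base R with illustrative variable names. The last line shows that the odds ratio can also be written as a cross-product of the counts.

```r
# Odds of being concerned in each group (concerned / not concerned)
odds_18_29 <- 416 / 550
odds_65_plus <- 2013 / 1376

# Odds ratio comparing 18-29 year olds to 65+ year olds
or_young_vs_old <- odds_18_29 / odds_65_plus   # about 0.517

# Reversing the comparison gives the reciprocal
or_old_vs_young <- 1 / or_young_vs_old         # about 1.934

# Equivalent cross-product form of the same odds ratio
cross_product <- (416 * 1376) / (550 * 2013)
```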
Introduced logistic regression for binary response variable
Showed the relationship between odds and probabilities
Introduced odds ratios