Apr 03, 2025
Next project milestone: Analysis and draft in the April 14 lab
Team Feedback (email from TEAMMATES) due Tuesday, April 8 at 11:59pm (check email)
HW 04 due Tuesday, April 8 at 11:59pm
Statistics experience due April 15
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.
high_risk:
age: Age at exam time (in years)
education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College
currentSmoker: 0 = nonsmoker, 1 = smoker
totChol: Total cholesterol (in mg/dL)
# A tibble: 4,086 × 6
age education TenYearCHD totChol currentSmoker high_risk
<dbl> <fct> <dbl> <dbl> <fct> <fct>
1 39 4 0 195 0 0
2 46 2 0 250 0 0
3 48 1 0 245 1 0
4 61 3 1 225 1 1
5 46 3 0 285 1 0
6 43 2 0 228 0 0
7 63 1 1 205 0 1
8 45 2 0 313 1 0
9 52 1 0 260 0 0
10 43 1 0 225 1 0
# ℹ 4,076 more rows
There are two approaches for testing coefficients in logistic regression:
Drop-in-deviance test. Use to test…
(Wald) hypothesis test. Use to test
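A minimal sketch of how both tests might be run in R. It uses simulated data rather than the Framingham data; the data frame `sim`, the coefficients used to generate it, and the model names `reduced_fit` and `full_fit` are assumptions made for illustration. Only base R functions (`glm()`, `anova()`, `summary()`) are used.

```r
# Simulated stand-in data (hypothetical; NOT the Framingham data in these notes)
set.seed(210)
n <- 500
sim <- data.frame(
  age = round(runif(n, 35, 70)),
  totChol = round(rnorm(n, mean = 235, sd = 45)),
  currentSmoker = factor(rbinom(n, 1, 0.5))
)

# Hypothetical coefficients, chosen only to generate plausible outcomes
log_odds <- -6.5 + 0.08 * sim$age + 0.002 * sim$totChol +
  0.45 * (sim$currentSmoker == "1")
sim$high_risk <- factor(rbinom(n, 1, plogis(log_odds)))

# Two nested logistic regression models
reduced_fit <- glm(high_risk ~ age + totChol,
                   data = sim, family = binomial)
full_fit <- glm(high_risk ~ age + totChol + currentSmoker,
                data = sim, family = binomial)

# Drop-in-deviance test: compares the residual deviance of the nested models
drop_in_dev <- anova(reduced_fit, full_fit, test = "Chisq")
drop_in_dev

# Wald test: z value = Estimate / Std. Error for each coefficient
wald <- summary(full_fit)$coefficients
wald
```

In the `anova()` output, the `Deviance` column holds the drop in deviance between the two models and `Pr(>Chi)` holds its p-value. In the `summary()` coefficient table, each row reports the Wald z statistic and p-value for one term.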
currentSmoker

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.673 | 0.378 | -17.647 | 0.000 | -7.423 | -5.940 |
| age | 0.082 | 0.006 | 14.344 | 0.000 | 0.071 | 0.094 |
| totChol | 0.002 | 0.001 | 1.940 | 0.052 | 0.000 | 0.004 |
| currentSmoker1 | 0.443 | 0.094 | 4.733 | 0.000 | 0.260 | 0.627 |
Interpret the value for currentSmoker in each column of the model output.
The 95% confidence interval for currentSmoker is [0.260, 0.627]. Interpret this interval in the context of the data.
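As a starting point for these interpretations, the arithmetic behind the `currentSmoker1` row can be sketched as follows. The inputs are the rounded values from the model output, so the results only approximately reproduce the reported statistic; the variable names are chosen for illustration.

```r
# Rounded values copied from the currentSmoker1 row of the model output
estimate  <- 0.443
std_error <- 0.094
conf_int  <- c(lower = 0.260, upper = 0.627)

# Wald statistic and two-sided p-value from the standard normal distribution.
# z differs slightly from the reported 4.733 because the inputs are rounded.
z <- estimate / std_error
p_value <- 2 * pnorm(-abs(z))

# Exponentiate to move from the log-odds scale to the odds scale
odds_ratio <- exp(estimate)
odds_ratio_ci <- exp(conf_int)

round(c(z = z, odds_ratio = odds_ratio), 3)
round(odds_ratio_ci, 3)
```

On the odds scale, the estimate is about 1.56 and the interval runs from roughly 1.30 to 1.87. These are multiplicative changes in the odds of being high risk for smokers compared to nonsmokers, holding age and total cholesterol constant.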
Linearity: The log-odds have a linear relationship with the predictors.
Randomness: The data were obtained from a random process.
Independence: The observations are independent from one another.
The empirical logit is the log of the observed odds:
\[ \text{logit}(\hat{p}) = \log\Big(\frac{\hat{p}}{1 - \hat{p}}\Big) = \log\Big(\frac{\# \text{Yes}}{\# \text{No}}\Big) \]
If the predictor is categorical, we can calculate the empirical logit for each level of the predictor.
heart_disease |>
count(currentSmoker, high_risk) |>
group_by(currentSmoker) |>
mutate(prop = n/sum(n)) |>
filter(high_risk == "1") |>
mutate(emp_logit = log(prop/(1-prop)))
# A tibble: 2 × 5
# Groups: currentSmoker [2]
currentSmoker high_risk n prop emp_logit
<fct> <fct> <int> <dbl> <dbl>
1 0 1 301 0.145 -1.77
2 1 1 318 0.158 -1.67
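The `emp_logit` column can be checked by hand from the `prop` column. The short sketch below uses the rounded proportions printed in the output; the vector name `prop` and its element names are chosen for illustration.

```r
# Rounded proportions of high-risk participants from the output
prop <- c(nonsmoker = 0.145, smoker = 0.158)

# Empirical logit: log of the observed odds
emp_logit <- log(prop / (1 - prop))
round(emp_logit, 2)

# qlogis() is the built-in logit function and gives the same values
all.equal(emp_logit, qlogis(prop))
```

The rounded results, -1.77 for nonsmokers and -1.67 for smokers, match the `emp_logit` column.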
Divide the range of the predictor into intervals with an approximately equal number of cases. (If you have enough observations, use 5 to 10 intervals.)
Compute the empirical logit for each interval.
You can then calculate the mean value of the predictor in each interval and create a plot of the empirical logit versus the mean value of the predictor in each interval.
Created using dplyr and ggplot functions.
heart_disease |>
# cut_interval() makes 10 equal-width bins; cut_number() would give bins with roughly equal counts
mutate(age_bin = cut_interval(age, n = 10)) |>
group_by(age_bin) |>
mutate(mean_age = mean(age)) |>
count(mean_age, high_risk) |>
mutate(prop = n/sum(n)) |>
filter(high_risk == "1") |>
mutate(emp_logit = log(prop/(1-prop))) |>
ggplot(aes(x = mean_age, y = emp_logit)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Mean Age",
y = "Empirical logit")
Using the emplogitplot1 function from the Stat2Data R package
Using the emplogitplot2 function from the Stat2Data R package
✅ The linearity condition is satisfied. There is a linear relationship between the empirical logit and the predictor variables.
We can check the randomness condition based on the context of the data and how the observations were collected.
✅ The randomness condition is satisfied. We do not have reason to believe that the participants in this study differ systematically from adults in the U.S. with regard to health characteristics and risk of heart disease.
✅ The independence condition is satisfied. It is reasonable to conclude that the participants’ health characteristics are independent of one another.