In-class (35-40 pts): 75 minutes during the February 18 lecture
Take-home (10-15 pts): released after class on Tuesday
If you miss any part of the exam for an excused absence (with academic dean’s note or other official documentation), your Exam 02 score will be counted twice
Tips for studying
Review exercises in AEs and assignments, asking “why” as you review your process and reasoning
e.g., Why do we include “holding all else constant” in interpretations?
Focus on understanding, not memorization
Explain concepts / process to others
Ask questions in office hours
Review lecture recordings as needed
Topics
Model comparison AE
Inference for multiple linear regression
Computational setup
# load packages
library(tidyverse)
library(tidymodels)   # for tidy()
library(knitr)        # for kable()
library(kableExtra)
library(countdown)
library(rms)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))
Fit, evaluate, and compare candidate models. Choose a final model based on summary of cross validation results.
Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).
Adjust as needed if there is evidence of overfit.
Use model fit on training set for inference and prediction.
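The workflow above can be sketched in base R. This is an illustration on simulated data with hypothetical variable names (`x1`, `x2`, `y`), not the rail_trail analysis: split the data, fit on the training set, then do a "final" evaluation on the held-out test set and compare the two RMSEs to check for overfitting.

```r
# Sketch of the train/test workflow, on simulated data for illustration
set.seed(123)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 3 + 2 * dat$x1 - dat$x2 + rnorm(200)

# split into training (80%) and test (20%) sets
train_idx <- sample(nrow(dat), size = 0.8 * nrow(dat))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# fit the model on the training set only
fit <- lm(y ~ x1 + x2, data = train)

# evaluate on both sets; a test RMSE far above the training RMSE
# is evidence of overfitting
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_train <- rmse(train$y, predict(fit, train))
rmse_test  <- rmse(test$y,  predict(fit, test))
c(train = rmse_train, test = rmse_test)
```

Because the simulated model is simple and correctly specified, the two RMSEs should be close here; a large gap in a real analysis would prompt adjusting the model.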
Data: rail_trail
The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.
Rows: 90, Columns: 7
chr (2): season, day_type
dbl (5): volume, hightemp, avgtemp, cloudcover, precip
Response
volume: estimated number of trail users that day (number of breaks recorded)
Predictors
hightemp: daily high temperature (in degrees Fahrenheit)
avgtemp: average of daily low and daily high temperature (in degrees Fahrenheit)
season: one of "Fall", "Spring", or "Summer"
cloudcover: measure of cloud cover (in oktas)
precip: measure of precipitation (in inches)
day_type: one of "weekday" or "weekend"
Conduct a hypothesis test for \(\beta_j\)
Review: Simple linear regression (SLR)
[Scatterplot of volume vs. hightemp with a fitted least-squares line]
SLR model summary
rt_slr_fit <- lm(volume ~ hightemp, data = rail_trail)
tidy(rt_slr_fit) |>
  kable(digits = 2)
term         estimate  std.error  statistic  p.value
(Intercept)    -17.08      59.40      -0.29     0.77
hightemp         5.70       0.85       6.72     0.00
SLR hypothesis test
term         estimate  std.error  statistic  p.value
(Intercept)    -17.08      59.40      -0.29     0.77
hightemp         5.70       0.85       6.72     0.00
Set hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \ne 0\)
Calculate test statistic and p-value: The test statistic is \(t = 6.72\). The p-value is calculated using a \(t\) distribution with \(n - 2 = 88\) degrees of freedom. The p-value is \(\approx 0\).
State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature is a useful predictor of the number of daily riders, i.e., there is a linear relationship between high temperature and number of daily riders.
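The test statistic and p-value in the table come directly from the coefficient estimate and its standard error. A minimal check of this, on simulated data (hypothetical variable names; the slides use rail_trail):

```r
# Simulated SLR data with n = 90, mimicking the structure of rail_trail
set.seed(42)
d <- data.frame(hightemp = runif(90, 40, 95))
d$volume <- -17 + 5.7 * d$hightemp + rnorm(90, sd = 100)

fit <- lm(volume ~ hightemp, data = d)
est <- coef(summary(fit))["hightemp", ]

# test statistic: estimate divided by its standard error
t_stat <- est["Estimate"] / est["Std. Error"]

# two-sided p-value from a t distribution with n - 2 = 88 df
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# both match the values reported by summary()
all.equal(unname(t_stat), unname(est["t value"]))
all.equal(unname(p_val),  unname(est["Pr(>|t|)"]))
```

`df.residual(fit)` returns \(n - 2 = 88\) here, matching the degrees of freedom used in the slide.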
As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE(\hat{\beta}_j)\), the standard error of the coefficient for predictor \(x_j\). See Matrix Form of Linear Regression for more detail.
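The matrix-form calculation of \(SE(\hat{\beta}_j)\) can be verified by hand: the standard errors are the square roots of the diagonal of \(\hat{\sigma}_{\epsilon}^2 (X^\top X)^{-1}\). A sketch on simulated data (hypothetical names, not the rail_trail fit):

```r
# Simulated MLR data
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = d)

X <- model.matrix(fit)
n <- nrow(X)
p <- ncol(X) - 1   # number of predictors (excluding intercept)

# sigma_hat_epsilon: residual standard error with n - p - 1 df
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - p - 1))

# SE(beta_hat_j): sqrt of the diagonal of sigma_hat^2 * (X'X)^{-1}
se_manual <- sigma_hat * sqrt(diag(solve(t(X) %*% X)))

# matches the standard errors reported by summary()
all.equal(unname(se_manual), unname(coef(summary(fit))[, "Std. Error"]))
```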
MLR hypothesis test: hightemp
Set hypotheses: \(H_0: \beta_{hightemp} = 0\) vs. \(H_a: \beta_{hightemp} \ne 0\), given season is in the model
Calculate test statistic and p-value: The test statistic is \(t = 6.43\). The p-value is calculated using a \(t\) distribution with \(n - p - 1 = 86\) degrees of freedom. The p-value is \(\approx 0\).
State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature for the day is a useful predictor in a model that already contains the season as a predictor for number of daily riders.
Interaction terms
term                   estimate  std.error  statistic  p.value
(Intercept)              -10.53     166.80      -0.06     0.95
hightemp                   5.48       2.95       1.86     0.07
seasonSpring            -293.95     190.33      -1.54     0.13
seasonSummer             354.18     255.08       1.39     0.17
hightemp:seasonSpring      4.88       3.26       1.50     0.14
hightemp:seasonSummer     -4.54       3.75      -1.21     0.23
Do the data provide evidence of a significant interaction effect between high temperature and season? Comment on the significance of the interaction terms.
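One way to assess the interaction terms jointly, rather than one t-test at a time, is a nested F-test comparing the additive model to the interaction model. A sketch on the built-in mtcars data (illustration only; the slides use rail_trail):

```r
# Interaction between a quantitative and a categorical predictor,
# analogous to hightemp * season in the slides
mtcars$cyl_f <- factor(mtcars$cyl)

fit_main <- lm(mpg ~ hp + cyl_f, data = mtcars)  # additive model
fit_int  <- lm(mpg ~ hp * cyl_f, data = mtcars)  # adds hp:cyl_f terms

# nested F-test: H0 = all interaction coefficients equal 0
comparison <- anova(fit_main, fit_int)
comparison
```

With three `cyl` levels there are two interaction coefficients, so the F-test has 2 numerator degrees of freedom; a small p-value would indicate the slope of `hp` differs by group.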
Confidence interval for \(\beta_j\)
The \(C\%\) confidence interval for \(\beta_j\) is \[\hat{\beta}_j \pm t^* \, SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).
In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), \(y\) changes by LB to UB units, on average, holding all else constant.
We are 95% confident that for every one degree Fahrenheit increase in the day's high temperature, the number of riders increases by 5.21 to 9.87 riders, on average, holding season constant.
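The interval formula above can be checked against R's `confint()`. A sketch on simulated data (hypothetical names, not the rail_trail fit): build the interval by hand with `qt()` and confirm it matches.

```r
# Simulated SLR data with n = 90
set.seed(7)
d <- data.frame(x = rnorm(90))
d$y <- 5 + 2 * d$x + rnorm(90)

fit <- lm(y ~ x, data = d)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# critical value t* with n - p - 1 residual degrees of freedom
t_star <- qt(0.975, df = df.residual(fit))

# 95% CI: estimate +/- t* * SE
ci_manual <- c(est - t_star * se, est + t_star * se)

# matches confint()
all.equal(unname(ci_manual), unname(confint(fit, "x", level = 0.95)[1, ]))
```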
CI for seasonSpring
term          estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -125.23      71.66      -1.75     0.08   -267.68      17.22
hightemp          7.54       1.17       6.43     0.00      5.21       9.87
seasonSpring      5.13      34.32       0.15     0.88    -63.10      73.36
seasonSummer    -76.84      47.71      -1.61     0.11   -171.68      18.00
We are 95% confident that the number of riders on a Spring day is between 63.1 lower and 73.4 higher than on a Fall day, on average, holding the day's high temperature constant.
Is season a significant predictor of the number of riders, after accounting for high temperature?
Inference pitfalls
Large sample sizes
Caution
If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even if \(x_j\) has a very small effect on \(y\).
Consider the practical significance of the result, not just the statistical significance.
Use the confidence interval to draw conclusions instead of relying only on p-values.
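The large-sample caution above can be demonstrated with a quick simulation: a slope that is negligible in practical terms still produces a tiny p-value when \(n\) is huge, while the confidence interval makes the small magnitude visible.

```r
# Simulated illustration: tiny true effect, very large n
set.seed(10)
n <- 1e5
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # true slope is only 0.02

fit <- lm(y ~ x)
out <- coef(summary(fit))["x", ]

out["Pr(>|t|)"]    # small p-value despite the tiny effect
confint(fit, "x")  # interval reveals how small the effect actually is
```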
Small sample sizes
Caution
If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).
When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.
There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.
Recap
Reviewed model comparison
Introduced inference for multiple linear regression