MLR: Model comparison + Inference

Author

Prof. Maria Tackett

Announcements

HW 02 due TODAY at 11:59pm
Lab 03 due Thursday, February 13 at 11:59pm
Project topics due Sunday, February 16 at 11:59pm
Statistics experience due Tuesday, April 15

Exam 01

50 points total
- in-class: 35-40 points
- take-home: 10 - 15 points
In-class (35 -40 pts): 75 minutes during February 18 lecture
Take-home (10 -15 pts): released after class on Tuesday
If you miss any part of the exam for an excused absence (with academic dean’s note or other official documentation), your Exam 02 score will be counted twice

Tips for studying

Review exercises in AEs and assignments, asking “why” as you review your process and reasoning
- e.g., Why do we include “holding all else constant” in interpretations?
Focus on understanding not memorization
Explain concepts / process to others
Ask questions in office hours
Review lecture recordings as needed

Topics

Model comparison AE
Inference for multiple linear regression

Computational setup

# load packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.

library(patchwork)
library(knitr)
library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

library(countdown)
library(rms)

Loading required package: Hmisc

Attaching package: 'Hmisc'

The following object is masked from 'package:parsnip':

    translate

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Model comparison

RMSE

\[ RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n}} \]

Ranges between 0 (perfect predictor) and infinity (terrible predictor)
Same units as the response variable
The value of RMSE is more useful for comparing across models than evaluating a single model

\(R^2\) and Adjusted \(R^2\)

\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]

. . .

\[Adj. R^2 = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]

where

\(n\) is the number of observations used to fit the model
\(p\) is the number of terms (not including the intercept) in the model

Application exercise

📋 sta210-sp25.netlify.app/ae/ae-06-model-compare

Click here to share your group’s response.

Inference for multiple linear regression

Modeling workflow

Split data into training and test sets.
Fit, evaluate, and compare candidate models. Choose a final model based on summary of cross validation results.
Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).
- Adjust as needed if there is evidence of overfit.
Use model fit on training set for inference and prediction.

Data: `rail_trail`

The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.

Rows: 90 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): season, day_type
dbl (5): volume, hightemp, avgtemp, cloudcover, precip

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 8 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
1    501       83    66.5 Summer       7.60 0      Weekday 
2    419       73    61   Summer       6.30 0.290  Weekday 
3    397       74    63   Spring       7.5  0.320  Weekday 
4    385       95    78   Summer       2.60 0      Weekend 
5    200       44    48   Spring      10    0.140  Weekday 
6    375       69    61.5 Spring       6.60 0.0200 Weekday 
7    417       66    52.5 Spring       2.40 0      Weekday 
8    629       66    52   Spring       0    0      Weekend

Source: Pioneer Valley Planning Commission via the mosaicData package.

Variables

Response:

volume estimated number of trail users that day (number of breaks recorded)

. . .

Predictors

hightemp daily high temperature (in degrees Fahrenheit)
avgtemp average of daily low and daily high temperature (in degrees Fahrenheit)
season one of “Fall”, “Spring”, or “Summer”
cloudcover measure of cloud cover (in oktas)
precip measure of precipitation (in inches)
day_type one of “weekday” or “weekend”

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

`geom_smooth()` using formula = 'y ~ x'

SLR model summary

rt_slr_fit <- lm(volume ~ hightemp, data = rail_trail)

tidy(rt_slr_fit) |> kable(digits = 2)

term	estimate	std.error	statistic	p.value
(Intercept)	-17.08	59.40	-0.29	0.77
hightemp	5.70	0.85	6.72	0.00

SLR hypothesis test

term	estimate	std.error	statistic	p.value
(Intercept)	-17.08	59.40	-0.29	0.77
hightemp	5.70	0.85	6.72	0.00

Set hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \ne 0\)

. . .

Calculate test statistic and p-value: The test statistic is \(t= 6.72\) . The p-value is calculated using a \(t\) distribution with 88 degrees of freedom. The p-value is \(\approx 0\) .

. . .

State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature is a helpful predictor for the number of daily riders, i.e. there is a linear relationship between high temperature and number of daily riders.

Multiple linear regression

rt_mlr_main_fit <- lm(volume ~ hightemp + season, data = rail_trail)

tidy(rt_mlr_main_fit) |> kable(digits = 2)

term	estimate	std.error	statistic	p.value
(Intercept)	-125.23	71.66	-1.75	0.08
hightemp	7.54	1.17	6.43	0.00
seasonSpring	5.13	34.32	0.15	0.88
seasonSummer	-76.84	47.71	-1.61	0.11

Multiple linear regression

The multiple linear regression model assumes \[Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)\]

. . .

For a given observation \((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\), we can rewrite the previous statement as

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma_{\epsilon}^2)\]

Estimating \(\sigma_\epsilon\)

For a given observation \((x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)\) the residual is \[e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})\]

. . .

The estimated value of the regression standard error , \(\sigma_{\epsilon}\), is

\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n-p-1}}\]

. . .

As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE(\hat{\beta}_j)\), the standard error of the coefficient for predictor \(x_j\). See Matrix Form of Linear Regression for more detail.

MLR hypothesis test: hightemp

Set hypotheses: \(H_0: \beta_{hightemp} = 0\) vs. \(H_a: \beta_{hightemp} \ne 0\), given season is in the model

. . .

Calculate test statistic and p-value: The test statistic is \(t = 6.43\). The p-value is calculated using a \(t\) distribution with 86 \((n - p - 1)\) degrees of freedom. The p-value is \(\approx 0\).

. . .

State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature for the day is a useful predictor in a model that already contains the season as a predictor for number of daily riders.

Interaction terms

term	estimate	std.error	statistic	p.value
(Intercept)	-10.53	166.80	-0.06	0.95
hightemp	5.48	2.95	1.86	0.07
seasonSpring	-293.95	190.33	-1.54	0.13
seasonSummer	354.18	255.08	1.39	0.17
hightemp:seasonSpring	4.88	3.26	1.50	0.14
hightemp:seasonSummer	-4.54	3.75	-1.21	0.23

Do the data provide evidence of a significant interaction effect? Comment on the significance of the interaction terms.

Confidence interval for \(\beta_j\)

The \(C\%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).
In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), \(y\) changes by LB to UB units, on average, holding all else constant.

Confidence interval for \(\beta_j\)

tidy(rt_mlr_main_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits= 2)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-125.23	71.66	-1.75	0.08	-267.68	17.22
hightemp	7.54	1.17	6.43	0.00	5.21	9.87
seasonSpring	5.13	34.32	0.15	0.88	-63.10	73.36
seasonSummer	-76.84	47.71	-1.61	0.11	-171.68	18.00

CI for `hightemp`

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-125.23	71.66	-1.75	0.08	-267.68	17.22
hightemp	7.54	1.17	6.43	0.00	5.21	9.87
seasonSpring	5.13	34.32	0.15	0.88	-63.10	73.36
seasonSummer	-76.84	47.71	-1.61	0.11	-171.68	18.00

We are 95% confident that for every degree Fahrenheit the day is warmer, the number of riders increases by 5.21 to 9.87, on average, holding season constant.

CI for `seasonSpring`

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-125.23	71.66	-1.75	0.08	-267.68	17.22
hightemp	7.54	1.17	6.43	0.00	5.21	9.87
seasonSpring	5.13	34.32	0.15	0.88	-63.10	73.36
seasonSummer	-76.84	47.71	-1.61	0.11	-171.68	18.00

We are 95% confident that the number of riders on a Spring day is lower by 63.1 to higher by 73.4 compared to a Fall day, on average, holding high temperature for the day constant.

. . .

Is season a significant predictor of the number of riders, after accounting for high temperature?

Inference pitfalls

Large sample sizes

Caution

If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even \(x_j\) has a very small effect on \(y\).

Consider the practical significance of the result not just the statistical significance.
Use the confidence interval to draw conclusions instead of relying only p-values.

Small sample sizes

Caution

If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).

When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.
There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

Recap

Reviewed model comparison
Introduced inference for multiple linear regression

Next class

Exam 01 review

Announcements

Exam 01

Tips for studying

Topics

Computational setup

Model comparison

RMSE

\(R^2\) and Adjusted \(R^2\)

Application exercise

Inference for multiple linear regression

Modeling workflow

Data: rail_trail

Variables

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

SLR model summary

SLR hypothesis test

Multiple linear regression

Multiple linear regression

Estimating \(\sigma_\epsilon\)

MLR hypothesis test: hightemp

Interaction terms

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

CI for hightemp

CI for seasonSpring

Inference pitfalls

Large sample sizes

Small sample sizes

Recap

Next class

Data: `rail_trail`

CI for `hightemp`

CI for `seasonSpring`