MLR: Model comparison + Inference

Author

Prof. Maria Tackett

Announcements

  • HW 02 due TODAY at 11:59pm

  • Lab 03 due Thursday, February 13 at 11:59pm

  • Project topics due Sunday, February 16 at 11:59pm

  • Statistics experience due Tuesday, April 15

Exam 01

  • 50 points total
    • in-class: 35-40 points

    • take-home: 10 - 15 points

  • In-class (35 -40 pts): 75 minutes during February 18 lecture
  • Take-home (10 -15 pts): released after class on Tuesday
  • If you miss any part of the exam for an excused absence (with academic dean’s note or other official documentation), your Exam 02 score will be counted twice

Tips for studying

  • Review exercises in AEs and assignments, asking “why” as you review your process and reasoning

    • e.g., Why do we include “holding all else constant” in interpretations?
  • Focus on understanding not memorization

  • Explain concepts / process to others

  • Ask questions in office hours

  • Review lecture recordings as needed

Topics

  • Model comparison AE
  • Inference for multiple linear regression

Computational setup

# load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
library(patchwork)
library(knitr)
library(kableExtra)

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
library(countdown)
library(rms)
Loading required package: Hmisc

Attaching package: 'Hmisc'

The following object is masked from 'package:parsnip':

    translate

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Model comparison

RMSE

\[ RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n}} \]

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • The value of RMSE is more useful for comparing across models than evaluating a single model

\(R^2\) and Adjusted \(R^2\)

\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]


. . .

\[Adj. R^2 = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]

where

  • \(n\) is the number of observations used to fit the model

  • \(p\) is the number of terms (not including the intercept) in the model

Application exercise


Click here to share your group’s response.

Inference for multiple linear regression

Modeling workflow

  • Split data into training and test sets.

  • Fit, evaluate, and compare candidate models. Choose a final model based on summary of cross validation results.

  • Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).

    • Adjust as needed if there is evidence of overfit.
  • Use model fit on training set for inference and prediction.

Data: rail_trail

  • The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
  • Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.
Rows: 90 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): season, day_type
dbl (5): volume, hightemp, avgtemp, cloudcover, precip

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 8 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
1    501       83    66.5 Summer       7.60 0      Weekday 
2    419       73    61   Summer       6.30 0.290  Weekday 
3    397       74    63   Spring       7.5  0.320  Weekday 
4    385       95    78   Summer       2.60 0      Weekend 
5    200       44    48   Spring      10    0.140  Weekday 
6    375       69    61.5 Spring       6.60 0.0200 Weekday 
7    417       66    52.5 Spring       2.40 0      Weekday 
8    629       66    52   Spring       0    0      Weekend 

Source: Pioneer Valley Planning Commission via the mosaicData package.

Variables

Response:

volume estimated number of trail users that day (number of breaks recorded)

. . .

Predictors

  • hightemp daily high temperature (in degrees Fahrenheit)
  • avgtemp average of daily low and daily high temperature (in degrees Fahrenheit)
  • season one of “Fall”, “Spring”, or “Summer”
  • cloudcover measure of cloud cover (in oktas)
  • precip measure of precipitation (in inches)
  • day_type one of “weekday” or “weekend”

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

`geom_smooth()` using formula = 'y ~ x'

SLR model summary

rt_slr_fit <- lm(volume ~ hightemp, data = rail_trail)

tidy(rt_slr_fit) |> kable(digits = 2)
term estimate std.error statistic p.value
(Intercept) -17.08 59.40 -0.29 0.77
hightemp 5.70 0.85 6.72 0.00

SLR hypothesis test

term estimate std.error statistic p.value
(Intercept) -17.08 59.40 -0.29 0.77
hightemp 5.70 0.85 6.72 0.00
  1. Set hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \ne 0\)

. . .

  1. Calculate test statistic and p-value: The test statistic is \(t= 6.72\) . The p-value is calculated using a \(t\) distribution with 88 degrees of freedom. The p-value is \(\approx 0\) .

. . .

  1. State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature is a helpful predictor for the number of daily riders, i.e. there is a linear relationship between high temperature and number of daily riders.

Multiple linear regression

rt_mlr_main_fit <- lm(volume ~ hightemp + season, data = rail_trail)

tidy(rt_mlr_main_fit) |> kable(digits = 2)
term estimate std.error statistic p.value
(Intercept) -125.23 71.66 -1.75 0.08
hightemp 7.54 1.17 6.43 0.00
seasonSpring 5.13 34.32 0.15 0.88
seasonSummer -76.84 47.71 -1.61 0.11

Multiple linear regression

The multiple linear regression model assumes \[Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)\]


. . .

For a given observation \((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\), we can rewrite the previous statement as

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma_{\epsilon}^2)\]


Estimating \(\sigma_\epsilon\)

For a given observation \((x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)\) the residual is \[e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})\]


. . .

The estimated value of the regression standard error , \(\sigma_{\epsilon}\), is

\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n-p-1}}\]

. . .

As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE(\hat{\beta}_j)\), the standard error of the coefficient for predictor \(x_j\). See Matrix Form of Linear Regression for more detail.

MLR hypothesis test: hightemp

  1. Set hypotheses: \(H_0: \beta_{hightemp} = 0\) vs. \(H_a: \beta_{hightemp} \ne 0\), given season is in the model

. . .

  1. Calculate test statistic and p-value: The test statistic is \(t = 6.43\). The p-value is calculated using a \(t\) distribution with 86 \((n - p - 1)\) degrees of freedom. The p-value is \(\approx 0\).

. . .

  1. State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature for the day is a useful predictor in a model that already contains the season as a predictor for number of daily riders.

Interaction terms

term estimate std.error statistic p.value
(Intercept) -10.53 166.80 -0.06 0.95
hightemp 5.48 2.95 1.86 0.07
seasonSpring -293.95 190.33 -1.54 0.13
seasonSummer 354.18 255.08 1.39 0.17
hightemp:seasonSpring 4.88 3.26 1.50 0.14
hightemp:seasonSummer -4.54 3.75 -1.21 0.23


Do the data provide evidence of a significant interaction effect? Comment on the significance of the interaction terms.

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

  • The \(C\%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
  • Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).
  • In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), \(y\) changes by LB to UB units, on average, holding all else constant.

Confidence interval for \(\beta_j\)

tidy(rt_mlr_main_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits= 2)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -125.23 71.66 -1.75 0.08 -267.68 17.22
hightemp 7.54 1.17 6.43 0.00 5.21 9.87
seasonSpring 5.13 34.32 0.15 0.88 -63.10 73.36
seasonSummer -76.84 47.71 -1.61 0.11 -171.68 18.00

CI for hightemp

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -125.23 71.66 -1.75 0.08 -267.68 17.22
hightemp 7.54 1.17 6.43 0.00 5.21 9.87
seasonSpring 5.13 34.32 0.15 0.88 -63.10 73.36
seasonSummer -76.84 47.71 -1.61 0.11 -171.68 18.00


We are 95% confident that for every degree Fahrenheit the day is warmer, the number of riders increases by 5.21 to 9.87, on average, holding season constant.

CI for seasonSpring

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -125.23 71.66 -1.75 0.08 -267.68 17.22
hightemp 7.54 1.17 6.43 0.00 5.21 9.87
seasonSpring 5.13 34.32 0.15 0.88 -63.10 73.36
seasonSummer -76.84 47.71 -1.61 0.11 -171.68 18.00


We are 95% confident that the number of riders on a Spring day is lower by 63.1 to higher by 73.4 compared to a Fall day, on average, holding high temperature for the day constant.

. . .

Is season a significant predictor of the number of riders, after accounting for high temperature?

Inference pitfalls

Large sample sizes


Caution

If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even \(x_j\) has a very small effect on \(y\).

  • Consider the practical significance of the result not just the statistical significance.

  • Use the confidence interval to draw conclusions instead of relying only p-values.

Small sample sizes


Caution

If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).

  • When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.

  • There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

Recap

  • Reviewed model comparison
  • Introduced inference for multiple linear regression

Next class

  • Exam 01 review