Multiple linear regression (MLR)

Prof. Maria Tackett

Jan 30, 2025

Announcements

  • Lab 01 due TODAY at 11:59pm
  • Team labs start on Monday
  • Click here to learn more about the Academic Resource Center
  • Statistics experience due Tuesday, April 15

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(colorblindr)
library(palmerpenguins)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Considering multiple variables

Data: Palmer penguins

The penguins data set contains data for penguins found on three islands in the Palmer Archipelago, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. These data can be found in the palmerpenguins R package.

# A tibble: 342 × 4
   body_mass_g flipper_length_mm bill_length_mm species
         <int>             <int>          <dbl> <fct>  
 1        3750               181           39.1 Adelie 
 2        3800               186           39.5 Adelie 
 3        3250               195           40.3 Adelie 
 4        3450               193           36.7 Adelie 
 5        3650               190           39.3 Adelie 
 6        3625               181           38.9 Adelie 
 7        4675               195           39.2 Adelie 
 8        3475               193           34.1 Adelie 
 9        4250               190           42   Adelie 
10        3300               186           37.8 Adelie 
# ℹ 332 more rows

Variables

Predictors:

  • bill_length_mm: Bill length in millimeters
  • flipper_length_mm: Flipper length in millimeters
  • species: Adelie, Gentoo, or Chinstrap species

Response: body_mass_g: Body mass in grams


The goal of this analysis is to use the bill length, flipper length, and species to predict body mass.

Response: body_mass_g

min median max iqr
2700 4050 6300 1200

Predictors

Response vs. predictors

Why do we want to use a single model with all the predictors instead of 3 separate models?

Multiple linear regression

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

\[ \begin{aligned}\widehat{\text{body_mass_g}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \times \text{flipper_length_mm} \\ & + \hat{\beta}_2 \times \text{species}_1 \\ &+\hat{\beta}_3 \times \text{species}_2 \\ &+ \hat{\beta}_4 \times \text{bill_length_mm} \end{aligned} \]

Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values body_mass_g follow a Normal distribution.

Multiple linear regression

Recall: The simple linear regression model assumes

\[ Y|X\sim N(\beta_0 + \beta_1 X, \sigma_{\epsilon}^2) \]

Similarly: The multiple linear regression model assumes

\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_{\epsilon}^2) \]

Multiple linear regression

At any combination of the predictors, the mean value of the response \(Y\), is

\[ E(Y|X_1, \ldots, X_p) = \beta_0 + \beta_1 X_{1} + \beta_2 X_2 + \dots + \beta_p X_p \]

Using multiple linear regression, we can estimate the mean response for any combination of predictors

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1} + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_{p} \]

Model fit

penguin_fit <- lm(body_mass_g ~ flipper_length_mm + species + 
                bill_length_mm, data = penguins)

tidy(penguin_fit) |>
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -3904.387 529.257 -7.377 0.000
flipper_length_mm 27.429 3.176 8.638 0.000
speciesChinstrap -748.562 81.534 -9.181 0.000
speciesGentoo 90.435 88.647 1.020 0.308
bill_length_mm 61.736 7.126 8.664 0.000

Model equation

\[ \begin{align}\widehat{\text{body_mass_g}} = -3904.387 &+27.429 \times \text{flipper_length_mm}\\ & -748.562 \times \text{Chinstrap}\\ &+ 90.435 \times \text{Gentoo}\\ &+ 61.736 \times \text{bill_length_mm} \end{align} \]

Note

We will talk about why there are two terms in the model for species shortly!

Interpreting \(\hat{\beta}_j\)

  • The estimated coefficient \(\hat{\beta}_j\) is the expected change in the mean of \(Y\) when \(X_j\) increases by one unit, holding the values of all other predictor variables constant.
  • Example: The estimated coefficient for flipper_length_mm is 27.429. This means for each additional millimeter in a penguin’s flipper length, its body mass is expected to be greater by 27.429 grams, on average, holding species and bill length constant.

Prediction

What is the predicted body mass for a Gentoo penguin with a flipper length of 200 millimeters and bill length of 45 millimeters?


-3904.387 + 27.429 * 200 - 748.562 * 0 + 90.435 * 1 + 61.736 * 45
[1] 4449.968


The predicted body mass for a Gentoo penguin with a flipper length of 200 millimeters and bill length of 45 millimeters is 4449.968 grams.

Prediction, revisited

Just like with simple linear regression, we can use the predict() function in R to calculate the appropriate intervals for our predicted values:

new_penguin <- tibble(
  flipper_length_mm  = 200, 
  species = "Gentoo", 
  bill_length_mm = 45
)

predict(penguin_fit, new_penguin)
       1 
4449.955 

Note

Difference in predicted value due to rounding the coefficients on the previous slide.

Confidence interval for \(\hat{\mu}_y\)

Calculate a 90% confidence interval for the estimated mean body mass a Gentoo penguins with a flipper length of 200 millimeters and bill length of 45 millimeters.


predict(penguin_fit, new_penguin, interval = "confidence", 
        level = 0.90) |> 
  kable(digits = 3)
fit lwr upr
4449.955 4355.238 4544.671

Prediction interval for \(\hat{y}\)

Calculate a 90% prediction interval for the estimated body mass for an individual Gentoo penguin with a flipper length of 200 millimeters and bill length of 45 millimeters.


predict(penguin_fit, new_penguin, interval = "prediction", 
        level = 0.90) |>
  kable(digits = 3)
fit lwr upr
4449.955 3881.035 5018.875

Cautions

  • Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
  • The multiple regression model only shows association, not causality
    • To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study

Types of predictors

Categorical predictors

Indicator variables

  • Suppose there is a categorical variable with \(k\) categories (levels)

  • We can make \(k\) indicator variables - one indicator for each category

  • An indicator variable takes values 1 or 0

    • 1 if the observation belongs to that category
    • 0 if the observation does not belong to that category

Indicator variables for species

penguins <- penguins |>
  mutate(
    adelie = if_else(species == "Adelie", 1, 0),
    chinstrap = if_else(species == "Chinstrap", 1, 0),
    gentoo = if_else(species == "Gentoo", 1, 0)
  )
# A tibble: 3 × 4
  species   adelie chinstrap gentoo
  <fct>      <dbl>     <dbl>  <dbl>
1 Adelie         1         0      0
2 Gentoo         0         0      1
3 Chinstrap      0         1      0

Indicators in the model

  • We will use \(k-1\) of the indicator variables in the model.
  • The baseline is the category that doesn’t have a term in the model. This is also called the reference level.
  • The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
  • This approach is also called dummy coding.
penguins |>
  select(species, chinstrap, gentoo) |>
  slice(1, 152, 283)
# A tibble: 3 × 3
  species   chinstrap gentoo
  <fct>         <dbl>  <dbl>
1 Adelie            0      0
2 Gentoo            0      1
3 Chinstrap         1      0

Interpreting species

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -3904.387 529.257 -7.377 0.000 -4945.450 -2863.324
flipper_length_mm 27.429 3.176 8.638 0.000 21.182 33.675
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_mm 61.736 7.126 8.664 0.000 47.720 75.753
  • The baseline category is Adelie.
  • Penguins from the Chinstrap species are expected to have a body mass that is 748.562 grams less, on average, compared to penguins from the Adelie species, holding flipper length and bill length constant.

Interpret the coefficient of Gentoo in the context of the data.

Transforming quantitative predictors

Interpreting the intercept

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -3904.387 529.257 -7.377 0.000 -4945.450 -2863.324
flipper_length_mm 27.429 3.176 8.638 0.000 21.182 33.675
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_mm 61.736 7.126 8.664 0.000 47.720 75.753


The intercept -3904.387 is not meaningful. Let’s transform some variables to make this intercept meaningful.

Centering

  • Centering a quantitative predictor means shifting every value by some constant \(C\)

\[ X_{cent} = X - C \]

  • One common type of centering is mean-centering, in which every value of a predictor is shifted by its mean

  • Only quantitative predictors are centered

  • Center all quantitative predictors in the model for ease of interpretation

What is one reason one might want to center the quantitative predictors? What is are the units of centered variables?

Centering

Use the scale() function with center = TRUE and scale = FALSE to mean-center variables

penguins <- penguins |>
  mutate(flipper_length_cent = scale(flipper_length_mm, center = TRUE, scale = FALSE), 
         bill_length_cent = scale(bill_length_mm, center = TRUE, scale = FALSE))

Original vs. mean-centered variable

Original variable


Mean SD
200.915 14.062

Mean-centered variable


Mean SD
0 14.062


Using mean-centered variables in the model

How do you expect the model to change if we use flipper_length_cent and bill_length_cent in the model?

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4318.066 45.674 94.542 0.000 4228.225 4407.908
flipper_length_cent 27.429 3.176 8.638 0.000 21.182 33.675
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_cent 61.736 7.126 8.664 0.000 47.720 75.753

Original vs. mean-centered model

term estimate
(Intercept) -3904.387
flipper_length_mm 27.429
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_mm 61.736
term estimate
(Intercept) 4318.066
flipper_length_cent 27.429
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_cent 61.736

What has changed? What is the same?

Standardizing

  • Standardizing a quantitative predictor mean shifting every value by the mean and dividing by the standard deviation of that variable

\[ X_{std} = \frac{X - \bar{X}}{S_X} \]

  • Only quantitative predictors are standardized

  • Standardize all quantitative predictors in the model for ease of interpretation

What is one reason one might want to standardize the quantitative predictors? What is are the units of standardized variables?

Standardizing

Use the scale() function with center = TRUE and scale = TRUE to standardized variables


penguins <- penguins |>
  mutate(flipper_length_std = scale(flipper_length_mm, center = TRUE, scale = TRUE), 
         bill_length_std = scale(bill_length_mm, center = TRUE, scale = TRUE))

Original vs. standardized variable

Original variable


Mean SD
200.915 14.062

Standardized variable


Mean SD
0 1


Using standardized variables in the model

How do you expect the model to change if we use flipper_length_std and bill_length_std in the model?

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4318.066 45.674 94.542 0.000 4228.225 4407.908
flipper_length_std 385.696 44.654 8.638 0.000 297.862 473.531
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_std 337.055 38.902 8.664 0.000 260.533 413.577

Original vs. standardized model

term estimate
(Intercept) -3904.387
flipper_length_mm 27.429
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_mm 61.736
term estimate
(Intercept) 4318.066
flipper_length_std 385.696
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_std 337.055

What has changed? What is the same?

Interaction terms

Interaction terms

  • Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
  • This is an interaction effect.
  • To account for this, we can include interaction terms in the model.

Bill length versus species

If the lines are not parallel, there is indication of a potential interaction effect, i.e., the slope of bill length may differ based on the species.

Interaction term in model

penguin_int_fit <- lm(body_mass_g ~ flipper_length_mm + species + bill_length_mm + species * bill_length_mm,
      data = penguins)
term estimate std.error statistic p.value
(Intercept) -4297.905 645.054 -6.663 0.000
flipper_length_mm 27.263 3.175 8.586 0.000
speciesChinstrap 1146.287 726.217 1.578 0.115
speciesGentoo 54.716 619.934 0.088 0.930
bill_length_mm 72.692 10.642 6.831 0.000
speciesChinstrap:bill_length_mm -41.035 16.104 -2.548 0.011
speciesGentoo:bill_length_mm -1.163 14.436 -0.081 0.936

Interpreting interaction terms

  • What the interaction means: The effect of bill length on the body mass is 41.035 less when the penguin is from the Chinstrap species compared to the effect for the Adelie species, holding all else constant.
  • Interpreting bill_length_mm for Chinstrap: For each additional millimeter in bill length, we expect the body mass of Chinstrap penguins to increase by 31.657 grams (72.692 - 41.035), holding all else constant.

Summary


In general, how do

indicators for categorical predictors impact the model equation?

interaction terms impact the model equation?

Recap

  • Introduced multiple linear regression

  • Interpreted coefficients in the multiple linear regression model

  • Calculated predictions and associated intervals for multiple linear regression models

  • Mean-centered and standardized quantitative predictors

  • Used indicator variables for categorical predictors

  • Used interaction terms