Multicollinearity

Author

Prof. Maria Tackett

Published

Feb 27, 2025

Announcements

Exam corrections (optional) due Tuesday, March 4 at 11:59pm on Canvas
Project proposal due TODAY at 11:59pm
Team Feedback (email from Teammates) due Tuesday, March 4 at 11:59pm
DataFest: April 4 - 6 - https://dukestatsci.github.io/datafest/

Computing set up

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(patchwork)
library(GGally)   # for pairwise plot matrix
library(corrplot) # for correlation matrix

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Topics

Multicollinearity
- Definition
- How it impacts the model
- How to detect it
- What to do about it

Data: Trail users

The Pioneer Valley Planning Commission (PVPC) collected data at the beginning of a trail in Florence, MA for ninety days from April 5, 2005 to November 15, 2005.
Data collectors set up a laser sensor, and each break in the laser beam recorded a rail-trail user passing the data collection station.

# A tibble: 5 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
1    501       83    66.5 Summer       7.60  0     Weekday 
2    419       73    61   Summer       6.30  0.290 Weekday 
3    397       74    63   Spring       7.5   0.320 Weekday 
4    385       95    78   Summer       2.60  0     Weekend 
5    200       44    48   Spring      10     0.140 Weekday

Source: Pioneer Valley Planning Commission via the mosaicData package.

Variables

Outcome:

volume: estimated number of trail users that day (number of breaks recorded)

Predictors:

hightemp: daily high temperature (in degrees Fahrenheit)
avgtemp: average of daily low and daily high temperature (in degrees Fahrenheit)
season: one of “Fall”, “Spring”, or “Summer”
precip: measure of precipitation (in inches)

EDA: Relationship between predictors

We can create a pairwise plot matrix using the ggpairs function from the GGally R package

rail_trail |>
  select(hightemp, avgtemp, season, precip) |>
  ggpairs()

EDA: Correlation matrix

We can use corrplot() in the corrplot R package to make a matrix of pairwise correlations between quantitative predictors

correlations <- rail_trail |>
  select(hightemp, avgtemp, precip) |>
  cor()

corrplot(correlations, method = "number")

EDA: Correlation matrix

What might be a potential concern with a model that uses high temperature, average temperature, season, and precipitation to predict volume?

Multicollinearity

Ideally the predictors are completely independent of one another
In practice, there is typically some relationship between predictors but it is often not a major issue in the model
If predictors are perfectly correlated, we cannot find unique values of \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\) that best fit the model
If predictors are strongly correlated, we can find \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\), but there may be other issues with the model
Multicollinearity: predictors are strongly correlated with each other

Source: Montgomery, Peck, and Vining (2021)

Sources of multicollinearity

Data collection method - only sample from a subspace of the region of predictors
Constraints in the population - e.g., predictors family income and size of house
Choice of model - e.g., adding high order or interaction terms to the model
Overdefined model - have more predictors than observations

Example: Issue with multicollinearity

Let’s assume the true population regression equation is \(y = 3 + 4x\)

Suppose we try estimating that equation using a model with variables \(x\) and \(z = x/10\)

\[ \begin{aligned}\hat{y}&= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2z\\ &= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2\frac{x}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x \end{aligned} \]

Example: Issue with multicollinearity

\[\hat{y} = \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x\]

We can set \(\hat{\beta}_1\) and \(\hat{\beta}_2\) to any two numbers such that \(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\)
Therefore, we are unable to choose the “best” combination of \(\hat{\beta}_1\) and \(\hat{\beta}_2\)
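
The example above can be checked with a small simulation. This is a minimal sketch using simulated data (the sample size, noise level, and object names such as perfect_fit are assumptions for illustration, not part of the lecture data). When one predictor is an exact linear function of another, lm() cannot estimate both coefficients and reports NA for one of them.

```r
# Simulated data (assumption): true relationship is y = 3 + 4x plus random noise
set.seed(210)
x <- runif(50, min = 0, max = 10)
z <- x / 10                          # z is an exact linear function of x
y <- 3 + 4 * x + rnorm(50, sd = 1)

# Fit a model with both x and z
perfect_fit <- lm(y ~ x + z)
coef(perfect_fit)                    # the coefficient for z is reported as NA

# Two different coefficient pairs satisfying beta1 + beta2 / 10 = 4
# give the same predictions, so the data cannot distinguish between them
pred_a <- 3 + 4 * x + 0 * z          # beta1 = 4, beta2 = 0
pred_b <- 3 + 0 * x + 40 * z         # beta1 = 0, beta2 = 40
all.equal(pred_a, pred_b)
```

Because both sets of coefficients produce the same fitted values, there is no single "best" pair, which is why R drops one of the two predictors.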

Variance inflation factor

The variance inflation factor (VIF) is a measure of the collinearity between predictor \(x_j\) and all other predictors in the model

\[ VIF_{j} = \frac{1}{1 - R^2_j} \]

where \(R^2_j\) is the proportion of variation in \(x_j\) that is explained by all the other predictors
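
To make the formula concrete, the sketch below computes VIF by hand on simulated predictors (x1, x2, x3 and the object names are hypothetical, chosen only for illustration). It regresses one predictor on the others to get \(R^2_j\), then applies \(1 / (1 - R^2_j)\). As a cross-check, it uses the fact that the VIFs equal the diagonal of the inverse of the predictors' correlation matrix.

```r
# Simulated predictors (assumption): x2 is built to be strongly correlated with x1
set.seed(210)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)                       # unrelated to x1 and x2

# R^2_j for x1: proportion of variation in x1 explained by the other predictors
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1

# Cross-check: VIFs for all predictors from the inverse correlation matrix
vif_all <- diag(solve(cor(cbind(x1, x2, x3))))
vif_all
```

Note that VIF depends only on the predictors, not on the response variable.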

Detecting multicollinearity

Common practice uses threshold \(VIF > 10\) as indication of concerning multicollinearity (some say VIF > 5 is worth investigation)
Variables with similar values of VIF are typically the ones correlated with each other
Use the vif() function in the rms R package to calculate VIF

library(rms)

trail_fit <- lm(volume ~ hightemp + avgtemp + precip, data = rail_trail)

vif(trail_fit)

hightemp  avgtemp   precip 
7.161882 7.597154 1.193431

Application exercise

📋 sta210-sp25.netlify.app/ae/ae-08-multicollinearity.html

Part 1

How multicollinearity impacts model

When we have perfect collinearities, we are unable to get estimates for the coefficients
When we have almost perfect collinearities (i.e. highly correlated predictor variables), the standard errors for our regression coefficients inflate
- In other words, we lose precision in our estimates of the regression coefficients
- This impedes our ability to use the model for inference
It is also difficult to interpret the model coefficients
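
The loss of precision can be seen in a small simulation. This is a sketch under stated assumptions: the data are simulated, and the variable names (x2_indep, x2_corr, and so on) are hypothetical. Both models have the same true coefficients and error variance; the only difference is whether the second predictor is nearly a copy of the first.

```r
# Simulated data (assumption): same true coefficients in both scenarios
set.seed(210)
n        <- 100
x1       <- rnorm(n)
x2_indep <- rnorm(n)                     # unrelated to x1
x2_corr  <- x1 + rnorm(n, sd = 0.1)      # almost perfectly correlated with x1

y_indep <- 1 + 2 * x1 + 2 * x2_indep + rnorm(n)
y_corr  <- 1 + 2 * x1 + 2 * x2_corr  + rnorm(n)

# Standard error of the coefficient for x1 in each model
se_indep <- summary(lm(y_indep ~ x1 + x2_indep))$coefficients["x1", "Std. Error"]
se_corr  <- summary(lm(y_corr  ~ x1 + x2_corr))$coefficients["x1", "Std. Error"]

c(independent = se_indep, correlated = se_corr)
```

The standard error for x1 is noticeably larger when x2 is nearly collinear with it, so confidence intervals widen and tests for that coefficient lose power.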

Dealing with multicollinearity

Collect more data (often not feasible given practical constraints)
Redefine the correlated predictors to keep the information from predictors but eliminate collinearity
- e.g., if \(x_1, x_2, x_3\) are correlated, use a new variable \((x_1 + x_2) / x_3\) in the model
For categorical predictors, avoid using levels with very few observations as the baseline
Remove one of the correlated variables
- Be careful about substantially reducing predictive power of the model
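
Two of these remedies can be compared side by side. The sketch below uses simulated data, and the choice of combining the predictors as a simple sum is an assumption made for illustration; in practice the combination should make sense for the variables at hand. It fits a full model, a model with one correlated predictor removed, and a model with a redefined predictor, then compares adjusted \(R^2\) and the standard error for x1.

```r
# Simulated data (assumption): x1 and x2 are strongly correlated
set.seed(210)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

full_fit     <- lm(y ~ x1 + x2)        # both correlated predictors
reduced_fit  <- lm(y ~ x1)             # remove one of the correlated predictors
combined_fit <- lm(y ~ I(x1 + x2))     # redefine the predictors as a single variable

# Compare how much variation each model explains
adj_r_sq <- c(
  full     = summary(full_fit)$adj.r.squared,
  reduced  = summary(reduced_fit)$adj.r.squared,
  combined = summary(combined_fit)$adj.r.squared
)
adj_r_sq

# Compare precision of the coefficient for x1
se_full    <- summary(full_fit)$coefficients["x1", "Std. Error"]
se_reduced <- summary(reduced_fit)$coefficients["x1", "Std. Error"]
c(full = se_full, reduced = se_reduced)
```

Checking adjusted \(R^2\) alongside the standard errors helps show whether a simpler model gains precision without giving up much predictive power.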

Application exercise

📋 sta210-sp25.netlify.app/ae/ae-08-multicollinearity.html

Part 2

Recap

Introduced multicollinearity
- Definition
- How it impacts the model
- How to detect it
- What to do about it

References

Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to Linear Regression Analysis. John Wiley & Sons.