Mar 04, 2025
Exam corrections (optional) due TODAY at 11:59pm on Canvas
Team Feedback (email from Teammates) due TODAY at 11:59pm
HW 03 due Tuesday March 18 at 11:59pm
Next project milestone: Exploratory data analysis due March 20
DataFest: April 4 - 6 - https://dukestatsci.github.io/datafest/
Multicollinearity
Recap
What to do about it
# A tibble: 5 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>
1    501       83    66.5 Summer       7.60  0     Weekday
2    419       73    61   Summer       6.30  0.290 Weekday
3    397       74    63   Spring       7.5   0.320 Weekday
4    385       95    78   Summer       2.60  0     Weekend
5    200       44    48   Spring      10     0.140 Weekday
Source: Pioneer Valley Planning Commission via the mosaicData package.
Outcome:
volume estimated number of trail users that day (number of breaks recorded)
Predictors:
hightemp daily high temperature (in degrees Fahrenheit)
avgtemp average of daily low and daily high temperature (in degrees Fahrenheit)
season one of “Fall”, “Spring”, or “Summer”
precip measure of precipitation (in inches)
Multicollinearity: near-linear dependence among predictors
The variance inflation factor (VIF) measures how much the linear dependencies among the predictors inflate the variance of the estimated regression coefficients
\[ VIF_{j} = \frac{1}{1 - R^2_j} \]
where \(R^2_j\) is the proportion of variation in \(x_j\) that is explained by all the other predictors
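As a rough illustration of the formula, the sketch below computes the VIF for hightemp using only the five rows printed earlier and only one other predictor, avgtemp. With a single other predictor, \(R^2_j\) is simply the squared correlation between the two variables. This is plain Python rather than the course's own tooling, the function names are made up for this example, and the value from the full data set and full model will differ.

```python
# Five rows shown in the tibble (not the full data set).
hightemp = [83, 73, 74, 95, 44]
avgtemp = [66.5, 61, 63, 78, 48]


def r_squared_one_predictor(x, z):
    """R^2 from regressing x on z alone (the squared Pearson correlation)."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    szz = sum((b - z_bar) ** 2 for b in z)
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    return sxz ** 2 / (sxx * szz)


def vif(r2):
    """VIF_j = 1 / (1 - R^2_j)."""
    return 1 / (1 - r2)


r2_hightemp = r_squared_one_predictor(hightemp, avgtemp)
vif_hightemp = vif(r2_hightemp)
print(f"R^2 = {r2_hightemp:.3f}, VIF = {vif_hightemp:.1f}")
```

On these five rows the VIF comes out near 27, because hightemp and avgtemp move almost in lockstep.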
Thresholds:
VIF > 10: concerning multicollinearity
VIF > 5: potentially worth further investigation
Application exercise
When we have perfect collinearities, the least-squares coefficients are not uniquely determined, so we are unable to get estimates for them
When we have almost perfect collinearities (i.e. highly correlated predictor variables), the standard errors for our regression coefficients inflate
In other words, we lose precision in our estimates of the regression coefficients
This impedes our ability to use the model for inference
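It is also difficult to interpret the model coefficients, because "holding the other predictors constant" is not realistic when the predictors move together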
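To make the perfect-collinearity case concrete, here is a minimal sketch in plain Python. It assumes a hypothetical second predictor, the same high temperature converted to degrees Celsius, which is not a variable in the data set. Because that column is an exact linear function of the intercept and hightemp, the matrix \(X^TX\) has determinant zero and the normal equations have no unique solution. Exact fractions are used so that rounding does not hide the zero.

```python
from fractions import Fraction

# hightemp values from the five rows shown earlier (degrees Fahrenheit).
hightemp_f = [83, 73, 74, 95, 44]

# Hypothetical redundant predictor: the same temperatures in Celsius.
hightemp_c = [Fraction(5, 9) * (f - 32) for f in hightemp_f]

# Design matrix with columns: intercept, hightemp (F), hightemp (C).
X = [[Fraction(1), Fraction(f), c] for f, c in zip(hightemp_f, hightemp_c)]


def gram(rows):
    """Return X^T X for a design matrix stored as a list of rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


det_xtx = det3(gram(X))
print(det_xtx)  # 0: X^T X is singular, so it cannot be inverted
```

With almost perfect collinearity the determinant is small but nonzero, so estimates exist but their standard errors are inflated.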
Collect more data (often not feasible given practical constraints)
Redefine the correlated predictors to keep the information from predictors but eliminate collinearity
For categorical predictors, avoid using levels with very few observations as the baseline
Remove one of the correlated variables
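One way the "redefine" option might look, sketched in plain Python on the five rows shown earlier: replace hightemp with the gap between the daily high and the daily average. The variable `temp_gap` is hypothetical and not part of the data set. No information is lost, since hightemp = avgtemp + temp_gap. As before, with only two predictors the VIF reduces to \(1 / (1 - r^2)\), where \(r\) is their correlation.

```python
# Five rows shown in the tibble (not the full data set).
hightemp = [83, 73, 74, 95, 44]
avgtemp = [66.5, 61, 63, 78, 48]


def vif_two_predictors(x, z):
    """VIF when x and z are the only predictors: 1 / (1 - r^2)."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    szz = sum((b - z_bar) ** 2 for b in z)
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    r2 = sxz ** 2 / (sxx * szz)
    return 1 / (1 - r2)


# Hypothetical redefined predictor: how far the high sits above the average.
temp_gap = [h - a for h, a in zip(hightemp, avgtemp)]

vif_before = vif_two_predictors(hightemp, avgtemp)
vif_after = vif_two_predictors(temp_gap, avgtemp)
print(f"VIF with hightemp: {vif_before:.1f}")
print(f"VIF with temp_gap: {vif_after:.1f}")
```

On these five rows the VIF drops from about 27 to about 5.5, so this particular redefinition reduces the collinearity without eliminating it. Whether it helps on the full data would need to be checked there.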