Lab 01: Linear regression

Ice duration and air temperature in Madison, WI

Important

This lab is due on Thursday, January 30 at 11:59pm. To be considered on time, the following must be done by the due date:

  • Final .qmd and .pdf files pushed to your GitHub repo

  • Final .pdf file submitted on Gradescope

Introduction

In this lab, you will use linear regression to understand the changes in ice duration over time and explore the relationship between air temperature and ice duration for two lakes in Wisconsin.

Learning goals

By the end of the lab you will…

  • Fit and interpret linear regression models in R.
  • Use simulation-based inference to draw conclusions about the slope.
  • Continue developing a workflow for reproducible data analysis.

Getting started

  • Go to the sta210-sp25 organization on GitHub. Click on the repo with the prefix lab-01-. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo, starting a new project in R, and configuring git.

Packages

The following packages are used in the lab:

library(tidyverse)
library(tidymodels)
library(knitr)

# add other packages as needed

Data: Ice cover and air temperature

The datasets wi-icecover.csv and wi-air-temperature.csv contain information about ice cover and air temperature, respectively, at Lake Monona and Lake Mendota (both in Madison, Wisconsin) for days in 1886 through 2019. The data were obtained from the ntl_icecover and ntl_airtemp data frames in the lterdatasampler R package. They were originally collected by the US Long Term Ecological Research program (LTER) Network.

The primary variables for this analysis are

  • year: year of observation

  • lakeid: lake name

  • ice_duration: number of days between the freeze and ice breakup dates of each lake

  • air_temp_avg: yearly average air temperature in Madison, WI (degrees Celsius)

icecover <- read_csv("data/wi-icecover.csv")
airtemp <- read_csv("data/wi-air-temperature.csv")

Exercises

Goal: The goal of this analysis is to use linear regression to understand the change in ice duration over time and the association between average air temperature and ice duration for two lakes in Madison, Wisconsin that freeze for a portion of the year. Because ice cover is impacted by various environmental factors, researchers are interested in examining the association between these factors to better understand the changing climate.


Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.

Tip

Make sure we can read all of your code in your PDF document. This means you will need to break up long lines of code. One way to help avoid long lines of code is to start a new line after every pipe (|>) and plus sign (+).
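As a small illustration of this layout, here is a sketch that uses a made-up data frame (toy and its values are hypothetical, not the lab data). Each pipe and each plus sign ends a line, so no single line runs off the page.

```r
library(tidyverse)

# Toy data with made-up values, used only to illustrate code layout
toy <- tibble(
  year = c(2000, 2001, 2002, 2003),
  ice_duration = c(101, 95, 88, 92)
)

# Start a new line after every pipe (|>) ...
toy_recent <- toy |>
  filter(year >= 2001) |>
  mutate(ice_weeks = ice_duration / 7)

# ... and after every plus sign (+) when building a plot
toy_plot <- ggplot(toy_recent, aes(x = year, y = ice_duration)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Ice duration (days)",
    title = "Ice duration by year (toy data)"
  )
```

Long argument lists, such as the one in labs(), can be broken the same way, with one argument per line.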

Exercise 1

We’ll start by combining the data into a single data frame.

  1. Join icecover and airtemp to create a new data frame that contains both the ice duration and average air temperature. Your new data frame will have one row per lake per year.

  2. How many observations (rows) are in the new data frame? How many variables (columns)?

  3. Part of this analysis will focus on the change in ice duration over time. Add a new variable to the combined data frame that is the number of years since 1886 (the first year in the data set).
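The steps in this exercise might look roughly like the sketch below. It uses two small made-up data frames (icecover_toy and airtemp_toy are hypothetical stand-ins) and assumes the air temperature data has one row per year, so that year is the join key. Check the structure of the real data before choosing a join; the new column name years_since_1886 is also just one possible choice.

```r
library(tidyverse)

# Toy stand-ins for icecover and airtemp (all values are made up)
icecover_toy <- tibble(
  year = c(1886, 1886, 1887, 1887),
  lakeid = c("Lake Mendota", "Lake Monona", "Lake Mendota", "Lake Monona"),
  ice_duration = c(120, 117, 110, 108)
)

airtemp_toy <- tibble(
  year = c(1886, 1887),
  air_temp_avg = c(6.8, 7.4)
)

# Assumption: one temperature row per year, so joining on year repeats
# each year's temperature for both lakes
ice_air_toy <- icecover_toy |>
  left_join(airtemp_toy, by = "year") |>
  mutate(years_since_1886 = year - 1886)

# Number of rows and columns in the combined data
dim(ice_air_toy)
```

A left join keeps every lake-year row from the ice cover data, which matches the goal of one row per lake per year.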

Important

You will use the combined data for the remainder of the lab.

Exercise 2

  1. Make a histogram to visualize the distribution of the response variable ice_duration, then calculate summary statistics for the center and spread of the distribution. Include informative axis labels and title on the visualization.

  2. Use the visualization and summary statistics to describe the distribution of ice_duration. Include the shape, center, spread, and potential presence of outliers in the description.
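One possible shape for this code is sketched below on made-up values (ice_toy is hypothetical). The bin width and the particular summary statistics are illustrative choices, not requirements of the lab.

```r
library(tidyverse)

# Toy data with made-up ice durations (days)
ice_toy <- tibble(
  ice_duration = c(88, 95, 97, 101, 103, 104, 108, 110, 115, 124)
)

# Histogram with informative labels; binwidth is an illustrative choice
ice_hist <- ggplot(ice_toy, aes(x = ice_duration)) +
  geom_histogram(binwidth = 5) +
  labs(
    x = "Ice duration (days)",
    y = "Count",
    title = "Distribution of ice duration (toy data)"
  )

# Summary statistics for center (mean, median) and spread (sd, IQR)
ice_summary <- ice_toy |>
  summarise(
    mean = mean(ice_duration),
    median = median(ice_duration),
    sd = sd(ice_duration),
    iqr = IQR(ice_duration)
  )
```

Reporting both the mean and the median makes it easier to comment on skewness, since the two differ noticeably when the distribution is skewed.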

Exercise 3

  1. Visualize the distribution of ice_duration using a different type of plot.

  2. What type of plot did you make? What is one feature of the distribution of ice duration that is more apparent in the plot from this exercise than the histogram? What is one feature of the distribution that is more apparent in the histogram than the plot in this exercise?

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed Exercises 1 - 3”), and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 4

Now we will explore the change in ice duration over time.

  1. Visualize ice duration over time for each lake. Include informative axis labels and title on the visualization.

  2. Use the visualization to describe how ice duration has generally changed over time.

  3. Based on the visualization, do you think lakeid is a useful predictor of ice duration? Briefly explain why or why not.
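A plot of this kind could be built along the following lines. The data frame ice_time_toy and its values are made up, and mapping lakeid to color is only one way to separate the lakes (faceting is another).

```r
library(tidyverse)

# Toy data with made-up values for two lakes
ice_time_toy <- tibble(
  year = c(1900, 1950, 2000, 1900, 1950, 2000),
  lakeid = c(
    "Lake Mendota", "Lake Mendota", "Lake Mendota",
    "Lake Monona", "Lake Monona", "Lake Monona"
  ),
  ice_duration = c(115, 104, 92, 112, 106, 95)
)

# Color distinguishes the lakes so their trends can be compared directly
ice_time_plot <- ggplot(
  ice_time_toy,
  aes(x = year, y = ice_duration, color = lakeid)
) +
  geom_point() +
  labs(
    x = "Year",
    y = "Ice duration (days)",
    color = "Lake",
    title = "Ice duration over time by lake (toy data)"
  )
```

If the points for the two lakes largely overlap, that overlap is itself useful evidence when judging whether lakeid helps predict ice duration.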

Exercise 5

  1. Fit the linear regression model using years since 1886 to understand variability in ice duration. Neatly display the results using 3 digits.

  2. Interpret the slope in the context of the data.

  3. Does the intercept have a meaningful interpretation in this context? If so, interpret the intercept in the context of the data. Otherwise, explain why not.
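A minimal sketch of fitting and displaying a one-predictor model follows, using made-up data (fit_toy is hypothetical). It assumes the new variable is named years_since_1886 and uses lm() with tidy() and kable(); other fitting interfaces would work as well.

```r
library(tidyverse)
library(tidymodels)
library(knitr)

# Toy data with made-up values
fit_toy <- tibble(
  years_since_1886 = c(0, 10, 20, 30, 40, 50),
  ice_duration = c(118, 112, 115, 108, 104, 106)
)

# Fit the simple linear regression model
ice_fit <- lm(ice_duration ~ years_since_1886, data = fit_toy)

# tidy() returns one row per coefficient
ice_tidy <- tidy(ice_fit)

# kable() formats the table, rounded to 3 digits
kable(ice_tidy, digits = 3)
```

Because the predictor is measured in years since 1886, the intercept is the predicted ice duration when years_since_1886 is 0, that is, in 1886.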

Exercise 6

Conduct a permutation test for the slope to assess whether there is sufficient evidence of a change in ice duration over time. In your response:

  • State the null and alternative hypotheses in words and mathematical notation.
  • Show all relevant code and output used to conduct the test. Use set.seed(2025) and 1000 iterations to construct the appropriate distribution.
  • Visualize the simulated null distribution.
  • State the conclusion in the context of the data.
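The mechanics of a permutation test for a slope can be sketched with the infer package, which is loaded as part of tidymodels. The data below are simulated purely for illustration (sim_data is hypothetical and is not the lab data), and a two-sided alternative is assumed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2025)

# Simulated stand-in data: a weak downward trend plus noise
sim_data <- tibble(
  years_since_1886 = 0:49,
  ice_duration = 110 - 0.2 * years_since_1886 + rnorm(50, sd = 8)
)

# Observed slope from the data as given
obs_slope <- sim_data |>
  specify(ice_duration ~ years_since_1886) |>
  calculate(stat = "slope")

# Null distribution: permuting the response breaks any association
# with the predictor, which is what the null hypothesis assumes
null_dist <- sim_data |>
  specify(ice_duration ~ years_since_1886) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "slope")

# Histogram of permuted slopes with the observed slope marked
null_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = obs_slope, direction = "two-sided")

# Proportion of permuted slopes at least as extreme as the observed one
p_val <- get_p_value(
  null_dist,
  obs_stat = obs_slope,
  direction = "two-sided"
)
```

Setting the seed before the random steps makes the permuted slopes, and therefore the p-value, reproducible each time the document is rendered.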

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed Exercises 4 - 6”), and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 7

Now let’s analyze the primary relationship of interest: the relationship between air temperature and ice duration.

  1. Fit the linear regression model using average air temperature to understand variability in ice duration. Neatly display the results using 3 digits.

  2. Construct a 93% bootstrap confidence interval for the slope using the same seed as before (2025). In your response:

    • Show all relevant code and output used to construct the bootstrap distribution and compute the confidence interval. Use 1000 iterations to construct the distribution.
    • Visualize the bootstrap distribution.

  3. Interpret the 93% confidence interval for the slope in the context of the data.
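A bootstrap interval for a slope can be sketched with the infer package as shown below. The data are simulated only for illustration (boot_data is hypothetical), and the percentile method is assumed. For a 93% interval, that method takes the 3.5th and 96.5th percentiles of the bootstrap slopes.

```r
library(tidyverse)
library(tidymodels)

set.seed(2025)

# Simulated stand-in data: warmer years tend to have shorter ice duration
boot_data <- tibble(
  air_temp_avg = runif(50, min = 5, max = 10),
  ice_duration = 150 - 6 * air_temp_avg + rnorm(50, sd = 8)
)

# Bootstrap distribution: resample rows with replacement and
# recompute the slope for each resample
boot_dist <- boot_data |>
  specify(ice_duration ~ air_temp_avg) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")

# 93% percentile interval: middle 93% of the bootstrap slopes
ci_93 <- get_confidence_interval(
  boot_dist,
  level = 0.93,
  type = "percentile"
)

# Histogram of bootstrap slopes with the interval shaded
boot_plot <- visualize(boot_dist) +
  shade_confidence_interval(endpoints = ci_93)
```

Unlike the permutation test, no hypothesize() step appears here, because the bootstrap estimates the sampling variability of the slope rather than simulating under a null hypothesis.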

Exercise 8

Previous exercises have shown evidence of changing ice duration over time and the effect of air temperature on ice duration. Now let’s consider both of these factors in a single model.

  1. Fit a linear regression model using years since 1886 and average air temperature to understand variability in ice duration. Neatly display the results using 3 digits.

Tip

Below is template code for a model with 2 predictor variables:

model <- lm(y ~ x1 + x2, data = my_data)

  2. How does the slope of air_temp_avg in this model compare to the slope in the model from the previous exercise? Does the slope indicate a stronger, weaker, or equal effect of air temperature on ice duration?

  3. Briefly explain why the slopes of air temperature differ between the two models.

Render, commit, and push your final changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Access Gradescope through the STA 210 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your .pdf submission to be associated with the “Workflow & formatting” section.

Grading (50 points)


Component                Points
Ex 1                     4
Ex 2                     5
Ex 3                     4
Ex 4                     6
Ex 5                     6
Ex 6                     8
Ex 7                     9
Ex 8                     5
Workflow & formatting    3