HW 01: Simple linear regression

Education & median income in US counties

Important

This assignment is due on Tuesday, January 28 at 11:59pm. To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your GitHub repo
Final .pdf file submitted on Gradescope

Introduction

In this assignment, you will use simple linear regression to examine the association between the percent of adults with a bachelor’s degree and the median household income for counties in the United States.

Learning goals

In this assignment, you will…

Fit and interpret simple linear regression models.
Construct and interpret bootstrap confidence intervals for the population slope, \(\beta_1\).
Create and interpret spatial data visualizations using R.
Continue developing a workflow for reproducible data analysis.

Getting started

Go to the sta210-sp25 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 for details on cloning a repo and starting a new project in R.

The following packages will be used in this assignment:

#| message: false 

library(tidyverse)  # for data wrangling 
library(tidymodels) # for modeling and inference 
library(knitr)      # to neatly format tables 
library(scales)     # to format visualizations 

# load other packages as needed

Data: US Counties

The data are from the county_2019 data frame in the usdata R package. These data were originally collected in the 2019 American Community Survey (ACS), an annual survey conducted by the United States Census Bureau that collects demographics and other information from a sample of households in the United States. The data frame county_2019 contains county-level statistics from the ACS.

The data for this analysis are available in the file us-counties-sample.csv in the data folder of your repo. It contains a random sample of 600 counties in the United States. This is about 19% of the counties in the United States.

This analysis focuses on the following variables:

bachelors: Percent of population 25 years old and older that earned a Bachelor’s degree or higher
median_household_income: Median household income in US dollars

Click here for the full codebook for the county_2019 data set.

You will use two other data sets county-map-sample.csv and county-map-all.csv to create spatial visualizations of the ACS variables. Use the code below to load all of the data sets.

#| message: false 
#| eval: false 

county_data_sample <- read_csv("data/us-counties-sample.csv") 
map_data_sample <-  read_csv("data/county-map-sample.csv") 
map_data_all <- read_csv("data/county-map-all.csv")

Exercises

There is a lot of public interest in understanding the impact of graduating college, i.e., obtaining a bachelor’s degree, on one’s future career and lifetime earnings. The common convention is that individuals who have earned a bachelor’s degree (or higher) will earn more income over the course of a lifetime than an individual who does not have such a degree.

We will explore this at a county-level and examine the association between the percent of adults 25 years old + with a Bachelor’s degree median household income. Specifically we’d like to answer questions such as, “do counties that have a higher percentage of college graduates have higher median household incomes, on average, compared to counties with a lower percentage of college graduates?”.

Instructions

Type your responses to each question in your Quarto document. Write all narrative using complete sentences and include informative axis labels and titles on visualizations. Use a reproducible workflow by periodically rendering the Quarto document, writing an informative commit message, and pushing the updated .qmd and .pdf files to GitHub.

Part 1: Exploratory data analysis

Exercise 1

Create a histogram of the distribution of the predictor variable bachelors and calculate appropriate summary statistics. Use the visualization and summary statistics to describe the distribution. Include an informative title and axis labels on the plot.

Exercise 2

Let’s view the data in another way. Use the code below to make a map of the United States with the color of the counties filled in based on the percent of residents 25 years old and older who have a Bachelor’s degree. Fill in title and axis labels.

Then use the plot answer the following:

What are 2 observations you have from the map?
What is a feature that is apparent in the map that wasn’t as easily apparent from the histogram in the previous exercise? What is a feature that is apparent in the histogram that is not as easily apparent from the map?

#| fig.show: hide 
#| message: false 

county_map_data <- left_join(county_data_sample, map_data_sample) 

ggplot(data = map_data_all) + 
  geom_polygon(aes(x = long, y = lat, group = group), 
    fill = "lightgray", color = "white" 
    ) + 
  geom_polygon(data = county_map_data, aes(x = long, y = lat, group = group, 
    fill = bachelors) 
    ) + 
  labs( 
    x = "Longitude", 
    y = "Latitude", 
    fill = "_____", 
    title = "_____" 
  ) + 
  scale_fill_viridis_c(labels = label_percent(scale = 1)) + 
  coord_quickmap()

Exercise 3

Create a visualization of the relationship between bachelors and median_household_income and calculate the correlation. Use the visualization and correlation to describe the relationship between the two variables.

This is a good place to render, commit, and push changes to your hw-01 repo on GitHub. Write an informative commit message (e.g. “Completed exercises 1 - 3”), and push every file to GitHub by clicking the check box next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Part 2: Modeling

Exercise 4

We will use a linear regression model to describe the relationship between bachelors and median_household_income.

Write the form of the statistical (theoretical) model we will use for this task using mathematical notation. Use variable names (bachelors and median_household_income) in the equation for your model¹.

Tip

Write median household income in LaTex as

\text{median\_household\_income}

to make it properly render in the .pdf document.

Exercise 5

Fit the regression line corresponding to the statistical model in the previous exercise. Neatly display the model output using 3 digits.
Write the equation of the fitted model using mathematical notation. Use variable names (bachelors and median_household_income) in the equation.

Exercise 6

Interpret the slope in the context of the data.
Is it useful to interpret the intercept for this data? If so, write the interpretation in the context of the data. Otherwise, briefly explain why not.

This is a good place to render, commit, and push changes to your hw-01 repo on GitHub. Write an informative commit message (e.g. “Completed exercises 4 -6”), and push every file to GitHub by clicking the check box next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Part 3: Inference for the U.S.

We will use the data from these 600 randomly selected counties to draw conclusions about the relationship between the percent of adults age 25 and older with a bachelor’s degree and median household income for the over 3,000 counties in the United States.

Exercise 7

What is the population in this analysis? What is the sample?
Is it reasonable to treat the sample in this analysis as representative of the population? Briefly explain why or why not.

Exercise 8

Next, compute a 98% bootstrap confidence interval for the slope. Use set.seed(2025) and 1000 iterations. Show all relevant code and output used to compute the interval.

Write the 98% confidence interval using 3 digits.

Exercise 9

Interpret the interval from the previous exercise in the context of the data.
Suppose you wanted to evaluate the claim that there is no linear relationship between the percent of adults age 25 and older with a bachelor’s degree and median household income for counties in the United States. In other words, you want to evaluate the claim that \(\beta_1 = 0\) in the model written in Exercise 4.

Does the confidence interval from the previous exercise support this claim? Briefly explain why or why not.

Now is a good time to render your document again if you haven’t done so recently, commit (with an informative commit message), and push all updates.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Access Gradescope through the STA 210 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 points)

Component	Points
Ex 1	5
Ex 2	5
Ex 3	5
Ex 4	5
Ex 5	5
Ex 6	5
Ex 7	5
Ex 8	5
Ex 9	5
Workflow & formatting	5

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.

Footnotes

Click here for a guide on writing mathematical symbols using LaTex.↩︎