HW 04: Logistic regression

Due date

This assignment is due on Tuesday, April 8 at 11:59pm. To be considered on time, the following must be done by the due date:

  • Final .qmd and .pdf files pushed to your GitHub repo

  • Final .pdf file submitted on Gradescope

Introduction

In this assignment, you will analyze data from an online Ipsos survey that was conducted for the 2020 “Why Many Americans Don’t Vote” from the now defunct data journalism website FiveThirtyEight. You will use logistic regression for interpretation and prediction. You can find a PDF of the article on Canvas and read more about the polling design and respondents in the README of the GitHub repo for the data.

Learning goals

By the end of the assignment you will be able to…

  • Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables

  • Conduct exploratory data analysis for logistic regression

  • Interpret coefficients of logistic regression model

  • Use statistics to help choose the best fit model

  • Use the logistic regression model for prediction and classification

Getting started

  • Go to the sta210-sp25 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 00 for details on cloning a repo and starting a new project in R.

Packages

The following packages will be used for this assignment.

library(tidyverse)
library(tidymodels)
library(knitr)

# add other packages as needed

Data: “Why Many Americans Don’t Vote”

The data from the 2020 article “Why Many Americans Don’t Vote” includes information from polling done by Ipsos for FiveThirtyEight. Respondents were asked a variety of questions about their political beliefs, thoughts on multiple issues, and voting behavior. We will focus on using the demographic variables and someone’s party identification to understand whether an eligible voter is a “frequent” voter.

The codebook for the variable definitions can be found in the GitHub repo for the data. The variables we’ll focus on are:

  • ppage: Age of respondent

  • educ: Highest educational attainment category.

  • race: Race of respondent, census categories. Note: all categories except Hispanic are non-Hispanic.

  • gender: Gender of respondent

  • income_cat: Household income category of respondent

  • Q30: Response to the question “Generally speaking, do you think of yourself as a…”

    • 1:Republican
    • 2: Democrat
    • 3: Independent
    • 4: Another party, please specify
    • 5: No preference
    • -1: No response
  • voter_category: past voting behavior:

    • always: respondent voted in all or all-but-one of the elections they were eligible in
    • sporadic: respondent voted in at least two, but fewer than all-but-one of the elections they were eligible in
    • rarely/never: respondent voted in 0 or 1 of the elections they were eligible in

The data is in the file fivethirtyeight-nonvoters-data.csv in the data folder of your GitHub repo.

Caution

Note that the authors use weighting to make the final sample more representative on the US population for their article. We will not use the weighting in this assignment, so we should treat the sample as a convenience sample rather than a random sample of the population.

Exercises

The goal of this analysis is use the polling data to examine the relationship between U.S. adults’ political party identification and voting behavior.

Exercise 1

Why do you think the authors chose to only include data from people who were eligible to vote for at least four election cycles?

Exercise 2

Let’s prepare the data for analysis and modeling.

  • Create a new variable called frequent_voter that takes the value 1 if the voter_category is “always” and 0 otherwise.
  • Make a table of the distribution of frequent_voter.
  • What percentage of the respondents in the data say they voted “in all or all-but-one of the elections they were eligible in”?

Exercise 3

The variable Q30 contains the respondent’s political party identification. Make a new variable, party_id, that simplifies Q30 into three categories: “Democrat”, “Republican”, “Independent/Neither”, The category “Independent/Neither” will also include respondents who did not answer the question. Make party_id a factor and relevel it so that it is consistent with the ordering of the responses in Question 30 of the survey.

  • Make a plot of the distribution of party_id.
  • Which category of party_id occurs most frequently in this data set?

Exercise 4

In the FiveThirtyEight article, the authors include visualizations of the relationship between the voter category and demographic variables such as race, age, education, etc.

  • Make a segmented barplot (also known as a stacked barplot) displaying the distribution of frequent_voter for each category of party_id. Make the plot such that the percentages (instead of counts) are displayed.

  • Use the plot to describe the relationship between these two variables.

Tip

See the plots of demographic information by voting history in the FiveThirtyEight article for examples of segmented bar plots.

Exercise 5

Let’s start by fitting a model using the demographic factors - ppage, educ, race, gender, income_cat - to predict the odds a person is a frequent voter.

  • Split the data into training (75%) and testing sets (25%). Use a seed of 29.

  • Fit the model on the training data. Display the model using 3 digits.

  • Consider the relationship between ppage and one’s voting behavior. Interpret the coefficient of ppage in the context of the data in terms of the odds a person is a frequent voter.

Exercise 6

Should party identification be added to the model? Use a drop-in-deviance test to determine if party identification should be added to the model fit in the previous exercise. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.

Exercise 7

Display the model chosen from the previous exercise using 3 digits.

Then use the model selected to write a short paragraph (2 - 5 sentences) describing the effect (or lack of effect) of political party on the odds a person is a frequent voter. The paragraph should include an indication of which levels (if any) are statistically significant along with specifics about the differences in the odds between the political parties, as appropriate.

Exercise 8

In the article, the authors write

“Nonvoters were more likely to have lower incomes; to be young; to have lower levels of education; and to say they don’t belong to either political party, which are all traits that square with what we know about people less likely to engage with the political system.”

Consider the model you selected in Exercise 6. Is it consistent with this statement? Briefly explain why or why not.

Exercise 9

Use the testing data to produce the ROC curve and calculate the area under curve (AUC) for the model selected in Exercise 6. Write 1 - 2 sentences describing how well the model fits the data.

Exercise 10

You have been tasked by a local political organization to identify adults in the community who are frequent voters. These adults will receive targeted political mailings that will be different from the mailings sent to adults who are not frequent voters. You will use the model selected in Exercise 6 to identify the frequent voters.

Make a confusion matrix based on the cut-off probability of 0.25. Use the confusion matrix to calculate the following:

  • Sensitivity

  • Specificity

  • False negative rate

  • False positive rate

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Access Gradescope through the STA 210 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component Points
Ex 1 2
Ex 2 3
Ex 3 4
Ex 4 4
Ex 5 6
Ex 6 8
Ex 7 5
Ex 8 4
Ex 9 5
Ex 10 6
Workflow & formatting 3

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.