AE 12: Exam 02 Review
Go to the course GitHub organization and locate your ae-12 repo to get started.
Render, commit, and push your responses to GitHub by the end of class to submit your AE.
Packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)
Exercise 1
Suppose you fit a simple linear regression model.
Draw or describe a scatterplot that contains an observation with large leverage but low Cook’s distance.
Draw or describe a scatterplot that contains an observation with large leverage and high Cook’s distance.
Draw or describe a scatterplot that contains an observation with a large studentized residual.
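One way to explore these three diagnostics is with a small simulation. The sketch below uses made-up data (the names `x`, `y_on`, and `y_off` are hypothetical and not part of the AE) and places a single point far out in `x`, once on the trend and once well off it.

```r
# Simulated data: 30 ordinary points plus one point far out in x (assumption: made-up values)
set.seed(3)
x <- c(runif(30, 0, 10), 30)
y_on <- 2 + 0.5 * x + rnorm(31, sd = 1)
y_on[31] <- 2 + 0.5 * 30   # extreme x, but exactly on the underlying trend
y_off <- y_on
y_off[31] <- 2             # extreme x and far below the trend

fit_on <- lm(y_on ~ x)
fit_off <- lm(y_off ~ x)

# Leverage depends only on x, so it is identical in both fits
leverage <- hatvalues(fit_on)

# Cook's distance and studentized residual for the extreme point in each fit
diagnostics <- data.frame(
  scenario = c("on trend", "off trend"),
  leverage = unname(c(hatvalues(fit_on)[31], hatvalues(fit_off)[31])),
  cooks_d = unname(c(cooks.distance(fit_on)[31], cooks.distance(fit_off)[31])),
  stud_resid = unname(c(rstudent(fit_on)[31], rstudent(fit_off)[31]))
)
print(diagnostics)
```

Plotting `x` against `y_on` and `y_off` shows the same two scenarios as scatterplots.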
Data: Credit cards
The data for this analysis is about credit card customers. It can be found in the file credit.csv. The following variables are in the data set:
- income: Income in $1,000’s
- limit: Credit limit
- rating: Credit rating
- cards: Number of credit cards
- age: Age in years
- education: Number of years of education
- own: A factor with levels No and Yes indicating whether the individual owns their home
- student: A factor with levels No and Yes indicating whether the individual was a student
- married: A factor with levels No and Yes indicating whether the individual was married
- region: A factor with levels South, East, and West indicating the region of the US the individual is from
- balance: Average credit card balance in $.
The objective of this analysis is to predict whether a person has maxed out their credit card, i.e., had $0 average card balance.
credit <- read_csv("data/credit.csv") |>
  mutate(maxed = factor(if_else(balance == 0, 1, 0)))
Exercise 2
Why is logistic regression the best modeling approach for this analysis?
Describe where each of the following show up in the analysis:
log-odds …
odds
probabilities
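The three scales are linked by simple transformations. The sketch below uses a made-up log-odds value (`log_odds` is hypothetical, not taken from the credit model) and shows the conversions by hand and with base R's `plogis()` and `qlogis()`.

```r
# Hypothetical log-odds for one observation (assumption: illustrative value only)
log_odds <- -1.2

# log-odds -> odds -> probability
odds <- exp(log_odds)
prob <- odds / (1 + odds)

# Base R equivalents: plogis() maps log-odds to probability, qlogis() maps back
prob_check <- plogis(log_odds)
log_odds_check <- qlogis(prob)

print(c(log_odds = log_odds, odds = odds, prob = prob))
```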
Exercise 3
We’ll start by splitting the data into training and testing sets. Then we’ll use the training data to fit a model for predicting the odds of maxed = 1 using income, rating, and region.
# make training and test sets
set.seed(210)
credit_split <- initial_split(credit, prop = 0.8)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)
credit_fit <- glm(maxed ~ income + rating + region, data = credit_train,
                  family = "binomial")
tidy(credit_fit) |>
  kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 10.097 | 1.687 | 5.987 | 0.000 |
| income | 0.115 | 0.026 | 4.444 | 0.000 |
| rating | -0.058 | 0.009 | -6.428 | 0.000 |
| regionSouth | -0.632 | 0.715 | -0.884 | 0.377 |
| regionWest | -0.332 | 0.724 | -0.458 | 0.647 |
The logistic regression model takes the following form:
\[ \log(\frac{\pi_i}{1 - \pi_i}) = \beta_0 + \beta_1 ~ income + \beta_2 ~ rating + \beta_3 ~ regionSouth + \beta_4 ~ regionWest \]
Write the interpretation of income in terms of the odds of maxing out a credit card.
Use the equation above to show the expected change in the odds of maxing out a credit card when the credit rating increases by 10 points. Assume income and region are constant. Write your answer in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).
Suppose there are two individuals. Individual 1 has an income of $64,000, a credit rating of 590, and is from the South region. Individual 2 has an income of $135,000, a credit rating of 695, and is from the East region. Use the equation above to show how the odds of maxing out a credit card differ between Individual 1 and Individual 2. Write your answer in terms of \(\beta_0, \beta_1, \beta_2\), etc.
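As a general reference for questions like these, exponentiating both sides of the model equation gives the odds. The labels \(A\) and \(B\) below are hypothetical and stand for any two sets of predictor values.

\[ \frac{\pi_i}{1 - \pi_i} = \exp(\beta_0 + \beta_1 ~ income + \beta_2 ~ rating + \beta_3 ~ regionSouth + \beta_4 ~ regionWest) \]

Dividing the odds for \(A\) by the odds for \(B\), the intercept cancels:

\[ \frac{odds_A}{odds_B} = \exp\{\beta_1 (income_A - income_B) + \beta_2 (rating_A - rating_B) + \beta_3 (regionSouth_A - regionSouth_B) + \beta_4 (regionWest_A - regionWest_B)\} \]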
Exercise 4
We consider adding the interaction between region and income to the current model. We’ll use a drop-in-deviance test to determine whether or not to add the interaction term.
- State the null and alternative hypotheses in words and using mathematical notation.
- Describe what the test statistic \(G\) means in the context of the data.
- Show why the degrees of freedom for the test statistic are equal to 2.
- Conduct the drop-in-deviance test and state your conclusion in the context of the data.
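The mechanics of a drop-in-deviance test can be sketched on simulated data. Everything in `sim` below is made up (only the column names mirror the credit data), so the numbers say nothing about the real analysis. The sketch computes \(G\) by hand and then with `anova()`.

```r
# Simulated stand-in data (assumption: values are invented, not from credit.csv)
set.seed(1)
n <- 300
sim <- data.frame(
  income = runif(n, 10, 150),
  rating = runif(n, 100, 900),
  region = factor(sample(c("East", "South", "West"), n, replace = TRUE))
)
sim$maxed <- factor(rbinom(n, 1, plogis(4 + 0.02 * sim$income - 0.015 * sim$rating)))

# Reduced model (no interaction) and full model (income-by-region interaction)
reduced <- glm(maxed ~ income + rating + region, data = sim, family = "binomial")
full <- glm(maxed ~ income + rating + region + income:region,
            data = sim, family = "binomial")

# Test statistic: drop in deviance from the reduced to the full model
G <- deviance(reduced) - deviance(full)

# Degrees of freedom: number of extra coefficients in the full model
df_diff <- df.residual(reduced) - df.residual(full)

# p-value from the chi-square distribution
p_value <- pchisq(G, df = df_diff, lower.tail = FALSE)

# The same test in one call
anova(reduced, full, test = "Chisq")
```

With a three-level region factor, the interaction adds one coefficient for each non-baseline level, which is why `df_diff` is 2 here.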
# add code here
Exercise 5
Now let’s evaluate the performance of the selected model using the testing data.
Create a confusion matrix using a cutoff probability of 0.3.
# add code here
What is the sensitivity? What does it mean in the context of the data?
What is the specificity? What does it mean in the context of the data?
What is the false positive rate? What does it mean in the context of the data?
What is the false negative rate? What does it mean in the context of the data?
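The four rates above all come from the cells of the confusion matrix. The sketch below shows one way to compute them in base R on simulated data; the objects `sim`, `train`, and `test` and the single predictor `x` are hypothetical stand-ins, not the credit model.

```r
# Simulated data with one predictor (assumption: invented values)
set.seed(2)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 1.5 * sim$x))
train <- sim[1:400, ]
test <- sim[401:500, ]

fit <- glm(y ~ x, data = train, family = "binomial")

# Predicted probabilities on the test set, classified at a cutoff of 0.3
pred_prob <- as.numeric(predict(fit, newdata = test, type = "response"))
pred_class <- factor(as.integer(pred_prob >= 0.3), levels = c(0, 1))
actual <- factor(test$y, levels = c(0, 1))

# Rows are the actual class, columns are the predicted class
conf_mat <- table(actual = actual, predicted = pred_class)
print(conf_mat)

TP <- conf_mat["1", "1"]
FN <- conf_mat["1", "0"]
FP <- conf_mat["0", "1"]
TN <- conf_mat["0", "0"]

sensitivity <- TP / (TP + FN)  # P(predict 1 | actual 1)
specificity <- TN / (TN + FP)  # P(predict 0 | actual 0)
fpr <- FP / (FP + TN)          # false positive rate = 1 - specificity
fnr <- FN / (FN + TP)          # false negative rate = 1 - sensitivity
```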
Exercise 6
Produce the ROC curve.
# add code here
- Describe how you can use this curve to select a cutoff probability (rather than just going with 0.5).
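To see what an ROC curve is built from, the sketch below traces one by hand in base R on simulated data (all objects are hypothetical). The pROC package loaded for this AE provides `roc()` for the same purpose. Picking the cutoff that maximizes sensitivity minus false positive rate (Youden's J) is only one possible selection rule, used here as an assumption.

```r
# Simulated data with one predictor (assumption: invented values)
set.seed(4)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = "binomial")
pred_prob <- as.numeric(predict(fit, type = "response"))

# Sensitivity and false positive rate at every candidate cutoff
thresholds <- sort(unique(c(0, pred_prob, 1)))
roc_points <- as.data.frame(t(sapply(thresholds, function(cut) {
  pred <- pred_prob >= cut
  c(threshold = cut,
    sensitivity = sum(pred & y == 1) / sum(y == 1),
    fpr = sum(pred & y == 0) / sum(y == 0))
})))

# ROC curve: false positive rate on the x-axis, sensitivity on the y-axis
plot(roc_points$fpr, roc_points$sensitivity, type = "l",
     xlab = "False positive rate (1 - specificity)", ylab = "Sensitivity")
abline(0, 1, lty = 2)  # reference line for a classifier no better than chance

# One possible rule: the cutoff with the largest Youden's J
youden_j <- roc_points$sensitivity - roc_points$fpr
best_cutoff <- roc_points[which.max(youden_j), ]
print(best_cutoff)
```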
Exercise 7
Questions about checking conditions for logistic regression:
Do we assess conditions on the training or testing set?
Why do we not consider categorical predictors when checking linearity?
Why do we not need to check constant variance for logistic regression?
Submission
To submit the AE:
Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉