SLR: Permutation test for the slope

Author

Prof. Maria Tackett

Published

Jan 23, 2025

Announcements

HW 01 due Tuesday, January 28 at 11:59pm
Labs resume on Monday

Topics

Evaluate a claim about the slope using hypothesis testing
Construct a null distribution using simulation
Define mathematical models to conduct inference for slope

Computational setup

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(usdata)      # for the county_2019 dataset
library(openintro)   # for Duke Forest dataset
library(scales)      # for pretty axis labels
library(glue)        # for constructing character strings
library(knitr)       # for neatly formatted tables
library(kableExtra)  # also for neatly formatted tablesf


# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

Recap of last lecture

Data: Duke Forest houses

The regression model

df_fit <- lm(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 2)

term	estimate	std.error	statistic	p.value
(Intercept)	116652.33	53302.46	2.19	0.03
area	159.48	18.17	8.78	0.00

. . .

Slope: For each additional square foot, we expect the sale price of Duke Forest houses to be higher by $159, on average.

Inference for simple linear regression

Calculate a confidence interval for the slope, $\beta_1$
Conduct a hypothesis test for the slope, $\beta_1$

Statistical inference

Image source: Eugene Morgan © Penn State

Sampling is natural

When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis
If you generalize and conclude that your entire soup needs salt, that’s an inference
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

Confidence interval via bootstrapping

Bootstrap new samples from the original sample
Fit models to each of the samples and estimate the slope
Use features of the distribution of the bootstrapped slopes to construct a confidence interval

Bootstrapping pipeline I

set.seed(210)

duke_forest |>
  specify(price ~ area)

Response: price (numeric)
Explanatory: area (numeric)
# A tibble: 98 × 2
     price  area
     <dbl> <dbl>
 1 1520000  6040
 2 1030000  4475
 3  420000  1745
 4  680000  2091
 5  428500  1772
 6  456000  1950
 7 1270000  3909
 8  557450  2841
 9  697500  3924
10  650000  2173
# ℹ 88 more rows

Bootstrapping pipeline II

set.seed(210)

duke_forest |>
  specify(price ~ area) |>
  generate(reps = 1000, type = "bootstrap")

Response: price (numeric)
Explanatory: area (numeric)
# A tibble: 98,000 × 3
# Groups:   replicate [1,000]
   replicate   price  area
       <int>   <dbl> <dbl>
 1         1  290000  2414
 2         1  285000  2108
 3         1  265000  1300
 4         1  416000  2949
 5         1  541000  2740
 6         1  525000  2256
 7         1 1270000  3909
 8         1  265000  1300
 9         1  815000  3904
10         1  535000  2937
# ℹ 97,990 more rows

Bootstrapping pipeline III

set.seed(210)

duke_forest |>
  specify(price ~ area) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate term      estimate
       <int> <chr>        <dbl>
 1         1 intercept   80699.
 2         1 area          168.
 3         2 intercept  -18821.
 4         2 area          205.
 5         3 intercept  234297.
 6         3 area          117.
 7         4 intercept  134481.
 8         4 area          150.
 9         5 intercept   23861.
10         5 area          190.
# ℹ 1,990 more rows

Bootstrapping pipeline IV

set.seed(210)

boot_dist <- duke_forest |>
  specify(price ~ area) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

Visualize the bootstrap distribution

boot_dist |>
  filter(term == "area") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(binwidth = 10)

Compute the CI

But first…

obs_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()

obs_fit

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept  116652.
2 area          159.

Compute 95% confidence interval

boot_dist |>
  get_confidence_interval(
    point_estimate = obs_fit,
    level = 0.95,
    type = "percentile"
  )

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          91.7     211.
2 intercept -18290.   287711.

Hypothesis test for the slope

Research question and hypotheses

“Do the data provide sufficient evidence that $\beta_1$ (the true slope for the population) is different from 0?”

. . .

Null hypothesis: there is no linear relationship between area and price

\[ H_0: \beta_1 = 0 \]

. . .

Alternative hypothesis: there is a linear relationship between area and price

\[ H_a: \beta_1 \ne 0 \]

Hypothesis testing as a US court trial

Null hypothesis, $H_0$: Defendant is innocent
Alternative hypothesis, $H_a$: Defendant is guilty
Present the evidence: Collect data
Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
- Yes: Fail to reject $H_0$
- No: Reject $H_0$

Hypothesis testing framework

Start with a null hypothesis, $H_0$ that represents the status quo
Set an alternative hypothesis, $H_a$ that represents the research question, i.e. claim we’re testing
Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of getting the observed or a more extreme outcome given that the null hypothesis is true)
- if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
- if they do, then reject the null hypothesis in favor of the alternative

Quantify the variability of the slope

for testing

Two approaches:
1. Via simulation
2. Via mathematical models
Use Permutation to quantify the variability of the slope for the purpose of testing, under the assumption that the null hypothesis is true:
- Simulate new samples from the original sample via permutation
- Fit models to each of the samples and estimate the slope
- Use features of the distribution of the permuted slopes to conduct a hypothesis test

Permutation, described

Use permuting to simulate data under the assumption the null hypothesis is true and measure the natural variability in the data due to sampling, not due to variables being correlated
- Permute one variable to eliminate any existing relationship between the variables
Each price value is randomly assigned to the area of a given house, i.e. area and price are no longer matched for a given house

# A tibble: 98 × 3
   price_Observed price_Permuted  area
            <dbl>          <dbl> <dbl>
 1        1520000         342500  6040
 2        1030000         750000  4475
 3         420000         645000  1745
 4         680000         697500  2091
 5         428500         428500  1772
 6         456000         481000  1950
 7        1270000         610000  3909
 8         557450         680000  2841
 9         697500         485000  3924
10         650000         105000  2173
# ℹ 88 more rows

Permutation, visualized

Each of the observed values for area (and for price) exist in both the observed data plot as well as the permuted price plot
The permutation removes the relationship between area and price

Permutation, repeated

Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)

Concluding the hypothesis test

Is the observed slope of $\hat{\beta_1} = 159$ (or an even more extreme slope) a likely outcome under the null hypothesis that $\beta = 0$? What does this mean for our original question: “Do the data provide sufficient evidence that $\beta_1$ (the true slope for the population) is different from 0?”

Permutation pipeline I

set.seed(1125)

duke_forest |>
  specify(price ~ area)

Response: price (numeric)
Explanatory: area (numeric)
# A tibble: 98 × 2
     price  area
     <dbl> <dbl>
 1 1520000  6040
 2 1030000  4475
 3  420000  1745
 4  680000  2091
 5  428500  1772
 6  456000  1950
 7 1270000  3909
 8  557450  2841
 9  697500  3924
10  650000  2173
# ℹ 88 more rows

Permutation pipeline II

set.seed(1125)

duke_forest |>
  specify(price ~ area) |>
  hypothesize(null = "independence")

Response: price (numeric)
Explanatory: area (numeric)
Null Hypothesis: independence
# A tibble: 98 × 2
     price  area
     <dbl> <dbl>
 1 1520000  6040
 2 1030000  4475
 3  420000  1745
 4  680000  2091
 5  428500  1772
 6  456000  1950
 7 1270000  3909
 8  557450  2841
 9  697500  3924
10  650000  2173
# ℹ 88 more rows

Permutation pipeline III

set.seed(1125)

duke_forest |>
  specify(price ~ area) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute")

Response: price (numeric)
Explanatory: area (numeric)
Null Hypothesis: independence
# A tibble: 98,000 × 3
# Groups:   replicate [1,000]
     price  area replicate
     <dbl> <dbl>     <int>
 1  465000  6040         1
 2  481000  4475         1
 3 1020000  1745         1
 4  520000  2091         1
 5  592000  1772         1
 6  650000  1950         1
 7  473000  3909         1
 8  705000  2841         1
 9  785000  3924         1
10  671500  2173         1
# ℹ 97,990 more rows

Permutation pipeline IV

set.seed(1125)

duke_forest |>
  specify(price ~ area) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate term       estimate
       <int> <chr>         <dbl>
 1         1 intercept 553355.  
 2         1 area           2.35
 3         2 intercept 635824.  
 4         2 area         -27.3 
 5         3 intercept 536072.  
 6         3 area           8.57
 7         4 intercept 598649.  
 8         4 area         -13.9 
 9         5 intercept 556202.  
10         5 area           1.33
# ℹ 1,990 more rows

Permutation pipeline V

set.seed(1125)

null_dist <- duke_forest |>
  specify(price ~ area) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

Visualize the null distribution

null_dist |>
  filter(term == "area") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(binwidth = 10, color = "white")

Reason around the p-value

In a world where the there is no relationship between the area of a Duke Forest house and in its price ($\beta_1 = 0$), what is the probability that we observe a sample of 98 houses where the slope fo the model predicting price from area is 159 or even more extreme?

Compute the p-value

What does this warning mean?

get_p_value(
  null_dist,
  obs_stat = obs_fit,
  direction = "two-sided"
)

Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.

# A tibble: 2 × 2
  term      p_value
  <chr>       <dbl>
1 area            0
2 intercept       0

Application exercise

📋 https://sta210-sp25.netlify.app/ae/ae-05-sim-testing.html

Mathematical models for inference

The regression model, revisited

df_fit <- lm(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	116652.325	53302.463	2.188	0.031
area	159.483	18.171	8.777	0.000

Inference, revisited

Earlier we computed a confidence interval and conducted a hypothesis test via simulation:
- CI: Bootstrap the observed sample to simulate the distribution of the slope
- HT: Permute the observed sample to simulate the distribution of the slope under the assumption that the null hypothesis is true
Now we’ll do these based on theoretical results, i.e., by using the Central Limit Theorem to define the distribution of the slope and use features (shape, center, spread) of this distribution to compute bounds of the confidence interval and the p-value for the hypothesis test

Mathematical representation of the model

\[ \begin{aligned} Y &= \text{Model} + \text{Error} \\ &= f(X) + \epsilon \\ &= \mu_{Y|X} + \epsilon \\ &= \beta_0 + \beta_1 X + \epsilon \end{aligned} \]

where the errors are independent and normally distributed (now we are making an assumption about the distribution of the error terms)

. . .

independent: Knowing the error term for one observation doesn’t tell you anything about the error term for another observation
normally distributed: $\epsilon \sim N(0, \sigma_\epsilon^2)$

Simple linear regression model fully specified

\[ \mathbf{Y} = \beta_0 + \beta_1X + \epsilon, \quad \epsilon \sim N(0, \sigma^2_{\epsilon}) \]

$\beta_0$: the population intercept
$\beta_1$: the population slope
$\epsilon$ : independent and identically distributed (i.i.d.) error terms

Mathematical representation, visualized

\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

Image source: *Introduction to the Practice of Statistics (5th ed)*

Mean: $\beta_0 + \beta_1 X$, the expected value of $Y$ based on the regression model
Variance: $\sigma_\epsilon^2$, constant across the range of $X$
- How do we estimate $\sigma_\epsilon^2$?

Regression standard error

Once we fit the model, we can use the residuals to estimate the regression standard error (the spread of the distribution of the response, for a given value of the predictor variable):

\[ \hat{\sigma}_\epsilon = \sqrt{\frac{\sum_\limits{i=1}^n(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_\limits{i=1}^ne_i^2}{n-2}} \]

. . .

Why divide by $n - 2$?
Why do we care about the value of the regression standard error?

Standard error of $\hat{\beta}_1$

\[ SE_{\hat{\beta}_1} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{(n-1)s_X^2}} \]

. . .

or…

term	estimate	std.error	statistic	p.value
(Intercept)	116652.33	53302.46	2.19	0.03
area	159.48	18.17	8.78	0.00

Announcements

Topics

Computational setup

Recap of last lecture

Data: Duke Forest houses

The regression model

Inference for simple linear regression

Statistical inference

Sampling is natural

Confidence interval via bootstrapping

Bootstrapping pipeline I

Bootstrapping pipeline II

Bootstrapping pipeline III

Bootstrapping pipeline IV

Visualize the bootstrap distribution

Compute the CI

But first…

Compute 95% confidence interval

Hypothesis test for the slope

Research question and hypotheses

Hypothesis testing as a US court trial

Hypothesis testing framework

Quantify the variability of the slope

Permutation, described

Permutation, visualized

Permutation, repeated

Concluding the hypothesis test

Permutation pipeline I

Permutation pipeline II

Permutation pipeline III

Permutation pipeline IV

Permutation pipeline V

Visualize the null distribution

Reason around the p-value

Compute the p-value

Application exercise

Mathematical models for inference

The regression model, revisited

Inference, revisited

Mathematical representation of the model

Simple linear regression model fully specified

Mathematical representation, visualized

Regression standard error

Standard error of \(\hat{\beta}_1\)