Bootstrapping

STAT 20: Introduction to Probability and Statistics

While you’re waiting

If you’ve been given an index card, please write on it:

Your first name
Your year at Cal (1 is first year, 2 is second year, etc)
Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

Concept Questions

Which of these is a valid bootstrap sample?


Original Sample
name	species	length
Gus	Chinstrap	50.7
Luz	Gentoo	48.5
Ida	Chinstrap	52.8
Ola	Gentoo	44.5
Abe	Adelie	42.0

BS A
name	species	length
Ida	Chinstrap	52.8
Luz	Gentoo	48.5
Abe	Adelie	42.0
Ola	Gentoo	44.5
Ida	Chinstrap	52.8

BS B
name	species	length
Ola	Gentoo	44.5
Gus	Chinstrap	50.7
Ida	Chinstrap	52.8
Luz	Gentoo	48.5
Gus	Chinstrap	50.7
Gus	Chinstrap	50.7

BS C
name	species	length
Gus	Chinstrap	50.7
Ola	Gentoo	48.5
Ola	Chinstrap	52.8
Ida	Gentoo	44.5
Ida	Adelie	42.0

BS D
name	species	length
Gus	Chinstrap	50.7
Abe	Adelie	42.0
Gus	Chinstrap	50.7
Gus	Chinstrap	50.7
Gus	Chinstrap	50.7
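
As a rough sketch of what the question is probing, the code below draws one bootstrap sample from the original sample shown above. It assumes dplyr is installed; the object names `original` and `bs` and the seed are illustrative choices, not part of the course materials. The draw keeps each row intact, samples with replacement, and stops once it has as many rows as the original sample.

```r
library(dplyr)

# The original sample from the concept question
original <- tibble(
  name    = c("Gus", "Luz", "Ida", "Ola", "Abe"),
  species = c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Adelie"),
  length  = c(50.7, 48.5, 52.8, 44.5, 42.0)
)

set.seed(20)  # arbitrary seed, only so the draw is reproducible

# Draw whole rows, with replacement, as many times as the original has rows
bs <- original |>
  slice_sample(n = nrow(original), replace = TRUE)
bs
```

Rerunning the last step without the seed gives a different sample each time; some penguins may appear more than once and others not at all.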

The Bootstrap

Parameters and Statistics

Our Goal: Assess the sampling variability in our estimates of the median year at Cal and the proportion of students interested in a business- or econ-related field.

Our Tool: The Bootstrap

Collecting a sample of data

If you’ve been given an index card, please write on it:

Your first name
Your year at Cal (1 is first year, 2 is second year, etc)
Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

boardwork

Collect index cards from students and record the data into a data frame on the board labelled “Observed sample”. Calculate the sample median and sample proportion of econ-related majors.

Ask for a volunteer to generate the first bootstrap sample. Hand them the stack of cards and have them randomly choose a single card and read off the data to you. As they do so, write out the first row of a “Bootstrap Sample 1” data frame on the board. Be sure to label the row with the student name, which helps emphasize when there are repeats. Have them return the card to the deck, shuffle, and randomly choose another card and read off the data. Repeat until you have filled out the same number of rows as in the original data set. Calculate the median and proportion (you may want to write dplyr code to do this using summarize()).

Ask for a second volunteer to generate the second bootstrap sample. Repeat the procedure as before, drawing a third data frame on the board and computing a second set of statistics (median and proportion).

Collect the bootstrapped medians and proportions and sketch them as the first few points in a broader density plot that would fill in as we take more and more bootstrap samples. Label this as the “bootstrap distribution” and describe it as an approximation to the true sampling distribution. You can explain the 1 - alpha bootstrap interval as the interval that captures the middle 100(1 - alpha)% of the bootstrapped statistics: for alpha = 0.05, that is the middle 95%.
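
The board procedure above can be sketched in R roughly as follows. This is a minimal sketch, not the course's own code: the `observed` data frame holds made-up stand-in values (the real ones come from the index cards), and the helper `one_bootstrap()`, the seed, and the choice of 1000 replicates are all hypothetical.

```r
library(dplyr)

# Hypothetical observed sample; the real values come from the index cards
observed <- tibble(
  name = c("Ana", "Ben", "Cy", "Dee", "Eli", "Fay"),
  year = c(1, 2, 1, 3, 2, 1),
  econ = c(1, 0, 0, 1, 0, 1)
)

# One bootstrap sample: draw n cards with replacement, then summarize
one_bootstrap <- function(data) {
  data |>
    slice_sample(n = nrow(data), replace = TRUE) |>
    summarize(med_year = median(year), p_econ = mean(econ))
}

set.seed(20)  # arbitrary seed for reproducibility

# Repeat many times and stack the statistics: the bootstrap distribution
boot_stats <- bind_rows(
  replicate(1000, one_bootstrap(observed), simplify = FALSE)
)

# Middle 95% of the bootstrapped statistics (alpha = 0.05)
boot_ci <- boot_stats |>
  summarize(
    med_lower = quantile(med_year, 0.025),
    med_upper = quantile(med_year, 0.975),
    p_lower   = quantile(p_econ, 0.025),
    p_upper   = quantile(p_econ, 0.975)
  )
boot_ci
```

Each call to `one_bootstrap()` plays the role of one volunteer working through the deck, and `boot_stats` collects what would be the points sketched on the board.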

Bootstrapping with Infer

Example: Penguins

Let’s consider our 344 penguins to be an SRS from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?

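library(tidyverse)  # provides mutate(), summarize(), and ggplot()
# Assumes `penguins` is already loaded (e.g., from the palmerpenguins package)
penguins <- penguins |>
  mutate(is_adelie = species == "Adelie")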

penguins |>
  ggplot(aes(x = is_adelie)) +
  geom_bar()

Point estimate

obs_stat <- penguins |>
  summarize(p_adelie = mean(is_adelie))
obs_stat

# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

library(infer)
penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")

Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows

Two more bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")

Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")

Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 \(\hat{p}\)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  calculate(stat = "prop")

Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

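Note the change in data frame size: calculate() collapses the 344 rows of each replicate into a single statistic, so 9 replicates yield 9 rows.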

The bootstrap distribution (reps = 500)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% of the bootstrap distribution by finding its 0.025 and 0.975 quantiles; get_ci() does this for us.

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = .95)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
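
To make the link to quantiles concrete, here is a rough sketch that computes the same interval by hand and places it next to the result from get_ci(). It assumes the 344-row `penguins` data comes from the palmerpenguins package (the slides do not say where it is loaded from); the names `boot_dist`, `manual_ci`, and `infer_ci` and the seed are illustrative.

```r
library(dplyr)
library(infer)
library(palmerpenguins)  # assumption: source of the 344-row `penguins` data

set.seed(20)  # arbitrary seed so both intervals use the same replicates

# Build the bootstrap distribution once and store it
boot_dist <- penguins |>
  mutate(is_adelie = species == "Adelie") |>
  specify(response = is_adelie, success = "TRUE") |>
  generate(reps = 500, type = "bootstrap") |>
  calculate(stat = "prop")

# Percentile interval by hand: the 0.025 and 0.975 quantiles of the statistics
manual_ci <- quantile(boot_dist$stat, probs = c(0.025, 0.975))
manual_ci

# The same interval from infer, using its default percentile method
infer_ci <- boot_dist |>
  get_ci(level = 0.95)
infer_ci
```

Because both intervals are computed from the same stored `boot_dist`, they should agree. The exact endpoints will differ a little from the output above, since every run draws a different set of bootstrap samples.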

Documentation: `infer.tidymodels.org`

Your Turn

Create a 95% confidence interval for the median bill length of penguins.
