Bootstrapping

STAT 20: Introduction to Probability and Statistics

While you’re waiting

If you’ve been given an index card, please write on it:

  1. Your first name
  2. Your year at Cal (1 is first year, 2 is second year, etc.)
  3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

Concept Questions

Which of these is a valid bootstrap sample?

Original Sample

name  species    length
Gus   Chinstrap  50.7
Luz   Gentoo     48.5
Ida   Chinstrap  52.8
Ola   Gentoo     44.5
Abe   Adelie     42.0

BS A

name  species    length
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Abe   Adelie     42.0
Ola   Gentoo     44.5
Ida   Chinstrap  52.8

BS B

name  species    length
Ola   Gentoo     44.5
Gus   Chinstrap  50.7
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7

BS C

name  species    length
Gus   Chinstrap  50.7
Ola   Gentoo     48.5
Ola   Chinstrap  52.8
Ida   Gentoo     44.5
Ida   Adelie     42.0

BS D

name  species    length
Gus   Chinstrap  50.7
Abe   Adelie     42.0
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
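
The resampling rule behind this question can be sketched in base R: a bootstrap sample draws whole rows from the original sample, with replacement, until it has as many rows as the original. The data frame below retypes the Original Sample table. The object names (`original`, `idx`, `boot`) and the seed are illustrative choices, not part of the slides.

```r
# Original sample, retyped from the table above
original <- data.frame(
  name    = c("Gus", "Luz", "Ida", "Ola", "Abe"),
  species = c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Adelie"),
  length  = c(50.7, 48.5, 52.8, 44.5, 42.0)
)

set.seed(20)  # arbitrary seed so the draw is reproducible

# Draw n row indices with replacement, where n is the original sample size
n <- nrow(original)
idx <- sample(seq_len(n), size = n, replace = TRUE)

# Keep each drawn row intact (name, species, and length stay together)
boot <- original[idx, ]
boot
```

Some rows may repeat and others may not appear at all. What does not change is the number of rows or the pairing of values within a row.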

The Bootstrap

Parameters and Statistics


Our Goal: Assess the sampling variability in our estimates of the median year at Cal and the proportion of students interested in majoring in a business- or econ-related field.


Our Tool: The Bootstrap

Collecting a sample of data

If you’ve been given an index card, please write on it:

  1. Your first name
  2. Your year at Cal (1 is first year, 2 is second year, etc.)
  3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

boardwork

Bootstrapping with Infer

Example: Penguins

Let’s consider our 344 penguins to be a simple random sample (SRS) from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?


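library(tidyverse)       # mutate(), summarize(), ggplot()
library(palmerpenguins)  # assumed source of the `penguins` data frame
penguins <- penguins |>
  mutate(is_adelie = species == "Adelie")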

penguins |>
  ggplot(aes(x = is_adelie)) +
  geom_bar()




Point estimate

obs_stat <- penguins |>
  summarize(p_adelie = mean(is_adelie))
obs_stat
# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

library(infer)
penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows
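
As a rough picture of what generate(type = "bootstrap") is doing here, the same resampling can be sketched in base R. This illustrates the idea and is not a description of infer's internals. The vector `is_adelie` is rebuilt by hand on the assumption that 152 of the 344 penguins are Adelie (consistent with the point estimate of 0.442). The name `one_boot` and the seed are illustrative.

```r
# Stand-in for penguins$is_adelie: assumes 152 Adelie among 344 penguins
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(20)  # arbitrary seed for reproducibility

# One bootstrap sample: 344 draws with replacement from the 344 observed values
one_boot <- sample(is_adelie, size = length(is_adelie), replace = TRUE)

length(one_boot)  # same size as the original sample
mean(one_boot)    # this replicate's sample proportion, p-hat
```

Each run with a different seed gives a different resample, which is why the bootstrap samples shown on these slides do not match one another row for row.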

Two more bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 \(\hat{p}\)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  calculate(stat = "prop")
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

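Note the change in data frame size: generate() returns one row per resampled penguin (9 × 344 = 3,096 rows), while calculate() collapses each replicate to a single row, leaving 9.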

The bootstrap distribution (reps = 500)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% of the bootstrap distribution by finding its 0.025 and 0.975 quantiles with get_ci().

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = .95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
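
To make the quantile step explicit, the percentile interval can also be sketched by hand in base R. This mirrors the percentile idea described above rather than reproducing infer's code. As an assumption, `is_adelie` is rebuilt from 152 Adelie out of 344 penguins (consistent with the point estimate of 0.442). The names `boot_props` and `ci` and the seed are illustrative.

```r
# Stand-in for penguins$is_adelie: assumes 152 Adelie among 344 penguins
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(20)  # arbitrary seed for reproducibility

# 500 bootstrap proportions: resample with replacement, then take the mean
boot_props <- replicate(
  500,
  mean(sample(is_adelie, size = length(is_adelie), replace = TRUE))
)

# Middle 95% of the bootstrap distribution
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

Because the resampling is random, the endpoints will differ slightly from the interval shown above and from one seed to the next.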

Documentation: infer.tidymodels.org

Your Turn

Create a 95% confidence interval for the median bill length of penguins.
