Conditioning

STAT 20: Introduction to Probability and Statistics

Concept Questions

c("fruit", "fruit", "vegetable") == "fruit"

What will this line of code return?

01:00

Evaluating equivalence, cont.

In R, this evaluation happens element-wise when operating on vectors.

c("fruit", "fruit", "vegetable") == "fruit"
[1]  TRUE  TRUE FALSE
c("fruit", "fruit", "vegetable") != "fruit"
[1] FALSE FALSE  TRUE
c("fruit", "vegetable", "boba") %in% c("fruit", "vegetable")
[1]  TRUE  TRUE FALSE
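
The difference between `==` and `%in%` matters most when the right-hand side holds more than one value. The sketch below uses made-up vectors (`snacks` and `allowed` are illustrative names, not part of the class survey) to show that `==` compares position by position, recycling the shorter vector, while `%in%` checks membership.

```r
# Hypothetical vectors, chosen only for illustration
snacks  <- c("vegetable", "fruit", "boba")
allowed <- c("fruit", "vegetable")

# %in% asks, for each element of snacks, "is it anywhere in allowed?"
in_result <- snacks %in% allowed

# == compares element by element and recycles `allowed` to length 3,
# so snacks is compared against c("fruit", "vegetable", "fruit").
# R warns about the length mismatch; the warning is silenced here.
eq_result <- suppressWarnings(snacks == allowed)

in_result  # TRUE TRUE FALSE
eq_result  # FALSE FALSE FALSE
```

This is why the filters in these slides use `%in%` when matching against several values.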

Question 2

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE)

Which observations will be included in the resulting data frame?

01:00

Question 3

Which data frame will have fewer rows?

# A
filter(class_survey, year == "This is my first semester!")

# B
class_survey |>
  mutate(first_sem = (year == "This is my first semester!")) |>
  filter(first_sem)
01:00

Building data pipelines

Consider the subset of students here:

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE)

How do we find the average of these students’ estimates of the chance that class will be disrupted by a new COVID variant?

Let’s look at three different ways to answer this question.

Nesting

filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE)

Nesting

select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid)

Nesting

summarize(select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid),
       covid_avg = mean(covid))
# A tibble: 1 × 1
  covid_avg
      <dbl>
1     0.428

Cons

  • Must be read from inside out 👎
  • Hard to keep track of arguments 👎

Pros

  • All in one line of code 👍
  • Only refer to one data frame 👍

Step-by-step

df1 <- filter(class_survey, 
              coding_exp_scale < 3,
              olympics %in% c("Ice skating", "Speed skating"),
              is_entrepreneur == TRUE)
df2 <- select(df1, 
              coding_exp_scale,
              olympics,
              is_entrepreneur,
              covid)
summarize(df2,
          covid_avg = mean(covid))

Cons

  • Have to repeat data frame names 👎
  • Creates unnecessary objects 👎

Pros

  • Stores intermediate objects 👍
  • Can be read top to bottom 👍

Using the pipe operator

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE) |>
  select(coding_exp_scale,
         olympics,
         is_entrepreneur,
         covid) |>
  summarize(covid_avg = mean(covid))

Cons

  • 🤷

Pros

  • Can be read like an English paragraph 👍
  • Only type the data frame name once 👍
  • No leftover objects 👍
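
To see that the pipe is only a different way of writing nested calls, here is a minimal sketch using a made-up numeric vector `x` (an assumption for illustration, not survey data). It relies on the native pipe `|>`, which requires R 4.1 or later.

```r
# Hypothetical numeric vector for illustration
x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- sum(sqrt(x))

# Piped: read top to bottom; x |> sqrt() is the same call as sqrt(x)
piped <- x |>
  sqrt() |>
  sum()

# Both forms produce the same value
identical(nested, piped)
```

The left-hand side of `|>` becomes the first argument of the function on the right, which is why each dplyr verb in the pipeline can omit the data frame name.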

Understanding your pipeline

It’s good practice to understand the output of each line of code by breaking the pipe.

class_survey |>
  select(covid) |>
  filter(year == "It's my first year.")
Error in `filter()`:
ℹ In argument: `year == "It's my first year."`.
Caused by error in `year == "It's my first year."`:
! comparison (==) is possible only for atomic and list types
class_survey |>
  select(covid)
# A tibble: 816 × 1
   covid
   <dbl>
 1  0   
 2  0.5 
 3  0.6 
 4  0.7 
 5 NA   
 6  0.15
 7  0.7 
 8  0   
 9  0.8 
10 NA   
# ℹ 806 more rows
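
Breaking the pipe shows that only the `covid` column survives `select()`, so there is no `year` column left for `filter()` to use. One way to repair the pipeline is to filter first and select afterwards. The sketch below assumes the dplyr package is installed and uses a small made-up data frame, `toy_survey`, as a stand-in for `class_survey`.

```r
library(dplyr)

# Hypothetical stand-in for class_survey with only two columns
toy_survey <- tibble(
  year  = c("It's my first year.", "It's my second year.", "It's my first year."),
  covid = c(0.2, 0.5, 0.6)
)

# Filter on year while the column still exists, then keep only covid
first_years <- toy_survey |>
  filter(year == "It's my first year.") |>
  select(covid)

first_years
```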

Concept Question

class_survey |> # A
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", 
                         "Speed skating"),
         is_entrepreneur == TRUE) |> # B
  select(coding_exp_scale,
         olympics,
         is_entrepreneur,
         covid) |> # C
  summarize(covid_avg = mean(covid)) # D
# note
dim(class_survey)
[1] 816  30

What are the dimensions (rows x columns) of the data frames output at each stage of this pipe?

01:00

summarize(class_survey, mean(year == "I'm in my first year."))

What will this line of code return?

20:00

Boolean Algebra

Logical vectors have a dual representation as TRUE/FALSE and 1/0, so you can do math on logicals accordingly.

TRUE + TRUE
[1] 2
TRUE * TRUE
[1] 1

Taking the mean of a logical vector is equivalent to finding the proportion of rows that are TRUE (i.e. the proportion of rows that meet the condition).
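
As a small illustration of this idea, the sketch below builds a logical vector from a made-up character vector (`food` is a hypothetical name, not a survey column) and then counts and averages it.

```r
# Hypothetical vector for illustration
food <- c("fruit", "fruit", "vegetable", "fruit")

# Element-wise comparison gives a logical vector
is_fruit <- food == "fruit"   # TRUE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0
n_fruit    <- sum(is_fruit)   # number of TRUEs: 3
prop_fruit <- mean(is_fruit)  # proportion of TRUEs: 0.75
```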

Worksheet: Conditioning

20:00

Break

05:00

Lab Part I: Flights

25:00