Taxonomy of Data

STAT 20: Introduction to Probability and Statistics

Concept Questions

There’s no escape from the bird…

Images as data

  • Images are composed of pixels (this image is 1520 by 1012)

  • The color in each pixel is in RGB

  • Each band takes a value from 0-255

  • This image is data with 1520 x 1012 x 3 values.
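
The count on this slide can be checked in R with a three-dimensional array. This is only a sketch: the pixel values are random placeholders rather than the actual photo, the name img is made up for illustration, and it assumes that 1520 is the width and 1012 is the height.

```r
# Assumed layout: 1012 rows (height) x 1520 columns (width) x 3 bands (R, G, B).
# The values are random placeholders, not the real image.
set.seed(20)
img <- array(sample(0:255, 1012 * 1520 * 3, replace = TRUE),
             dim = c(1012, 1520, 3))

dim(img)      # rows, columns, bands
length(img)   # total number of values: 1520 * 1012 * 3
range(img)    # smallest and largest value stored
img[1, 1, ]   # the three band values of the top-left pixel
```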

A shoebill with a duck in its mouth.

Grayscale

  • Grayscale images have only one band
  • 0 is black, 255 is white
  • This image is data with 1520 x 1012 x 1 values.

A shoebill with a duck in its mouth in grayscale.

Grayscale

  • To simplify, assume our photos are 8 x 8 grayscale images.

An 8 x 8 grayscale image
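
A single-band image of this size fits in an ordinary R matrix. The sketch below uses random placeholder values and a made-up name, small_img; it is meant only to show how one such image could be stored.

```r
# One 8 x 8 grayscale image: a matrix with one value (0-255) per pixel.
# Values are random placeholders for illustration.
set.seed(20)
small_img <- matrix(sample(0:255, 64, replace = TRUE), nrow = 8, ncol = 8)

dim(small_img)      # 8 rows, 8 columns
length(small_img)   # 64 pixel values in total
small_img[1, 1]     # brightness of the top-left pixel
```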

Images in a Data Frame

Consider the following images which are our data:

  • Let’s simplify them to 8 x 8 grayscale images

Images in a Data Frame

If you were to put the data from these (8 x 8 grayscale) images into a data frame, what would the dimensions of that data frame be in rows x columns? Answer at pollev.com.

01:00

Concept Question

A note on variables

There are three things that “variable” could be referring to:

  1. a phenomenon
  2. how the phenomenon is being recorded or measured into data
    • what values can it take? (this is often an intent- or value-laden exercise!)
    • for numerical variables, what unit should we express it in?
  3. how the recorded data is being analyzed
    • might you bin (discretize) income data? What are the consequences of this?
  • For the following question, you may work under the second definition.
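
To make the binning question in point 3 concrete, here is a minimal sketch using cut(). The income values, the break points, and the labels are all made up for illustration; different break points would give a different variable.

```r
# Hypothetical annual incomes in dollars (invented for this example).
income <- c(18000, 42000, 55000, 71000, 130000)

# Bin the numbers into three ordered categories.
# The break points are arbitrary choices, which is part of the point.
income_bin <- cut(income,
                  breaks = c(0, 40000, 100000, Inf),
                  labels = c("low", "middle", "high"),
                  ordered_result = TRUE)

income_bin          # each income replaced by its category
table(income_bin)   # how many incomes fall in each category
```

After binning, the exact dollar amounts are gone: 42000 and 71000 both become "middle", so any analysis that needs the difference between them is no longer possible.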

What type of variable is age?

For each of the following scenarios where age could be a variable, choose the most appropriate variable type according to the Taxonomy of Data.

  1. Ages of television audiences/demographics
  2. Ages of UC Berkeley students
  3. The age of a rock

Answer at pollev.com.

01:00

Problem Set 1: Taxonomy of Data

20:00

Break

05:00

R Workshop

  • Time to make a series of educated guesses. Close your laptops!

Educated Guess 1

What will happen here?


Answer at pollev.com/<name>


1 + "one"
01:00

Educated Guess 2

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 2, 3, 4)
sqrt(log(a))
01:00

Educated Guess 3

What will happen here?


Answer at pollev.com/<name>


a <- 1 + 2
a + 1
01:00

Educated Guess 4

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 3.14, "seven")
class(a)
01:00

Functions on vectors

A vector is the simplest structure used in R to store data. It can be created using the function c().

my_vector <- c(1, 3, 4)
my_vector
[1] 1 3 4

A function operates on an R object and produces output. R has many of the mathematical functions that you would expect.

sum(my_vector)
[1] 8
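
A few more functions applied to the same vector, as a small sketch. Some return a single number that summarizes the vector, while others return one result per element.

```r
my_vector <- c(1, 3, 4)

length(my_vector)   # number of elements in the vector
min(my_vector)      # smallest element
my_vector * 2       # arithmetic is applied to each element
sqrt(my_vector)     # square root of each element
```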

Your Turn

  1. Create a vector named vec with the even integers between 1 and 10 as well as the number 99 (six elements total).

  2. Find the sum of that vector.

  3. Find the max of that vector.

  4. Take the mean of that vector and round it to the nearest integer.

These should all be solved with R code. If you don’t know the name of a function to use, hazard a guess and look for its help file (e.g. ?sum), or search for it online.

05:00

Working in a qmd file

Working in a new .qmd file allows you to save your code for later.

Demo

  1. Create a new qmd file from the RStudio menu, name it, and save it.
  2. Insert a new code cell.
  3. Write your code into the cell.
  4. Render the document.

Building a data frame

You can combine vectors into a data frame using data.frame().

bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")


penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)
penguins_df
  bill_depth_mm bill_length_mm species
1          15.0           47.5  Gentoo
2          17.1           40.2  Adelie
3          18.7           39.0  Adelie
4          18.9           35.3  Adelie
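
Once the data frame exists, a few base R functions are handy for inspecting it. The sketch below rebuilds penguins_df from the same three vectors so that it runs on its own.

```r
bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")
penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)

dim(penguins_df)                   # number of rows, then number of columns
names(penguins_df)                 # the column (variable) names
penguins_df$species                # pull out one column as a vector
mean(penguins_df$bill_length_mm)   # apply a function to a single column
str(penguins_df)                   # compact summary of each column's type
```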

Your Turn

  1. Create a new .qmd file, name it, and save it.

  2. Insert a new code cell.

  3. Create three vectors, name, hometown, and sibs_and_pets, that contain observations on those variables from 6 people in this class.

  4. Combine them into a data frame called my_classmates.

06:00