Draft

Baseball

The Data

The source of data for this lab is the public Lahman database. This database contains a number of data sets with different units of observation! Below are the first few rows and some of the columns for two of these data sets: Teams and Batting. They contain data going back to 1871! Use these excerpts to help you answer the following questions.

Teams data frame:

yearID  teamID  franchID  G    W   L   R    RA   name
2000    MON     WSN       162  67  95  738  902  Montreal Expos
1981    PIT     PIT       103  46  56  407  425  Pittsburgh Pirates
1945    WS1     MIN       156  87  67  622  562  Washington Senators
1922    SLN     STL       154  85  69  863  819  St. Louis Cardinals
1904    CHA     CHW       156  89  65  600  482  Chicago White Sox
1896    BLN     BLO       132  90  39  995  662  Baltimore Orioles

Batting data frame:

playerID   yearID  teamID  G   AB  R  H   BB  SO
tekotbl01  2012    SDN     11  15  0  2   0   4
liriafr01  2010    MIN     31  2   0  0   0   1
streehu01  2007    OAK     48  0   0  0   0   0
eldreca01  2003    SLN     63  2   0  1   0   1
konerpa01  1998    CIN     26  73  7  16  6   10
hollica01  1921    DET     35  48  4  13  3   4

Questions

Question 1

part a

What is the unit of observation for the Teams data set?

part b

What about for the Batting data set?

Question 2

part a

Write out a question about baseball that could be answered purely through the information in the Teams data set (using numerical summaries, plots, etc.).

part b

Do the same thing, but for the Batting data set.

Question 3

For each subpart below, you will form a predictive question that you could answer using the data frames above. Identify a response variable for each.

part a - classification question, Batting data frame

part b - regression question, Batting data frame

part c - classification question, Teams data frame

part d - regression question, Teams data frame


For the remainder of the lab, we’ll be working with the Teams data frame. It can be accessed through the Lahman library.
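As a minimal sketch of how the data might be loaded, assuming the Lahman package is already installed (for example via install.packages("Lahman")), the following attaches the package and looks at the columns shown in the excerpt above:

```r
# Assumption: the Lahman package is installed on your machine.
library(Lahman)

# Make the Teams data frame available in the workspace.
data(Teams)

# Inspect only the columns that appear in the excerpt above.
excerpt_cols <- c("yearID", "teamID", "franchID", "G", "W", "L", "R", "RA", "name")
head(Teams[, excerpt_cols])
```

The full data frame has many more columns than the excerpt; names(Teams) lists them all.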

Question 4

Subset the Teams data set further to only include years from 2000 to the present day (this is the data set that you’ll use for the remainder of this lab). However, there might be another year post-2000 that you might want to filter out: which one and why?
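If the mechanics of subsetting are unfamiliar, here is a small sketch on the six excerpt rows from earlier in the lab. The data frame name teams_excerpt and the cutoff year 1945 are chosen purely for illustration; they are not the values the question asks for.

```r
# The six Teams rows shown in the excerpt (a subset of the columns).
teams_excerpt <- data.frame(
  yearID = c(2000, 1981, 1945, 1922, 1904, 1896),
  teamID = c("MON", "PIT", "WS1", "SLN", "CHA", "BLN"),
  W      = c(67, 46, 87, 85, 89, 90),
  L      = c(95, 56, 67, 69, 65, 39),
  R      = c(738, 407, 622, 863, 600, 995),
  RA     = c(902, 425, 562, 819, 482, 662)
)

# Keep only the rows whose year satisfies a logical condition.
# The cutoff 1945 is an illustrative choice only.
recent <- subset(teams_excerpt, yearID >= 1945)
recent
```

The same pattern works with any logical condition on the columns, including conditions joined with & to exclude particular years.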

Question 5

part a

Plot the relationship between runs and wins using ggplot2 code. Place runs on the x-axis and wins on the y-axis.
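For reference, a scatterplot in ggplot2 generally follows the pattern sketched below. To leave the question itself to you, this sketch plots a different pair of variables (runs allowed against losses) on the six excerpt rows only; the names teams_excerpt and p are illustrative, and it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# The six Teams rows shown in the excerpt (a subset of the columns).
teams_excerpt <- data.frame(
  yearID = c(2000, 1981, 1945, 1922, 1904, 1896),
  L      = c(95, 56, 67, 69, 65, 39),
  RA     = c(902, 425, 562, 819, 482, 662)
)

# Map one variable to each axis, then add a layer of points.
p <- ggplot(teams_excerpt, aes(x = RA, y = L)) +
  geom_point() +
  labs(x = "Runs allowed (RA)", y = "Losses (L)")
p
```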

part b

Describe the relationship between the two variables, specifically commenting on the form, direction, and the strength of association.

Question 6

part a

Fit a simple linear model to predict wins from runs and save it as m1.
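As a reminder of the syntax, a simple linear model in R can be fit with lm() using a formula of the form response ~ predictor. The sketch below uses a different pair of variables (losses and runs allowed) and only the six excerpt rows, so its numbers are not meaningful for this question; teams_excerpt and fit are illustrative names.

```r
# The six Teams rows shown in the excerpt (a subset of the columns).
teams_excerpt <- data.frame(
  L  = c(95, 56, 67, 69, 65, 39),
  RA = c(902, 425, 562, 819, 482, 662)
)

# Fit losses as a linear function of runs allowed.
fit <- lm(L ~ RA, data = teams_excerpt)

# Estimated intercept and slope.
coef(fit)

# R-squared is stored in the model summary.
r2 <- summary(fit)$r.squared
r2
```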

part b

Write out the equation for the linear model (using the estimated coefficients) in mathematical form.

part c

Calculate the \(R^2\) value of the linear model and interpret it in the context of the problem in a sentence.
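For reference, one common way to write \(R^2\) for a least-squares model is shown below. The symbols are generic notation introduced here, not taken from the lab: \(y_i\) is an observed response, \(\hat{y}_i\) is the model's fitted value, and \(\bar{y}\) is the mean of the observed responses.

\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

The numerator is the variation left unexplained by the model and the denominator is the total variation in the response, so \(R^2\) is the proportion of variation in the response that the model accounts for.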

Question 7

part a

Fit a multiple linear regression model to predict wins using runs *and* runs allowed (RA) and save it as m2.

part b

Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.

part c

How does this model compare to the simple linear regression from the previous question in terms of predictive power?

Question 8

part a

Fit a third, more complex model to predict wins and call it m3. This model should use:

  • at least three predictor variables
  • at least one non-linear transformation or polynomial term.
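As a syntax hint for the second requirement only, polynomial terms in an R formula need to be wrapped in I(), because ^ has a special meaning inside formulas. The sketch below fits a quadratic in runs on the six excerpt rows; it is not a complete m3 (it has a single underlying predictor), and teams_excerpt and fit_quad are illustrative names.

```r
# The six Teams rows shown in the excerpt (a subset of the columns).
teams_excerpt <- data.frame(
  W = c(67, 46, 87, 85, 89, 90),
  R = c(738, 407, 622, 863, 600, 995)
)

# I(R^2) adds a squared term; without I(), ^ would be read as formula syntax.
fit_quad <- lm(W ~ R + I(R^2), data = teams_excerpt)

# Three coefficients: intercept, linear term, squared term.
coef(fit_quad)
```

Other transformations, such as log(), can be written directly inside the formula in the same way.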

part b

Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.

Question 9

Revisit the definition of causation. If your predictive model has a positive coefficient for one of the predictors, is that evidence that increasing that predictor for a given observation will cause the response variable to increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not? Answer in at least three sentences. There’s more than one possible answer, so make sure to justify your reasoning.