Baseball
The Data
The source of data for this lab is the public Lahman database. This database contains a number of data sets with different units of observation! Below are the first few rows and some of the columns for two of these data sets: Teams and Batting. They contain data going back to 1871! Use these excerpts to help you answer the following questions.
Teams data frame:
| yearID | teamID | franchID | G | W | L | R | RA | name |
|---|---|---|---|---|---|---|---|---|
| 2000 | MON | WSN | 162 | 67 | 95 | 738 | 902 | Montreal Expos |
| 1981 | PIT | PIT | 103 | 46 | 56 | 407 | 425 | Pittsburgh Pirates |
| 1945 | WS1 | MIN | 156 | 87 | 67 | 622 | 562 | Washington Senators |
| 1922 | SLN | STL | 154 | 85 | 69 | 863 | 819 | St. Louis Cardinals |
| 1904 | CHA | CHW | 156 | 89 | 65 | 600 | 482 | Chicago White Sox |
| 1896 | BLN | BLO | 132 | 90 | 39 | 995 | 662 | Baltimore Orioles |
Batting data frame:
| playerID | yearID | teamID | G | AB | R | H | BB | SO |
|---|---|---|---|---|---|---|---|---|
| tekotbl01 | 2012 | SDN | 11 | 15 | 0 | 2 | 0 | 4 |
| liriafr01 | 2010 | MIN | 31 | 2 | 0 | 0 | 0 | 1 |
| streehu01 | 2007 | OAK | 48 | 0 | 0 | 0 | 0 | 0 |
| eldreca01 | 2003 | SLN | 63 | 2 | 0 | 1 | 0 | 1 |
| konerpa01 | 1998 | CIN | 26 | 73 | 7 | 16 | 6 | 10 |
| hollica01 | 1921 | DET | 35 | 48 | 4 | 13 | 3 | 4 |
Questions
Question 1
part a
What is the unit of observation for the Teams data set?
part b
What about for the Batting data set?
Question 2
part a
Write out a question about baseball that could be answered purely through the information in the Teams data set (using numerical summaries, plots, etc.).
part b
Do the same thing, but for the Batting data set.
Question 3
For each subpart below, you will form a predictive question that you could answer using the data frames above. Identify a response variable for each.
part a - classification question, Batting data frame
part b - regression question, Batting data frame
part c - classification question, Teams data frame
part d - regression question, Teams data frame
For the remainder of the lab, we’ll be working with the Teams data frame. It can be accessed through the Lahman library.
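As a minimal sketch of getting started, the snippet below loads the package and looks at the columns shown in the Teams excerpt. It assumes the Lahman package is already installed; the name `teams_excerpt` is only illustrative and not part of the lab.

```r
# Sketch only: assumes the Lahman package is installed,
# e.g. via install.packages("Lahman")
library(Lahman)

# Teams is a data frame that ships with the package
data(Teams)

# Keep just the columns shown in the excerpt (teams_excerpt is a hypothetical name)
teams_excerpt <- Teams[, c("yearID", "teamID", "franchID",
                           "G", "W", "L", "R", "RA", "name")]

head(teams_excerpt)   # first six rows
dim(Teams)            # number of rows and columns in the full data frame
```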
Question 4
Subset the Teams data set further to only include years from 2000 to the present day. This is the data set that you’ll use for the remainder of this lab. However, there might be another year after 2000 that you want to filter out: which one and why?
Question 5
part a
Plot the relationship between runs and wins using ggplot2 code. Place runs on the x-axis and wins on the y-axis.
part b
Describe the relationship between the two variables, specifically commenting on the form, direction, and the strength of association.
Question 6
part a
Fit a simple linear model to predict wins by runs and save it into m1.
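For reference, here is a sketch of the `lm()` formula syntax on a different pair of variables: losses (L) predicted by runs allowed (RA), using only the six rows of the Teams excerpt. The names `teams_small` and `m_demo` are hypothetical, and six rows from seasons of different lengths are far too few to draw any conclusion from; the point is only the `response ~ predictor` pattern.

```r
# Six rows copied from the Teams excerpt; illustration only
teams_small <- data.frame(
  RA = c(902, 425, 562, 819, 482, 662),
  L  = c(95, 56, 67, 69, 65, 39)
)

# Formula syntax is response ~ predictor; m_demo is a hypothetical name
m_demo <- lm(L ~ RA, data = teams_small)

coef(m_demo)               # estimated intercept and slope
summary(m_demo)$r.squared  # proportion of variance in L explained by RA
```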
part b
Write out the equation for the linear model (using the estimated coefficients) in mathematical form.
part c
Calculate the \(R^2\) value of the linear model and interpret it in the context of the problem in a sentence.
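As a reminder, one common way to write \(R^2\) for a least-squares model with an intercept is shown below. The symbols are assumptions introduced here rather than notation from the lab: \(y_i\) is the observed number of wins for team-season \(i\), \(\hat{y}_i\) is the model's fitted value, and \(\bar{y}\) is the mean of the observed wins.

\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

The numerator is the variation left unexplained by the model and the denominator is the total variation in wins, so \(R^2\) is the proportion of variation in the response that the model accounts for.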
Question 7
part a
Fit a multiple linear regression model to predict wins using runs *and* runs allowed (RA) and save it as m2.
part b
Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.
part c
How does this model compare to the simple linear regression from the previous question in terms of predictive power?
Question 8
part a
Fit a third, more complex model to predict wins and call it m3. This model should use:
- at least three predictor variables
- at least one non-linear transformation or polynomial term.
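The sketch below shows how R formulas express polynomial and transformed terms, using `model.matrix()` to display the columns a formula would generate rather than fitting anything. The particular terms are an arbitrary illustration on the six excerpt rows, not a suggested model, and `teams_small` is a hypothetical name.

```r
# Six rows copied from the Teams excerpt; illustration only
teams_small <- data.frame(
  G  = c(162, 103, 156, 154, 156, 132),
  R  = c(738, 407, 622, 863, 600, 995),
  RA = c(902, 425, 562, 819, 482, 662)
)

# I() protects arithmetic such as ^ inside a formula;
# functions like log() can be written directly
X <- model.matrix(~ G + R + I(RA^2) + log(R), data = teams_small)

colnames(X)  # one column per term, plus the intercept
```

The same right-hand side can be passed to `lm()` after a response variable, for example `y ~ x1 + I(x2^2)`.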
part b
Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.
Question 9
Revisit the definition of causation. If your predictive model has a positive coefficient between one of the predictors and the response, is that evidence that if you increase that predictor variable for a given observation, the response variable will increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not? Answer in at least three sentences. There’s more than one possible answer, so make sure to justify your reasoning.