Baseball

The Data

The data for this lab come from the public Lahman database, which contains a number of data sets with different units of observation. Below are a few rows and some of the columns from two of these data sets: Teams and Batting. They contain data going back to 1871! Use these excerpts to help you answer the following questions.

Get the data

```r
library(Lahman)
```

You can now get information on the data sets with the help functions `?Teams` and `?Batting`.
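As a minimal sketch of a first look at the data (assuming the Lahman package is installed; the column selection simply mirrors the excerpts shown in this lab), you might run something like:

```r
library(Lahman)

# Open the documentation pages (interactive sessions only)
# ?Teams
# ?Batting

# Size of each data frame: number of rows, then number of columns
dim(Teams)
dim(Batting)

# Peek at the columns used in this lab
head(Teams[, c("yearID", "teamID", "franchID", "G", "W", "L", "R", "RA", "name")])
head(Batting[, c("playerID", "yearID", "teamID", "G", "AB", "R", "H", "BB", "SO")])
```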

For Teams, some key variables are:

G: number of Games played that season
W: number of Wins that season
L: number of Losses that season
R: number of Runs (points, basically) the team scored that season, across all games.
RA: number of Runs Allowed (points scored by the opponents) across all games that season.

For Batting, some key variables are:

G: number of Games played that season by the player. Not every player plays in every game in a season, so this is the number of games that this person actually played in.
AB: number of At Bats (number of times the player went up to the plate and tried to hit the ball)
H: number of Hits (number of times the player successfully got on base by hitting the ball)
SO: number of Strike Outs (number of times the player went up to the plate and missed the ball too many times)
BB: number of Bases on Balls, also called walks (number of times the player went up to the plate and got a free pass to first base because the pitcher threw too many bad pitches)
R: number of Runs (points) directly scored by the player that season. This means getting all the way around the bases.

`Teams` data frame:

yearID	teamID	franchID	G	W	L	R	RA	name
2000	MON	WSN	162	67	95	738	902	Montreal Expos
1981	PIT	PIT	103	46	56	407	425	Pittsburgh Pirates
1945	WS1	MIN	156	87	67	622	562	Washington Senators
1922	SLN	STL	154	85	69	863	819	St. Louis Cardinals
1904	CHA	CHW	156	89	65	600	482	Chicago White Sox
1896	BLN	BLO	132	90	39	995	662	Baltimore Orioles
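To experiment without loading the full data, the excerpt can be rebuilt by hand. The sketch below is only an illustration: `teams_excerpt` and `undecided` are made-up names, and the values are copied from the six rows above. It compares games played with wins plus losses, which do not always match (most likely because tied games count toward G but not toward W or L).

```r
# Values copied from the Teams excerpt above
teams_excerpt <- data.frame(
  yearID = c(2000, 1981, 1945, 1922, 1904, 1896),
  teamID = c("MON", "PIT", "WS1", "SLN", "CHA", "BLN"),
  G = c(162, 103, 156, 154, 156, 132),
  W = c(67, 46, 87, 85, 89, 90),
  L = c(95, 56, 67, 69, 65, 39)
)

# Games that are recorded as neither a win nor a loss
teams_excerpt$undecided <- teams_excerpt$G - (teams_excerpt$W + teams_excerpt$L)
teams_excerpt
```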

`Batting` data frame:

playerID	yearID	teamID	G	AB	R	H	BB	SO
tekotbl01	2012	SDN	11	15	0	2	0	4
liriafr01	2010	MIN	31	2	0	0	0	1
streehu01	2007	OAK	48	0	0	0	0	0
eldreca01	2003	SLN	63	2	0	1	0	1
konerpa01	1998	CIN	26	73	7	16	6	10
hollica01	1921	DET	35	48	4	13	3	4
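The Batting columns can be combined into new quantities. As a rough sketch (the names `batting_excerpt` and `hits_per_ab` are hypothetical, and the values are copied from the excerpt above), the code below computes hits per at-bat and leaves the value missing for a player with no at-bats, since dividing by zero would not give a meaningful rate.

```r
# Values copied from the Batting excerpt above
batting_excerpt <- data.frame(
  playerID = c("tekotbl01", "liriafr01", "streehu01",
               "eldreca01", "konerpa01", "hollica01"),
  AB = c(15, 2, 0, 2, 73, 48),
  H = c(2, 0, 0, 1, 16, 13)
)

# Hits per at-bat; NA when the player never batted (AB == 0)
batting_excerpt$hits_per_ab <- ifelse(
  batting_excerpt$AB > 0,
  batting_excerpt$H / batting_excerpt$AB,
  NA
)
batting_excerpt
```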

Questions

Question 1

part a

What is the unit of observation for the Teams data set?

part b

What about for the Batting data set?

Question 2

part a

Write out a question about baseball that could be answered purely through the information in the Teams data set (numerical summaries, plots, etc.).

part b

Do the same thing, but for the Batting data set.

Question 3

For each subpart below, you will form a predictive question that you could answer using the data frames above. Identify a response variable for each.

part a - classification question, `Batting` data frame

part b - regression question, `Batting` data frame

part c - classification question, `Teams` data frame

part d - regression question, `Teams` data frame

For the remainder of the lab, we’ll be working with the Teams data frame. It can be accessed through the Lahman package.

Question 4

Subset the Teams data set to include only the years from 2000 to the present day. This is the data set that you’ll use for the remainder of this lab. However, there might be a year after 2000 that you want to filter out: which one, and why?
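Filtering rows on a year column can be sketched on the excerpt shown earlier. This is an illustration under stated assumptions, not the lab's answer: `teams_excerpt`, `recent`, and `without_1981` are hypothetical names, the rows are copied from the Teams excerpt, and 1981 is dropped purely as an example of excluding a single year. The same `subset()` pattern should carry over to the full `Teams` data frame.

```r
# A few columns copied from the Teams excerpt
teams_excerpt <- data.frame(
  yearID = c(2000, 1981, 1945, 1922, 1904, 1896),
  teamID = c("MON", "PIT", "WS1", "SLN", "CHA", "BLN"),
  G = c(162, 103, 156, 154, 156, 132)
)

# Keep only rows from 2000 onward
recent <- subset(teams_excerpt, yearID >= 2000)

# Drop one specific year (1981 is just an example here)
without_1981 <- subset(teams_excerpt, yearID != 1981)

recent
without_1981
```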

Question 5

part a

Plot the relationship between runs and wins using ggplot2 code. Place runs on the x-axis and wins on the y-axis.
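A scatterplot of this kind can be sketched as follows, assuming the ggplot2 package is installed. The data frame `teams_excerpt` and the object `p` are hypothetical names, and the six points come from the Teams excerpt rather than from the filtered data set you are asked to plot.

```r
library(ggplot2)

# Runs and wins copied from the Teams excerpt
teams_excerpt <- data.frame(
  R = c(738, 407, 622, 863, 600, 995),
  W = c(67, 46, 87, 85, 89, 90)
)

# One point per row; the x and y variables are mapped inside aes()
p <- ggplot(teams_excerpt, aes(x = R, y = W)) +
  geom_point() +
  labs(x = "Runs scored", y = "Wins")
p
```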

part b

Describe the relationship between the two variables, specifically commenting on the form, direction, and the strength of association.

Question 6

part a

Fit a simple linear model to predict wins by runs and save it into m1.

part b

Write out the equation for the linear model (using the estimated coefficients) in mathematical form.

part c

Calculate the \(R^2\) value of the linear model and interpret it in the context of the problem in a sentence.
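The mechanics of fitting a simple linear model and reading off its \(R^2\) can be sketched on the six excerpt rows. This is a toy illustration with hypothetical names (`teams_excerpt`, `toy_fit`), so the coefficients it produces are not the ones the question asks for. It also recomputes \(R^2\) by hand as one minus the ratio of the residual sum of squares to the total sum of squares, which should agree with the value from `summary()`.

```r
# Runs and wins copied from the Teams excerpt
teams_excerpt <- data.frame(
  R = c(738, 407, 622, 863, 600, 995),
  W = c(67, 46, 87, 85, 89, 90)
)

# Simple linear regression of wins on runs
toy_fit <- lm(W ~ R, data = teams_excerpt)
coef(toy_fit)  # estimated intercept and slope

# R^2 as reported by summary()
r2_summary <- summary(toy_fit)$r.squared

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(toy_fit)^2)
ss_tot <- sum((teams_excerpt$W - mean(teams_excerpt$W))^2)
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_summary, r2_manual)
```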

Question 7

part a

Fit a multiple linear regression model to predict wins using runs *and* runs allowed (RA) and save it as m2.
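In R's formula syntax, additional predictors are joined with `+`. A minimal sketch on the excerpt rows, with the hypothetical names `teams_excerpt` and `toy_fit2` (six rows are far too few for a meaningful fit, so treat the output as a syntax check only):

```r
# Wins, runs, and runs allowed copied from the Teams excerpt
teams_excerpt <- data.frame(
  W = c(67, 46, 87, 85, 89, 90),
  R = c(738, 407, 622, 863, 600, 995),
  RA = c(902, 425, 562, 819, 482, 662)
)

# Multiple linear regression: two predictors joined with +
toy_fit2 <- lm(W ~ R + RA, data = teams_excerpt)

coef(toy_fit2)               # intercept plus one slope per predictor
summary(toy_fit2)$r.squared  # proportion of variance in W explained
```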

part b

Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.

part c

How does this model compare to the simple linear regression from the previous question in terms of predictive power?

Question 8

part a

Fit a third, more complex model to predict wins and call it m3. This model should use:

at least three predictor variables

at least one non-linear transformation or polynomial term.
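One detail of R's formula syntax matters for polynomial terms: inside a formula, `^` is a formula operator rather than arithmetic, so a squared term needs `I()` or `poly()`. The sketch below shows the difference on the excerpt rows. All object names are hypothetical, and it deliberately uses a single predictor, so it is not a model that meets the requirements above.

```r
# Runs and wins copied from the Teams excerpt
teams_excerpt <- data.frame(
  W = c(67, 46, 87, 85, 89, 90),
  R = c(738, 407, 622, 863, 600, 995)
)

# Without I(), R^2 is read as formula syntax and reduces to plain R
fit_no_I <- lm(W ~ R^2, data = teams_excerpt)

# I() protects the arithmetic, so the predictor is really R squared
fit_with_I <- lm(W ~ I(R^2), data = teams_excerpt)

# poly() adds both the linear and the quadratic term (orthogonal by default)
fit_poly <- lm(W ~ poly(R, 2), data = teams_excerpt)

names(coef(fit_no_I))
names(coef(fit_with_I))
names(coef(fit_poly))
```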

part b

Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.

Question 9

Revisit the definition of causation. If your predictive model has a positive coefficient between one of the predictors and the response, is that evidence that if you increase that predictor variable for a given observation, the response variable will increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not? Answer in at least three sentences. There’s more than one possible answer, so make sure to justify your reasoning.
