Lab 7: Elections
ga_2020 <- read_csv("https://raw.githubusercontent.com/openelections/openelections-data-ga/refs/heads/master/2020/20201103__ga__general.csv")
The Data
On June 12, 2009, Iran held a presidential election in which President (and conservative leader) Mahmoud Ahmadinejad sought re-election against three challengers. The foremost of these was Mir-Hossein Mousavi, a reformist and former prime minister. Mousavi ran a more left-leaning campaign that sought rapid political evolution. When it was announced that Ahmadinejad had won handily, there were widespread allegations of election fraud, especially from supporters of Mousavi and another like-minded challenger, Mehdi Karroubi. These allegations spilled over into protests known as the Green Movement. Whether or not election fraud actually took place is still contested.
In this lab, we will explore the possibility of election fraud using the results from the 2009 Iranian presidential election, which are contained in the iran data frame within the stat20data library.
Question 1
What is the unit of observation in the iran data frame? Take a good look at the rows of the data frame before answering this!
Question 2
part a
Visualize the distribution of the vote counts for Ahmadinejad with ggplot2 code. Label and title your axes.
part b
Calculate an appropriate measure of center for the data.
part c
Calculate an appropriate measure of spread for the data.
part d
Using parts a-c, describe the distribution of vote counts for Ahmadinejad in one to two sentences.
Question 3
One common theory on how to determine whether an election is fair is as follows:
In a normally occurring, fair election, the first digit of the vote counts for each voting precinct should follow Benford’s Law. If they do not, that might suggest that vote counts have been manually altered.
Benford’s Law is not a universal, binding statute but rather a probability distribution on the set \(\{1,2,3,4,5,6,7,8,9\}\). Let \(Y\) be a random variable following the Benford’s Law probability distribution. The probability mass function of \(Y\) is as follows:
\[f(y) = \log_{10}\left(1 + \frac{1}{y}\right)\]
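As a quick check that this formula defines a valid probability distribution, the nine probabilities can be shown to sum to 1, because the product inside the logarithm telescopes:
\[\sum_{y=1}^{9} \log_{10}\left(1 + \frac{1}{y}\right) = \sum_{y=1}^{9} \log_{10}\left(\frac{y+1}{y}\right) = \log_{10}\left(\frac{2}{1}\cdot\frac{3}{2}\cdots\frac{10}{9}\right) = \log_{10}(10) = 1\]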
part a
Create two vectors:
The first vector, y, contains the values 1, 2, 3, …, 9.
The second vector, f_y, contains the values of \(f(1), f(2), f(3), \ldots, f(9)\). See if you can look up the name of a function that can help you with this in the Taxonomy of Data tutorial!
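One possible way to build the two vectors is sketched below. It assumes that seq() is an acceptable way to generate the digits (the tutorial may have a different function in mind) and relies on the fact that log10() is vectorized in R.

```r
# Digits 1 through 9: the possible values of Y
y <- seq(from = 1, to = 9, by = 1)

# Benford's Law probabilities f(1), ..., f(9);
# log10() is applied to every element of y at once
f_y <- log10(1 + 1 / y)

# Sanity check: the probabilities should add up to 1 (up to rounding)
sum(f_y)
```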
part b
Use these vectors to calculate \(E(Y)\), \(Var(Y)\), and \(SD(Y)\).
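A minimal sketch of this calculation, using the usual definitions \(E(Y) = \sum y f(y)\) and \(Var(Y) = \sum (y - E(Y))^2 f(y)\), might look like the following. The names e_y, var_y, and sd_y are illustrative choices, not names the lab requires.

```r
y <- 1:9
f_y <- log10(1 + 1 / y)

# Expected value: each digit weighted by its probability
e_y <- sum(y * f_y)

# Variance: probability-weighted squared deviations from the mean
var_y <- sum((y - e_y)^2 * f_y)

# Standard deviation: square root of the variance
sd_y <- sqrt(var_y)

c(mean = e_y, variance = var_y, sd = sd_y)
```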
part c
What might 366 draws (the number of rows in the iran data frame) from \(Y\) (the Benford’s Law probability distribution) look like? To find out, use your vectors from part a.
Write R code to take a random sample of size 366 from \(Y\).
Save your result into a vector called benfords_sim.
Create a one-column data frame called benfords_df.
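The sampling step can be sketched with base R's sample(), drawing with replacement and passing the Benford probabilities through the prob argument. The seed value and the column name first_digit are assumptions made here for illustration; the lab does not specify either.

```r
set.seed(20)  # arbitrary seed, used only to make the draw reproducible

y <- 1:9
f_y <- log10(1 + 1 / y)

# 366 independent draws from Y, each digit chosen with probability f_y
benfords_sim <- sample(y, size = 366, replace = TRUE, prob = f_y)

# One-column data frame; the column name is a hypothetical choice
benfords_df <- data.frame(first_digit = benfords_sim)

head(benfords_df)
```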
part d
Using benfords_df and ggplot2 code, create an empirical histogram of the values in the simulation. Label your axes and title your plot “Benford’s Law Simulation”.
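One way such a plot could be drawn is sketched below. It rebuilds a simulated data frame so the example runs on its own, and it assumes the column is named first_digit (a hypothetical name). A bin width of 1 gives each digit its own bar.

```r
library(ggplot2)

set.seed(20)  # arbitrary seed for reproducibility
y <- 1:9
benfords_df <- data.frame(
  first_digit = sample(y, size = 366, replace = TRUE, prob = log10(1 + 1 / y))
)

benfords_plot <- ggplot(benfords_df, aes(x = first_digit)) +
  geom_histogram(binwidth = 1, color = "white") +  # one bar per digit
  scale_x_continuous(breaks = 1:9) +               # label every digit
  labs(
    x = "First digit",
    y = "Count",
    title = "Benford's Law Simulation"
  )

benfords_plot
```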
part e
Describe what you see in the visualization in one to two sentences.
Question 4
What do the first digit empirical distributions look like for the four candidates in the Iranian presidential election?
part a
Add a new column to benfords_df which contains the first digit of Ahmadinejad’s vote counts. To help you, you may make use of the get_first() function which exists inside the stat20data package.
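To illustrate what extracting a first digit involves, here is a base R stand-in. It is not the get_first() function from stat20data, whose exact arguments should be checked in that package's help page; first_digit_of() and toy_votes are hypothetical names, and the counts are made up.

```r
# Hypothetical helper: returns the leading digit of each non-negative count
first_digit_of <- function(x) {
  # Convert to text, keep the first character, convert back to an integer
  as.integer(substr(as.character(x), 1, 1))
}

toy_votes <- c(7, 42, 1985, 100000)  # made-up vote counts
first_digit_of(toy_votes)
```

A helper like this can be used on a vote-count column when adding a new column to a data frame, provided the column has one value per row.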
part b
Make a visualization of the first digit of the vote counts for Ahmadinejad. Label your axes and title the plot “Ahmadinejad”.
part c
Repeat the steps of parts a–b for the remaining three candidates in the data frame. Then, combine the four plots into a single visualization using the patchwork library.
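Combining plots with patchwork can be sketched as follows, using two plots built from made-up digits. The data frame toy_df, its columns, and the candidate labels are placeholders; the same pattern extends to four plots by adding more terms.

```r
library(ggplot2)
library(patchwork)

set.seed(1)  # arbitrary seed; the digits below are made up
toy_df <- data.frame(
  digit_a = sample(1:9, size = 50, replace = TRUE),
  digit_b = sample(1:9, size = 50, replace = TRUE)
)

plot_a <- ggplot(toy_df, aes(x = digit_a)) +
  geom_bar() +
  labs(x = "First digit", y = "Count", title = "Candidate A")

plot_b <- ggplot(toy_df, aes(x = digit_b)) +
  geom_bar() +
  labs(x = "First digit", y = "Count", title = "Candidate B")

# patchwork overloads `+` so that adding two ggplots places them side by side;
# plot_layout() controls the grid (use ncol = 2 with four plots for a 2 x 2 grid)
combined <- plot_a + plot_b + plot_layout(ncol = 2)

combined
```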
part d
How well do the observed first digit distributions from this question follow the one you created in Question 3 by sampling from Benford’s Law? Answer in two to three sentences.
Question 5
We will now take a look at precinct-level data from a presidential election in the United States; specifically, the 2020 election results from Georgia, where Joe Biden pulled off a shocking victory. This particular result was at the forefront of incumbent president Donald Trump’s claims of election fraud (these claims have been widely debunked by independent news agencies). You can read in the ga_2020 data frame, which contains these results, by copying the read_csv() call given at the top of this lab into your document.
The data comes from OpenElections, a project which has compiled election results for all fifty states across many different election cycles, including 2020 and 2024. Take a look at the data before you proceed!
part a
Write code to create a plot for the first digit distributions of votes for Joe Biden and for Donald Trump. Combine the plots into a single visualization using the patchwork library.
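Precinct-level election files are often stored in long format, with one row per precinct and candidate, so the rows for each candidate need to be picked out before first digits are taken. The sketch below shows that step on a made-up data frame. The column names candidate and votes and the candidate spellings are assumptions; check names(ga_2020) and the values in the real data, which may also include rows for other offices.

```r
# Made-up long-format results: one row per precinct and candidate
toy_results <- data.frame(
  precinct  = c("P1", "P1", "P2", "P2", "P3", "P3"),
  candidate = c("Joseph R. Biden", "Donald J. Trump",
                "Joseph R. Biden", "Donald J. Trump",
                "Joseph R. Biden", "Donald J. Trump"),
  votes     = c(312, 845, 1290, 97, 56, 2210)
)

# Keep only the rows whose candidate name contains "Biden"
biden_rows <- toy_results[grepl("Biden", toy_results$candidate), ]

# Leading digit of each vote count (assumes non-negative counts)
biden_rows$first_digit <- as.integer(substr(as.character(biden_rows$votes), 1, 1))

# Tally the digits, keeping all nine levels even when some are absent
table(factor(biden_rows$first_digit, levels = 1:9))
```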
part b
How well do the observed first digit distributions from this question (the U.S. election in 2020) follow:
the one you created in Question 3 by sampling from Benford’s Law?
the ones you created in Question 4 (Iranian election in 2009)?
Address each comparison in at least two sentences.
Question 6
part a
Assuming the Benford’s Law theory is true, do the results of Questions 4 and 5 suggest the votes in the Iranian election may have been manually altered? Answer in two to three sentences.
part b
Assuming the Benford’s Law theory is true, do the results of Question 5 suggest the Georgia votes in the U.S. presidential election may have been manually altered? Answer in two to three sentences.
part c
Comment on the extent to which you believe Benford’s Law is suitable for detecting election fraud. Answer in two to three sentences.