Statistics and Statistical Programming (Fall 2020)/pset7

From CommunityData

This problem set asks you to apply, extend, and interpret the widely influential "bread and peace" model of U.S. electoral behavior from the work of [https://douglas-hibbs.com/ Douglas Hibbs]. In brief, Hibbs argues that two variables almost perfectly predict U.S. presidential election vote-share for incumbent party candidates since 1950: economic growth and U.S. military fatalities (both calculated over the duration of the previous president's term). Since we're doing univariate (one predictor variable) regression this week, I ask you to work with the income measure (predictor) and the incumbent party vote share (outcome).


== Programming challenges ==
=== PC1 Import and update data ===
Data for all U.S. presidential elections 1952-2012 are [https://github.com/avehtari/ROS-Examples/raw/master/ElectionsEconomy/data/hibbs.dat available here]. Note that this points to a ".dat" file, which in this case is just a raw text file that you can import using the following command (inserting the URL for the dataset, as a quoted string, in the appropriate spot): <code>read.table(url(<insert.url.here>), header=TRUE)</code>.


Each row corresponds to one presidential election since 1952. The variables provided are:
* <code>year</code> The year of the presidential election.
* <code>growth</code> Economic growth during the preceding four years (increase in per-capita income).
* <code>vote</code> Proportion of the popular vote won by the incumbent party candidate.
* <code>inc_party_candidate</code> Incumbent party candidate.
* <code>other_candidate</code> Out-party candidate.


The dataset does not include 2016, so we can add that by hand. You might recall that Hillary Clinton was the incumbent party candidate and Donald Trump was the out-party candidate that year. Clinton won approximately 51.1% of the popular vote and a reasonable estimate for per-capita income growth 2012-2016 is 2.2%. You can append this information to the imported dataset in a bunch of different ways. (I would personally do so using a call to <code>list()</code> nested inside a call to <code>rbind()</code> (e.g., <code>rbind(<hibbs_data>, list(<2016 row>))</code>). You could also explore the <code>add_row()</code> function in the tidyverse. As usual, your mileage may vary.)
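
The <code>rbind()</code> approach described above can be sketched as follows. This is only one possible version: the helper name <code>add_2016</code> and the data frame name <code>hibbs</code> are hypothetical, the surname-only candidate labels are an assumption about how the file records candidates, and the values assume that <code>growth</code> and <code>vote</code> are stored in percentage points (check with <code>summary()</code> before appending).

```r
# Hypothetical helper (not from the course materials): append the 2016
# election as a new row, matching the five columns by position.
# Assumptions: growth and vote are in percentage points, and the candidate
# columns are character vectors (the read.table() default in R >= 4.0).
add_2016 <- function(hibbs) {
  new_row <- list(2016, 2.2, 51.1, "Clinton", "Trump")
  rbind(hibbs, new_row)
}

# Usage (needs network access; `hibbs` is an assumed object name):
# hibbs <- read.table(url("https://github.com/avehtari/ROS-Examples/raw/master/ElectionsEconomy/data/hibbs.dat"), header = TRUE)
# hibbs <- add_2016(hibbs)
# tail(hibbs)
```

If the candidate columns were imported as factors (older versions of R), the new names would not match any existing level, so convert those columns with <code>as.character()</code> first.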


=== PC2 Summarize and visualize data ===


You should be familiar with how to do this by now. Make sure to include a scatterplot of <code>growth</code> and <code>vote</code>, with the predictor (<code>growth</code>) on the x-axis.
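
A minimal sketch of one way to do this in base R. The helper name <code>describe_growth_vote</code> is hypothetical, and the choice to put <code>growth</code> on the x-axis (because it is the predictor) and the "(%)" units in the axis labels are assumptions rather than requirements.

```r
# Hypothetical helper (not from the course materials): numeric summaries plus
# a scatterplot with the predictor on the x-axis and the outcome on the y-axis.
describe_growth_vote <- function(hibbs) {
  numeric_summary <- summary(hibbs[, c("growth", "vote")])
  plot(hibbs$growth, hibbs$vote,
       xlab = "Per-capita income growth (%)",    # units are an assumption
       ylab = "Incumbent party vote share (%)",  # units are an assumption
       main = "Vote share by economic growth")
  numeric_summary
}

# Usage (`hibbs` is an assumed name for the imported data frame):
# describe_growth_vote(hibbs)
```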


=== PC3 Calculate covariance and correlation ===
Calculate the covariance and correlation of <code>growth</code> and <code>vote</code>.


See this week's R tutorial for example commands and the Wikipedia articles on correlation and covariance for details about the underlying calculations.
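
As a rough illustration of how the two quantities relate, the sketch below computes the sample covariance and correlation both with R's built-in <code>cov()</code> and <code>cor()</code> and by hand. The helper name <code>cov_and_cor</code> is hypothetical.

```r
# Hypothetical helper (not from the course materials): sample covariance and
# Pearson correlation, computed with built-ins and from the definitions.
cov_and_cor <- function(x, y) {
  n <- length(x)
  # Sample covariance: average cross-product of deviations, using n - 1.
  cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
  # Correlation: covariance rescaled by both standard deviations.
  cor_manual <- cov_manual / (sd(x) * sd(y))
  c(covariance = cov(x, y), correlation = cor(x, y),
    cov_manual = cov_manual, cor_manual = cor_manual)
}

# Usage (`hibbs` is an assumed name for the imported data frame):
# cov_and_cor(hibbs$growth, hibbs$vote)
```

The built-in and manual values should agree. The covariance depends on the units of both variables, while the correlation is unitless and bounded between -1 and 1.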
 
=== PC4 Fit and summarize a linear model ===
 
Use the <code>lm()</code> function to fit a least squares regression of incumbent party vote share on economic growth (that is, with <code>vote</code> as the outcome and <code>growth</code> as the predictor). Use the <code>summary()</code> function to present a summary of the model results.
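
A minimal sketch of the model call, treating <code>vote</code> as the outcome and <code>growth</code> as the predictor as described at the top of the problem set. The helper name <code>fit_vote_model</code> and the object name <code>hibbs</code> are hypothetical.

```r
# Hypothetical helper (not from the course materials): least squares fit with
# vote as the outcome (left of ~) and growth as the predictor (right of ~).
fit_vote_model <- function(hibbs) {
  lm(vote ~ growth, data = hibbs)
}

# Usage (`hibbs` is an assumed name for the imported data frame):
# fit <- fit_vote_model(hibbs)
# summary(fit)  # coefficients, standard errors, R-squared
```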
 
=== PC5 Assess the model fit ===
 
Evaluate the conditions for least squares regression (linearity, normal residuals, constant variability, independent observations). Wherever possible, present plots and/or calculations to support your evaluations. In particular, you probably want to produce the following (examples provided in this week's R tutorial):
* a histogram of the residuals
* a plot of the residuals against the (sequential) values of X
* a quantile-quantile plot
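
The three plots listed above could be produced along these lines. This is a sketch in base R graphics, not the tutorial's code: the helper name <code>plot_diagnostics</code> is hypothetical, and it assumes a model object returned by <code>lm()</code> plus the predictor values used to fit it.

```r
# Hypothetical helper (not from the course materials): draw the three
# diagnostic plots side by side and return the residuals invisibly.
plot_diagnostics <- function(fit, x) {
  res <- residuals(fit)
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  # (a) histogram of the residuals: roughly symmetric and bell-shaped?
  hist(res, main = "Residuals", xlab = "Residual")

  # (b) residuals against X: look for curvature or changing spread
  plot(x, res, xlab = "X", ylab = "Residual", main = "Residuals vs. X")
  abline(h = 0, lty = 2)

  # (c) quantile-quantile plot: points near the line suggest normality
  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Usage (`fit` and `hibbs` are assumed object names):
# plot_diagnostics(fit, hibbs$growth)
```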
 
=== PC6 Calculate confidence interval for a coefficient ===
 
The very last part of ''OpenIntro'' §8 provides detailed instructions for estimating a confidence interval around a regression coefficient. Please calculate the confidence interval for the coefficient on <code>growth</code> from the results of your regression model.
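
One way to carry out that calculation by hand is the point estimate plus or minus a critical t value (with the model's residual degrees of freedom) times the standard error. The sketch below assumes a model object from <code>lm()</code>. The helper name <code>coef_ci</code> is hypothetical.

```r
# Hypothetical helper (not from the course materials): confidence interval
# for one coefficient, computed as estimate +/- t* x standard error.
coef_ci <- function(fit, term, level = 0.95) {
  coef_table <- coef(summary(fit))
  est <- coef_table[term, "Estimate"]
  se <- coef_table[term, "Std. Error"]
  # Critical value from the t distribution with the residual df (n - 2 here).
  t_star <- qt(1 - (1 - level) / 2, df = df.residual(fit))
  c(lower = est - t_star * se, upper = est + t_star * se)
}

# Usage (`fit` is an assumed name for the fitted model):
# coef_ci(fit, "growth")
# confint(fit, "growth")  # built-in equivalent, useful as a check
```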
 
=== PC7 Calculate an out-of-sample prediction and 95% prediction interval ===
 
What was/is the predicted vote share for Donald Trump in 2020 based on this model? The online supplement to ''OpenIntro'' §8 assigned this week provides detailed examples of how to produce an out-of-sample prediction from a regression model. Please calculate the point estimate and 95% prediction interval for the incumbent party candidate's share of the vote in 2020 given that (a [https://osf.io/preprints/socarxiv/xrf3t/ reasonable estimate] of) the per-capita income growth 2016-2020 is 2.5%.
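
A minimal sketch using <code>predict()</code> with <code>interval = "prediction"</code>, which returns the point estimate together with the lower and upper bounds of the prediction interval. The helper name <code>predict_vote</code> is hypothetical, and passing 2.5 assumes that <code>growth</code> is stored in percentage points.

```r
# Hypothetical helper (not from the course materials): point prediction and
# prediction interval for a new value of growth.
predict_vote <- function(fit, growth, level = 0.95) {
  new_data <- data.frame(growth = growth)  # column name must match the model
  predict(fit, newdata = new_data, interval = "prediction", level = level)
}

# Usage (`fit` is an assumed name for the fitted model):
# predict_vote(fit, 2.5)  # columns: fit (estimate), lwr, upr
```

A prediction interval is wider than a confidence interval for the mean response because it also accounts for the residual variation of a single new observation.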
 
== Statistical questions ==
The questions below refer to the univariate regression analysis you completed in the programming challenges above.
 
=== SQ1 Describe and interpret the results ===
Do this for any/all of the analysis you conducted in the programming challenges. In particular, be sure to:
* address any noteworthy observations from the descriptive summaries and plots
* summarize the regression results effectively (including the coefficients and <math>R^2</math> value).
* summarize the confidence interval around the estimate for <code>growth</code> that you calculated.
* provide a substantive interpretation of the results in terms of the variables/concepts included in the analysis.
 
=== SQ2 Discuss regression diagnostics ===
Describe the regression diagnostics and whether the conditions necessary to identify a least-squares fit seem to apply. If there are violations of these assumptions/conditions, consider how that might bias the results.
 
=== SQ3 Disambiguate: correlation vs. covariance vs. OLS estimate ===
You characterized the relationship between <code>growth</code> and <code>vote</code> in three different ways. What do you make of each of these? What are the similarities and differences between them?
 
=== SQ4 Interpret out-of-sample prediction ===
Discuss and interpret the out-of-sample prediction you calculated for Trump's vote share in 2020. As of the writing of the problem set, Trump seems to have received about [https://en.wikipedia.org/w/index.php?title=2020_United_States_presidential_election&oldid=988030609 47.6% of the popular vote]. How does this (not-yet-final) observed value relate to your prediction? How do you interpret this relationship?
 
=== SQ5 Revisit (vaguely stated) theory ===
 
Insofar as we've only considered one part of the "bread and peace" theory here, how would you interpret your results in light of the prior theory/findings as described at the beginning of the problem set? Are there any confounding factors not present in the original theory/models that you think might be important to include? Why would you argue to include them (or not)?