Statistics and Statistical Programming (Fall 2020)/pset2: Difference between revisions

Revision as of 18:41, 23 September 2020

For this problem set, the programming challenges focus on some of the more advanced fundamentals (is that a thing?) of R, including some of the new types of data import, transformation, tidying, and visualization introduced in the weekly R tutorial materials.

The topics/skills covered here include: <TODO>

As before, the problem set is structured to model the sort of workflow you might pursue whenever you encounter a new dataset: start with data import, then summarize and describe the variables of interest, then transform and tidy the data, before moving on to more sophisticated analysis and visualization. From here on out, I will assume that you have become familiar with the more basic skills (e.g., creating your R Markdown script or notebook) and that you have some idea of where to turn for help and more information when you need it.

Programming Challenges

PC0. Get started

Open up RStudio, create a new file for this assignment (likely an R Markdown script), add relevant metadata (maybe your name, the date, and a title so that you/we know it is Problem Set 2 for this class?), and save it.

PC1. Access and describe a dataset provided in an R library

Load the openintro R package and the counties dataset so that they are available to you. Let's get to know this data! You may already be familiar with it from Chapter 1 of the OpenIntro textbook and a codebook is available on the openintro website.
Find out the class of the counties dataset object.
Find out how many rows and how many columns are in the counties dataset.
Find the names of all of the variables (columns) as well as the class of each of them.
Summarize at least one continuous or discrete numeric variable in the dataset. Calculate the length, range (minimum and maximum), mean, and standard deviation.
Plot a visual summary (maybe a boxplot or a histogram?) for the same numeric variable you summarized in the previous step.
Summarize at least one categorical variable in the dataset (e.g., if the variable takes values of TRUE/FALSE or NA, how many of each value are there?).
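The steps in PC1 can be sketched on a stand-in dataset. This is only an illustration, assuming R's built-in mtcars data frame rather than the openintro data; the names d, x, num_summary, and cat_counts are made up for the example, and the am column is converted to TRUE/FALSE purely to mimic a categorical variable.

```r
# Stand-in data: R's built-in mtcars, NOT the dataset assigned in PC1.
d <- mtcars
d$am <- as.logical(d$am)   # turn one column into TRUE/FALSE for illustration

class(d)                   # class of the dataset object
dim(d)                     # number of rows, then number of columns
names(d)                   # variable (column) names
sapply(d, class)           # class of each variable

# Numeric summary for one variable: length, range, mean, standard deviation
x <- d$mpg
num_summary <- c(n = length(x), min = min(x), max = max(x),
                 mean = mean(x), sd = sd(x))
num_summary

# Visual summary of the same variable (boxplot(x) would also work)
hist(x, main = "Histogram of mpg", xlab = "mpg")

# Counts for a categorical variable; useNA = "always" also reports NAs
cat_counts <- table(d$am, useNA = "always")
cat_counts
```

The same calls should work on any data frame once it is loaded; only the object and column names change.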

PC2. Work with a dataset from the web

Run the following two commands in your R script. Be sure to replace <your.birthdate> with your birthdate in ddmmyy format (e.g., September 21, 2020 would be 210920) or at least something numeric. If you run the commands correctly (or maybe even not), R will return a single random integer between 1 and 20. This integer will be your dataset number for the purposes of PC2:

set.seed(<your.birthdate>)

sample(x = c(1:20), size = 1)
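Why set the seed first? A minimal sketch, assuming the example value 210920 from the text as the seed (seed_value, first_draw, and second_draw are hypothetical names): resetting the same seed replays the same pseudo-random sequence, so the draw is reproducible and your dataset number stays fixed each time you run your script.

```r
seed_value <- 210920   # hypothetical seed: the ddmmyy example from the text

set.seed(seed_value)
first_draw <- sample(x = 1:20, size = 1)

set.seed(seed_value)   # resetting the seed replays the same random sequence
second_draw <- sample(x = 1:20, size = 1)

identical(first_draw, second_draw)   # TRUE: same seed, same draw
first_draw %in% 1:20                 # TRUE: the draw is an integer from 1 to 20
```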

Navigate to the data repository for the course and find the RData file in the week_03 subdirectory with your dataset number from PC2.1 (e.g., group_<output>.Rdata where <output> is replaced with the dataset number).
Load the .Rdata file for your dataset number into R. It should contain one variable. Find that variable!
Calculate summary statistics for your variable. Be sure to include the length, minimum, maximum, mean, and standard deviation.
Create a visualization of your variable: at the very least, create a boxplot or a histogram.
Some of you may have negative numbers. Whoops! This was due to a coding error. Write code to recode all negative numbers as missing (i.e. NA) in your dataset. Now compute the mean and standard deviation again and note any changes.
Log transform your variable (i.e., take the natural logarithm of each value). If your data include zeros or values very close to zero, it may help to add 1 to each value before taking the natural logarithm (the log of 0 is -Inf, which would make the mean and standard deviation meaningless). Calculate the new mean and standard deviation of the transformed variable. Also create a new histogram or boxplot.
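The loading, recoding, and transformation steps above can be sketched as follows. This is a hedged illustration, not the course data: it writes a made-up vector (the hypothetical name mystery) to a temporary .RData file so the example is self-contained, then treats that file as if it had been downloaded.

```r
# Build a hypothetical stand-in file so the sketch runs on its own.
tmp <- tempfile(fileext = ".RData")
mystery <- c(-3, 0, 0.5, 2, 10, -1, 7)   # made-up values, not course data
save(mystery, file = tmp)
rm(mystery)

# load() invisibly returns the names of the objects it restored,
# which is one way to find the variable inside an unfamiliar file.
loaded_names <- load(tmp)
loaded_names
x <- get(loaded_names[1])

# Recode negative values as missing
x[x < 0] <- NA

# na.rm = TRUE drops the NAs; without it mean() and sd() return NA
mean(x, na.rm = TRUE)
sd(x, na.rm = TRUE)

# Natural log after adding 1, so zeros map to 0 instead of -Inf
x_log <- log(x + 1)      # log1p(x) is an equivalent built-in
mean(x_log, na.rm = TRUE)
sd(x_log, na.rm = TRUE)
hist(x_log, main = "Log-transformed values", xlab = "log(x + 1)")
```

Calling ls() before and after load() is another way to spot which object a file added to your environment.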

Statistical Questions

SQ1

Empirical paper questions: Emotional contagion in social networks

Refer to the following (controversial! highly cited! moderately straightforward!) paper to answer the questions below. Please be prepared to identify specific parts of the paper that support your answers. Note that several of the questions below correspond loosely to the questions I have asked you to answer with respect to your research project plan and dataset identification assignment due later this week (that's called "scaffolding" for those of you keeping score at home).

Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111(24):8788–90. [Open access: http://www.pnas.org/content/111/24/8788.full]

EQ1: Research questions and objectives

In your own words, (a) restate or describe the main research question of the paper and (b) identify the population of interest ("target population").

EQ2: Sample and experiment design

Describe (a) the sample used in the study, (b) the treatment and control groups, and (c) the experimental manipulation(s).

EQ3: Data and variables

Describe (a) the unit of analysis or cases and (b) the main variables used and their "types" (e.g., continuous, categorical; see OpenIntro Chapter 1 for ideas).

EQ4: Results

Summarize the results of the study. There is one figure in the paper (Figure 1). Explain how the figure represents the results.

EQ5: Interpretation and contribution (significance)

(a) Summarize the authors' interpretation of the study results. (b) Discuss whether the results generalize from the sample to the target population. (c) Summarize the core contribution of the paper.

@@ Line 40: / Line 40: @@
 == Empirical paper questions: Emotional contagion in social networks ==
-Refer to the following (controversial! highly cited! moderately straightforward!) paper to answer the questions below. Please be prepared to identify specific parts of the paper that support your answers. Note that the questions below correspond loosely to the questions I have asked you to answer with respect to your [[Statistics_and_Statistical_Programming_(Fall_2020)#Research_project_plan_and_dataset_identification|research project plan and dataset identification assignment]] due later this week (that's called "scaffolding" for those of you keeping score at home).
+Refer to the following (controversial! highly cited! moderately straightforward!) paper to answer the questions below. Please be prepared to identify specific parts of the paper that support your answers. Note that several of the questions below correspond loosely to the questions I have asked you to answer with respect to your [[Statistics_and_Statistical_Programming_(Fall_2020)#Research_project_plan_and_dataset_identification|research project plan and dataset identification assignment]] due later this week (that's called "scaffolding" for those of you keeping score at home).
 :Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Open access]]