Statistics and Statistical Programming (Fall 2020)/pset2

For this problem set, the programming challenges focus on some of the more advanced fundamentals (is that a thing?) of R, including the new data import, transformation, tidying, and visualization techniques introduced in the weekly R tutorial materials.

The topics/skills covered here include: importing data from .csv files, exploring and describing a new dataset, writing user-defined functions, recoding and tidying variables (logical and factor types), building bivariate tables, and creating bivariate visualizations with ggplot2.

As before, the problem set is structured to model the sort of workflow you might pursue whenever you encounter a new dataset, starting with data import, summary and description of variables of interest, data transformation and tidying, before moving on to more sophisticated analysis and visualization. From here on out, I will assume that you have become familiar with some of the more basic fundamental skills (e.g., creating your R Markdown script or notebook) and that you have some ideas of where to turn for help and more information when you need it.

Programming Challenges

Note: if you have trouble accessing or importing your dataset, please reach out for help ASAP as you will only be able to do the other challenges once you've done that one.

The most recent R tutorials and problem set worked solutions contain example code that should help you do almost everything asked of you here. As always, please seek help online, from your peers, and the teaching team at any time.

PC0. Get started

Create and set up the metadata for a new R Markdown script or notebook for this week's problem set (as usual). Make sure to confirm that R has the working directory location that you want.
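
For example, a quick working directory check in your new script might look like this (the path in the comment is a placeholder; adjust it for your own machine):

  # Confirm where R will look for files:
  getwd()
  # If needed, point R at the folder holding your problem set files
  # (hypothetical path; edit for your own setup):
  # setwd("~/stats-fall2020/pset2")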

PC1. Import data from a .csv file

Revisit your problem set code from pset1 and recall what group number you were in (it should be an integer between 1 and 20). Navigate to the data repository for the course (https://communitydata.science/~ads/teaching/2020/stats/data) and import the .csv file in the week_04 subdirectory with your number in its name (e.g., group_<number>.csv). Note that it is a .csv file, so you'll need to use an appropriate procedure/commands to import it!
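
As a sketch, base R's read.csv() can import the file directly from the course repository. The group number below is a placeholder; substitute your own and double-check the exact filename in the week_04 subdirectory:

  # Import your group's .csv file (group 7 is a placeholder):
  pset2.data <- read.csv("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_07.csv")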

Recommended sub-challenge: Inspect the dataset directly before you import. You might download the .csv file and use spreadsheet software (e.g., Google Docs, LibreOffice, Excel, etc.) to do this. I often prefer to look at the first few lines of a new dataset in a "raw" format via the command line or a text editor (e.g., NotePad) so that I can inspect the structure. This can help you figure out how best to import the data into R and clue you in to any immediate data cleanup/tidying steps you'll need to take after import (e.g., do the columns have headers? are numbers/text formatted differently?). I won't ask about this in class, but I do recommend it.
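
If you would rather stay in R for this kind of raw inspection, readLines() prints the first few lines of the file as plain text (the filename is again a placeholder):

  # Peek at the raw structure: headers? separators? quoting?
  readLines("group_07.csv", n = 5)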

PC2. Explore and describe the data

Take appropriate steps to gain a basic understanding of this dataset. How many columns and rows are there? What classes/types are the variables/columns? What appropriate summary statistics can you provide for each variable (e.g., what are the range, center, and spread of the continuous variables)? Generate univariate tables and visualizations (e.g., boxplots or histograms) to get a sense of what they look like.
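
Here is one possible set of starting points, assuming the data frame is named pset2.data as in the import sketch above:

  dim(pset2.data)              # number of rows and columns
  sapply(pset2.data, class)    # class of each column
  summary(pset2.data)          # range, center, and spread at a glance
  hist(pset2.data$x)           # histogram of a continuous variable
  boxplot(pset2.data$x)        # boxplot of the same variable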

PC3. Use and write user-defined functions

Use the example function my.mean(), distributed in the most recent R tutorial materials, to calculate the mean of the variable (column) named x in your dataset. Now, write your own function to calculate the median of x. Be ready to walk us through how your function works!
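
For reference, here is one way a median function might be written. This is a sketch of one possible approach, not the only correct answer, and it assumes my.mean() has already been loaded from the tutorial materials:

  my.median <- function(v) {
    v <- sort(v[!is.na(v)])    # drop missing values; sort() orders the rest
    n <- length(v)
    if (n %% 2 == 1) {
      v[(n + 1) / 2]                  # odd length: take the middle value
    } else {
      (v[n / 2] + v[n / 2 + 1]) / 2   # even length: average the two middle values
    }
  }

  my.mean(pset2.data$x)     # from the tutorial materials
  my.median(pset2.data$x)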

PC4. Replicate the data import and cleanup from Problem Set #1

Load your vector from Week 2 again and perform the same cleanup steps you did in PC6 and PC7 last week (recode negative values as missing and log-transform the data).
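
In outline, those steps might look like the following; week2.vector is a placeholder for however you named your Week 2 data after loading it:

  # (load your Week 2 vector here, however you did it in pset1)
  week2.vector[week2.vector < 0] <- NA   # recode negative values as missing
  week2.log <- log(week2.vector)         # log-transform the cleaned vector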

PC5. Compare two vectors

Compare the vector x from Problem Set #1 with the first column (x) of the data you imported for Problem Set #2 (the current dataset you just imported from a .csv file). They should be similar, but are they really the same? Write R code to demonstrate or support your answer.
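
A few comparison strategies, again using the placeholder names week2.vector (last week's vector) and pset2.data$x (the new column), and assuming the two have the same length:

  identical(week2.vector, pset2.data$x)   # strictest test: exactly the same values?
  all.equal(week2.vector, pset2.data$x)   # tolerates tiny floating point differences
  summary(week2.vector - pset2.data$x)    # how large are any element-wise differences?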

PC6. Cleanup/tidy your data

Cleaning and recoding data is a very common step when you import and prepare data for analysis, and some of that is needed here. It turns out that the variables i and j are really dichotomous "true/false" variables that have been coded as 0 and 1 in this dataset. Recode these columns as logical (i.e., "TRUE" or "FALSE" values). The variable k is really a categorical variable. Recode it as a factor and replace the numbers with the following values or levels: 0 = "none", 1 = "some", 2 = "lots", 3 = "all". The goal is to end up with a factor where those text strings are the levels of the factor.
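
A sketch of the recoding, again assuming the data frame is named pset2.data:

  # The 0/1 columns become logical TRUE/FALSE values:
  pset2.data$i <- as.logical(pset2.data$i)
  pset2.data$j <- as.logical(pset2.data$j)

  # k becomes a factor with descriptive level labels:
  pset2.data$k <- factor(pset2.data$k,
                         levels = c(0, 1, 2, 3),
                         labels = c("none", "some", "lots", "all"))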

PC7. Create a bivariate table

Now that you have some categorical variables to work with, let's go ahead and create a bivariate table so that you can examine the distributions of some of these values. Use the table() command to create a cross-tabulation of the recoded versions of the k variable and the j variable.
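
With the recoded columns from PC6 in place, the cross-tabulation is a one-liner:

  table(pset2.data$k, pset2.data$j)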

PC8. Create a bivariate visualization

Visualize two variables in the Problem Set #2 dataset using ggplot2 and the geom_point() function to produce a scatterplot. First, plot x on the x-axis and y on the y-axis. Second, visualize the other variables on other dimensions (e.g., color, shape, and size seem reasonable). If you run into any issues plotting these dimensions, revisit the examples in the tutorial and the ggplot2 documentation and consider that ggplot2 can be very picky about the classes of objects...
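
A starting point, assuming the recoded data frame from PC6 (the mappings beyond x and y are suggestions, not the required answer):

  library(ggplot2)

  ggplot(pset2.data, aes(x = x, y = y)) +
    geom_point(aes(color = k, shape = j))

  # Mapping further variables (e.g., size or alpha) works the same way, though
  # ggplot2 may warn when discrete variables are mapped to size.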

Statistical Questions

SQ1

Empirical Paper Questions: Emotional contagion in social networks

Refer to the following (controversial! highly cited! moderately straightforward!) paper to answer the questions below. Please be prepared to identify specific parts of the paper that support your answers. Note that several of the questions below correspond loosely to the questions I have asked you to answer with respect to your research project plan and dataset identification assignment due later this week (that's called "scaffolding" for those of you keeping score at home).

Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111(24):8788–90. [Open access]

EQ1: Research questions and objectives

Restate or describe, in your own words, (a) the main research question of the paper, and (b) the population of interest (the "target population").

EQ2: Sample and experiment design

Describe (a) the sample used in the study, (b) the treatment and control groups, and (c) the experimental manipulation(s).

EQ3: Data and variables

Describe (a) the unit of analysis or cases, and (b) the main variables used and their "types" (e.g., continuous, categorical, etc.; see OpenIntro Chapter 1 for ideas).

EQ4: Results

Summarize the results of the study. There is one figure in the paper (Figure 1). Explain how the figure represents the results.

EQ5: Interpretation and contribution (significance)

(a) Summarize the authors' interpretation of the study results. (b) Discuss whether the results generalize from the sample to the target population. (c) Summarize the core contribution of the paper.