Statistics and Statistical Programming (Fall 2020)/pset2
<div class="noautonum">__TOC__</div>
<small>[[Statistics_and_Statistical_Programming_(Fall_2020)#Week_4_.2810.2F6.2C_10.2F8.29|← Back to Week 4]]</small>


For this problem set, the programming challenges focus on some of the more advanced fundamentals of R, including some of the new types of data import, transformation, tidying, and visualization introduced in the most recent R tutorial. These are followed by some questions about an empirical paper that focus on applying some of the concepts from the first few chapters of ''OpenIntro'' to a research context that may be familiar.


== Programming Challenges ==
 
The programming challenges below ask you to perform a series of fairly typical data import, exploration, tidying, and descriptive analysis steps. Once again, you'll work with some "fake" data that Aaron created to ensure consistency and illustrate some useful points. The most recent R tutorials and problem set worked solutions contain example code that should help you do almost everything asked of you here. From this point forward, we will start to assume that you have become familiar with the more fundamental skills (e.g., creating your R Markdown script or notebook) and that you have some ideas of where to turn for help and more information when you need it. That said, you should always seek whatever help you need at any time, whether online, from your peers, or from the teaching team.


''Note: if you have trouble accessing or importing your dataset, please reach out for help ASAP as you will only be able to do the other challenges once you've done that one.''


=== PC0. Get started ===
Create and set up the metadata for a new RMarkdown script or notebook for this week's problem set (as usual). Make sure to confirm that R has the working directory location that you want.
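
If it helps, a minimal check might look like the following sketch (the path in the comment is purely hypothetical; substitute your own):

<syntaxhighlight lang="r">
## Confirm where R thinks it is working; change it if needed.
getwd()
# setwd("~/teaching/stats/psets")   # hypothetical path -- use your own
</syntaxhighlight>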


=== PC1. Import data from a .csv file ===
Revisit your problem set code from [[Statistics_and_Statistical_Programming_(Fall_2020)/pset1|Problem Set #1]] and recall what group number you were in (it should be an integer between 1 and 20). Navigate to the [https://communitydata.science/~ads/teaching/2020/stats/data data repository for the course] and import the .csv file in the <code>week_04</code> subdirectory with your number (e.g., <code>group_<output>.csv</code>). Note that it is a .csv file, so you'll need to use an appropriate procedure/commands to import it!
::'''Recommended sub-challenge:''' Inspect the dataset directly before you import it. You might download the .csv file and use spreadsheet software (e.g., Google Docs, LibreOffice, Excel, etc.) to do this. I often prefer to look at the first few lines of a new dataset in a "raw" format via the command line or a text editor (e.g., Notepad) so that I can inspect the structure. This can help you figure out how best to import the data into R and clue you in to any immediate data cleanup/tidying steps you'll need to take after import (e.g., do the columns have headers? are numbers/text formatted differently?). I won't ask about this in class, but I do recommend it.
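
Here is a minimal sketch of one way to do this, assuming (hypothetically) that your group number is 4 and that you name the imported data frame <code>d</code> (the sketches in later challenges reuse that name):

<syntaxhighlight lang="r">
## The URL follows the pattern described above; substitute your own number.
data.url <- "https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_04.csv"

## Peek at the first few raw lines before importing (the sub-challenge):
readLines(data.url, n = 5)

## Base R import; read_csv() from the readr package also works.
d <- read.csv(data.url)
</syntaxhighlight>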


===PC2. Explore and describe the data===
Take appropriate steps to gain a basic understanding of this dataset.
* How many columns and rows are there? What classes/types are the variables/columns?
* What summary statistics can you provide for each variable (e.g., what are the range, center, and spread of the continuous variables)?
* Generate univariate tables and visualizations (e.g., boxplots or histograms) to get a sense of what they look like.
If there are additional steps you'd like to take, feel free to do so. A few starter commands appear in the sketch below.
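
For example, a few starting points in base R, assuming your data frame is named <code>d</code> and includes the columns <code>x</code> and <code>k</code> described later in this problem set:

<syntaxhighlight lang="r">
dim(d)                 # number of rows and columns
str(d)                 # class of each column at a glance
summary(d)             # range, center, and quartiles for numeric columns
sd(d$x, na.rm = TRUE)  # spread of one continuous variable
hist(d$x)              # univariate visualization of a continuous variable
table(d$k)             # univariate table of a discrete variable
</syntaxhighlight>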


===PC3. Use and write user-defined functions===
Use the example function <code>my.mean()</code>, distributed in the most recent R tutorial materials, to calculate the mean of the variable (column) named <code>x</code> in your dataset. Now, write your own function to calculate the median of <code>x</code>. Be ready to walk us through how your function works!
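
For reference, here is one possible sketch of a median function written from scratch (your approach may differ; <code>my.median</code> is just an illustrative name):

<syntaxhighlight lang="r">
## Sort the values, then take the middle one (odd length) or the
## mean of the two middle ones (even length). Missing values dropped.
my.median <- function(v) {
  v <- sort(v[!is.na(v)])
  n <- length(v)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]                   # odd: the single middle value
  } else {
    (v[n / 2] + v[n / 2 + 1]) / 2    # even: average of the two middle values
  }
}

my.median(d$x)   # compare against the built-in median(d$x)
</syntaxhighlight>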
===PC4. Compare two vectors===
Load your vector from [[Statistics_and_Statistical_Programming_(Fall_2020)/pset1|Problem Set #1]] (Week 3) again (you might want to give it a new name) and perform the same cleanup steps you did in PC2.5 and PC2.6 last week (recode negative values as missing and log-transform the data). Now, compare the vector <code>x</code> from Problem Set #1 with the first column (<code>x</code>) of the data you imported for this assignment (Problem Set #2, i.e., the current dataset you just imported from a .csv file). They should be similar, but are they ''exactly'' the same? Use R code to show your answer.
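
A sketch of one approach, where <code>w3</code> is a hypothetical stand-in for whatever you named last week's vector (adjust the cleanup to match exactly what you did in Problem Set #1):

<syntaxhighlight lang="r">
## Repeat last week's cleanup steps:
w3[w3 < 0] <- NA    # recode negative values as missing
w3 <- log1p(w3)     # natural log of (1 + value); match what you did last week

## Several ways to compare it against this week's x column:
identical(w3, d$x)   # strictest test: exactly the same object?
all.equal(w3, d$x)   # TRUE, or a summary of the differences
table(w3 == d$x)     # element-wise comparison
</syntaxhighlight>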
===PC5. Cleanup/tidy your data===
Once again, some cleanup and recoding is needed for this week's data. It turns out that the variables <code>i</code> and <code>j</code> are really dichotomous "true/false" variables that have been coded as 0 and 1 in this dataset. Recode these columns as <code>logical</code> (i.e., <code>TRUE</code> or <code>FALSE</code> values). The variable <code>k</code> is really a categorical variable. Recode <code>k</code> as a factor and replace the numbers with the following values or levels: 0="none", 1="some", 2="lots", 3="all". Note that your data file may contain only the values 1, 2, and 3. The goal is to end up with a factor (so the command <code>class(k)</code> should return <code>"factor"</code>) where those text strings are the levels of the factor.
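
A sketch under the assumptions above (data frame named <code>d</code>; drop the 0/"none" pair if your file contains no zeros):

<syntaxhighlight lang="r">
## Recode the 0/1 indicators as logical vectors:
d$i <- as.logical(d$i)
d$j <- as.logical(d$j)

## Recode k as a factor with labeled levels:
d$k <- factor(d$k,
              levels = c(0, 1, 2, 3),
              labels = c("none", "some", "lots", "all"))

class(d$k)    # should now return "factor"
levels(d$k)   # the text strings above
</syntaxhighlight>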
===PC6. Calculate conditional summary statistics===
It's common to consider the conditional distributions of a continuous variable within the levels of a second, categorical variable. Please describe the distribution of <code>x</code> within each of the four levels of <code>k</code>. For each level of <code>k</code> calculate the mean and standard deviation of <code>x</code>.
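
One compact way to do this in base R, again assuming a data frame named <code>d</code> with the recoded <code>k</code> from PC5:

<syntaxhighlight lang="r">
## Mean and standard deviation of x within each level of k:
tapply(d$x, d$k, mean, na.rm = TRUE)
tapply(d$x, d$k, sd, na.rm = TRUE)
</syntaxhighlight>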
 
===PC7. Create a bivariate table===
Now that you have some categorical variables to work with, let's go ahead and create a bivariate table so that you can examine the distributions of some of these values. Use the <code>table()</code> command to create a cross-tabulation of the recoded versions of the <code>k</code> variable and the <code>j</code> variable.  
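
For example, assuming the data frame <code>d</code> and the recoded columns from PC5:

<syntaxhighlight lang="r">
## Cross-tabulation of the two recoded categorical variables:
table(d$k, d$j)
</syntaxhighlight>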
 
===PC8. Create a bivariate visualization===
Visualize two variables in the Problem Set #2 dataset using <code>ggplot2</code> and the <code>geom_point()</code> function to produce a scatterplot of <code>x</code> on the x-axis and <code>y</code> on the y-axis. '''Optional bonus:''' Incorporate any of the other variables on other dimensions (e.g., color, shape, and/or size are all good options). If you run into any issues plotting these dimensions, revisit the examples in the tutorial and the ggplot2 documentation and consider that ggplot2 can be very picky about the classes of objects.
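
A minimal sketch, again assuming the data frame is named <code>d</code>:

<syntaxhighlight lang="r">
library(ggplot2)

ggplot(data = d, aes(x = x, y = y)) +
  geom_point()

## Bonus sketch: map additional variables onto other aesthetics, e.g.:
## ggplot(d, aes(x = x, y = y, color = k, shape = j)) + geom_point()
</syntaxhighlight>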


== Statistical Questions ==


===SQ1. Interpret bivariate analyses===
 
Return to the dataset you imported and worked with in the programming challenges above. Imagine that it comes from a year-long study of bicyclists using a combination of survey and ride-tracking data from Divvy bikeshare members in the Chicagoland area, conducted a few years ago (let's say 2018, just to pick a year). Each row in the data corresponds to a single Divvy cyclist/member and the variables correspond to the following measures:
* <code>x</code>: Average daily distance cycled (in miles) measured via bicycle dock check-in/check-out data. 
* <code>j</code>: An indicator (True/False) of whether any rides were recorded between January and March.
* <code>l</code>: An indicator (True/False) of whether the cyclist also uses vehicle rideshare provided by Lyft (the company that owns Divvy).
* <code>k</code>: A measure of how frequently the cyclist rode in bad weather, with bad weather defined using a standard measure provided by the U.S. NOAA (National Oceanic and Atmospheric Administration) and the categories (none, some, lots, all) defined in terms of empirical quartiles within the dataset.
* <code>y</code>: A continuous measure of income calculated in tens of thousands of dollars and scaled so that "0" = average income for a Divvy member (i.e., a value of "5" = $50,000 more per year than an average Divvy member).
 
# Return to the conditional means you created in PC6 above. Given the information you now have about the study, how would you interpret them? Does there seem to be any sort of relationship between the two variables?
# Return to the bivariate contingency table you created in PC7 above. Given the information you now have about the study, how would you interpret it? Does there seem to be any sort of relationship between the two variables?
# Return to the scatterplot you created in PC8 above. Given the information you now have about the study, how would you interpret it? Does there seem to be any sort of relationship between the two variables?
 
===SQ2. Birthdays revisited (Optional bonus!)===
 
'''Optional bonus statistical question'''
 
''We talked about birthdays in the context of one of the textbook exercises for ''OpenIntro'' Chapter 3. Here's an opportunity to apply your knowledge and extend that exercise. Note that you can absolutely use R to help calculate the solutions to both parts of this problem. That said, it's a super famous problem and answers/examples are all over the internet, so if you want to challenge yourself, don't look at them while you're working on it! The only hint I'll give you is that you may find [https://en.wikipedia.org/wiki/Binomial_coefficient binomial coefficients] useful and the <code>choose()</code> function can calculate them for you in R.''
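
To make the hint concrete without giving anything else away, here is how <code>choose()</code> works in R:

<syntaxhighlight lang="r">
## choose(n, k) computes the binomial coefficient "n choose k".
## For example, the number of distinct pairs in a 25-person class:
choose(25, 2)   # 300
</syntaxhighlight>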
 
# The first time I taught this course, there were 25 people in it (including the members of the teaching team). Imagine that I offered you a choice between two bets: Bet #1 is determined by the flip of a fair coin. You can choose heads or tails and you win the bet if your choice turns out to be correct. Bet #2 is determined by whether any two members of that previous version of the class shared a birthday. If a birthday was shared, I win the bet, and if no birthdays were shared, you win the bet. Assuming you want the best chance of winning, which bet should you choose?
# Now calculate the probability that any two members of our 7-person class share a birthday and compare this probability with the results of SQ2.1 above.
 
== Empirical Paper Questions: Emotional contagion in social networks ==
 
Refer to the following (controversial! highly cited! moderately straightforward!) paper to answer the questions below. Please be prepared to identify specific parts of the paper that support your answers. Note that several of the questions below correspond loosely to the questions I have asked you to answer with respect to your [[Statistics_and_Statistical_Programming_(Fall_2020)#Research_project_plan_and_dataset_identification|research project plan and dataset identification assignment]] due later this week (that's called "scaffolding" for those of you keeping score at home).
 
:Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [http://www.pnas.org/content/111/24/8788.full Open access]
 
=== EQ1: Research questions and objectives ===
Restate or describe, in your own words, (a) the main research question of the paper and be sure to identify (b) the population of interest ("target population").
 
=== EQ2: Sample and experiment design ===
Describe (a) the sample used in the study; (b) the treatment and control groups, and (c) the experimental manipulation(s).
 
=== EQ3: Data and variables ===
Describe (a) the unit of analysis or cases, and (b) the main variables used and their "types" (e.g., continuous, categorical, etc.; see ''OpenIntro'' chapter 1 for ideas).


=== EQ4: Results ===
Summarize the results of the study. There is one figure in the paper (Figure 1). Explain how the figure represents the results.


=== EQ5: Interpretation and contribution (significance) ===
(a) Summarize the authors' interpretation of the study results. (b) Discuss whether the results generalize from the sample to the target population. (c) Summarize the core contribution of the paper.
