Editing Statistics and Statistical Programming (Fall 2020)/pset2
From CommunityData
Latest revision | Your text | ||
Line 1: | Line 1: | ||
<small>[[Statistics_and_Statistical_Programming_(Fall_2020)#Week_4_.2810.2F6.2C_10.2F8.29|← Back to Week 4]]</small> | <small>[[Statistics_and_Statistical_Programming_(Fall_2020)#Week_4_.2810.2F6.2C_10.2F8.29|← Back to Week 4]]</small> | ||
For this problem set, the programming challenges focus on some of the more advanced fundamentals of R, including some of the new types of data import, transformation, tidying, and visualization introduced in the most recent R tutorial. These are followed by some questions about an empirical paper that focus on applying some of the concepts from the first few chapters of ''OpenIntro'' to a research context that | For this problem set, the programming challenges focus on some of the more advanced fundamentals of R, including some of the new types of data import, transformation, tidying, and visualization introduced in the most recent R tutorial materials. These are followed by some questions about an empirical paper that focus on applying some of the concepts from the first few chapters of ''OpenIntro'' to a research context that will likely be familiar. | ||
== Programming Challenges == | == Programming Challenges == | ||
The programming challenges below ask you to perform a series of fairly typical data import, exploration, tidying, and descriptive analysis steps. Once again, you'll work with some "fake" data that Aaron created to ensure consistency and illustrate some useful points. The most recent R tutorials and problem set worked solutions contain example code that should help you do almost everything asked of you here. From this point forward, | The programming challenges below ask you to perform a series of fairly typical data import, exploration, tidying, and descriptive analysis steps. Once again, you'll work with some "fake" data that Aaron created to ensure consistency and illustrate some useful points. The most recent R tutorials and problem set worked solutions contain example code that should help you do almost everything asked of you here. From this point forward, I will start to assume that you have become familiar with some of the more basic fundamental skills (e.g., creating your R Markdown script or notebook) and that you have some ideas of where to turn for help and more information when you need it. That said, you should always seek whatever help you need at any time, whether online, from your peers, or the teaching team. | ||
''Note: if you have trouble accessing or importing your dataset, please reach out for help ASAP as you will only be able to do the other challenges once you've done that one.'' | ''Note: if you have trouble accessing or importing your dataset, please reach out for help ASAP as you will only be able to do the other challenges once you've done that one.'' | ||
Line 20: | Line 18: | ||
===PC2. Explore and describe the data=== | ===PC2. Explore and describe the data=== | ||
Take appropriate steps to gain a basic understanding of this dataset. | Take appropriate steps to gain a basic understanding of this dataset. How many columns and rows are there? What classes/types are the variables/columns? What appropriate summary statistics can you provide for each variable (e.g., the range, center, and spread of the continuous variables)? Generate univariate tables and visualizations (e.g., boxplots or histograms) to get a sense of what the variables look like. | ||
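As one possible starting point, the sketch below runs the usual base R inspection functions on a small made-up data frame. The object name <code>d</code> and its values are hypothetical stand-ins, not the course dataset.

```r
# Hypothetical toy data frame standing in for the imported dataset
d <- data.frame(
  x = c(1.2, 3.4, 2.2, 5.1, 4.0),
  j = c(0, 1, 1, 0, 1),
  k = c(0, 2, 3, 1, 2)
)

dims <- dim(d)                  # number of rows, then number of columns
classes <- sapply(d, class)     # class of each column
x_range <- range(d$x)           # minimum and maximum of x
x_center <- c(mean = mean(d$x), median = median(d$x))
x_spread <- sd(d$x)             # standard deviation of x

summary(d)                      # per-column quartiles, mean, min, and max
table(d$k)                      # univariate frequency table
hist(d$x)                       # histogram of a continuous variable
boxplot(d$x)                    # boxplot of the same variable
```

Which summaries are appropriate depends on the class of each column, so it is worth checking <code>classes</code> before deciding what to report.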
===PC3. Use and write user-defined functions === | ===PC3. Use and write user-defined functions === | ||
Use the example function <code>my.mean()</code>, distributed in the most recent R tutorial materials, to calculate the mean of the variable (column) named <code>x</code> in your dataset. Now, write your own function to calculate the median of <code>x</code>. Be ready to walk us through how your function works! | Use the example function <code>my.mean()</code>, distributed in the most recent R tutorial materials, to calculate the mean of the variable (column) named <code>x</code> in your dataset. Now, write your own function to calculate the median of <code>x</code>. Be ready to walk us through how your function works! | ||
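For reference, here is a minimal sketch of what a hand-written median function might look like. The name <code>my.median()</code> and the decision to drop missing values are assumptions made for illustration; they do not come from the tutorial materials, and other designs are equally reasonable.

```r
# A hand-written median, assuming a numeric vector as input.
# The name my.median() and the handling of NA values are illustrative assumptions.
my.median <- function(v) {
  v <- sort(v[!is.na(v)])        # drop missing values, then sort ascending
  n <- length(v)
  if (n == 0) {
    return(NA_real_)             # nothing left to summarize
  }
  middle <- (n + 1) / 2
  # For odd n, floor and ceiling pick the same element;
  # for even n, they pick the two central elements, which get averaged.
  mean(v[c(floor(middle), ceiling(middle))])
}

my.median(c(5, 1, 3))            # odd number of values
my.median(c(4, 1, 3, 2, NA))     # even number of values once NA is dropped
```

A simple way to check a function like this is to compare its result with the built-in <code>median()</code> (using <code>na.rm = TRUE</code>) on a few test vectors.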
===PC4. | ===PC4. Replicate the data import and cleanup from Problem Set #1=== | ||
Load your vector from | Load your vector from Week 2 again and perform the same cleanup steps you did in PC6 and PC7 last week (recode negative values as missing and log-transform the data). | ||
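The two cleanup steps can be sketched on a made-up vector as follows. The vector <code>v</code> and its values are hypothetical; substitute your own Week 2 vector.

```r
# Hypothetical vector standing in for the Week 2 data
v <- c(2.5, -1, 10, 0.5, -3.2)

v[v < 0] <- NA          # recode negative values as missing
v_log <- log(v)         # natural log; missing values stay missing

sum(is.na(v))           # how many values are now missing
summary(v_log)          # summary of the transformed values
```

Note that zero is not negative, so any zeros would survive the recoding and <code>log()</code> would return <code>-Inf</code> for them.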
===PC5. | ===PC5. Compare two vectors=== | ||
Compare the vector <code>x</code> from Problem Set #1 with the first column (<code>x</code>) of the data you imported for Problem Set #2 (the current dataset you just imported from a .csv file). They should be similar, but are they really the same? Write R code to demonstrate or support your answer. | |||
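A few base R tools for this kind of comparison are sketched below on two made-up vectors. The names <code>a</code> and <code>b</code> and their values are hypothetical, and the way these two vectors differ is only an example, not a claim about the course data.

```r
# Two hypothetical vectors that can look alike when printed with few digits
a <- c(1.123456, 2.5, 3.75)
b <- round(a, 2)

identical(a, b)                            # strict test: exactly the same values?
isTRUE(all.equal(a, b))                    # equal within a very small numeric tolerance?
a == b                                     # element-by-element comparison
max(abs(a - b))                            # largest absolute difference
isTRUE(all.equal(a, b, tolerance = 0.01))  # equal within a looser tolerance?
```

Using more than one of these checks helps distinguish "exactly the same" from "the same up to some precision".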
===PC6. | ===PC6. Clean up/tidy your data=== | ||
Cleaning and recoding data is a very common step when you import data and prepare it for analysis. Some of that is needed here. It turns out that the variables <code>i</code> and <code>j</code> are really dichotomous "true/false" variables that have been coded as 0 and 1 in this dataset. Recode these columns as <code>logical</code> (i.e., "TRUE" or "FALSE" values). The variable <code>k</code> is really a categorical variable. Recode this as a factor and change the numbers so that they are replaced with the following values or levels: 0="none", 1="some", 2="lots", 3="all". The goal is to end up with a factor where those text strings are the levels of the factor. | |||
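The recoding described above might look something like the following sketch, which uses a made-up data frame. The object name <code>d</code> and its values are hypothetical; only the column names and level labels come from the text.

```r
# Hypothetical toy data frame with the same kinds of columns described above
d <- data.frame(
  i = c(0, 1, 1, 0),
  j = c(1, 1, 0, 0),
  k = c(0, 3, 1, 2)
)

d$i <- as.logical(d$i)   # 0 becomes FALSE, 1 becomes TRUE
d$j <- as.logical(d$j)

# levels lists the codes found in the data; labels names them, in the same order
d$k <- factor(d$k, levels = c(0, 1, 2, 3),
              labels = c("none", "some", "lots", "all"))

str(d)                   # confirm the new classes
levels(d$k)              # confirm the factor levels
```

Passing both <code>levels</code> and <code>labels</code> to <code>factor()</code> keeps the mapping from codes to text explicit, which makes it easier to verify.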
=== | ===PC6. Create a bivariate table=== | ||
Now that you have some categorical variables to work with, let's go ahead and create a bivariate table so that you can examine the distributions of some of these values. Use the <code>table()</code> command to create a cross-tabulation of the recoded versions of the <code>k</code> variable and the <code>j</code> variable. | Now that you have some categorical variables to work with, let's go ahead and create a bivariate table so that you can examine the distributions of some of these values. Use the <code>table()</code> command to create a cross-tabulation of the recoded versions of the <code>k</code> variable and the <code>j</code> variable. | ||
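A cross-tabulation with <code>table()</code> can be sketched like this. The data frame <code>d</code> is a hypothetical stand-in that already contains recoded versions of <code>j</code> and <code>k</code>.

```r
# Hypothetical data with j already logical and k already a factor
d <- data.frame(
  j = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  k = factor(c("none", "all", "some", "lots", "all"),
             levels = c("none", "some", "lots", "all"))
)

kj <- table(k = d$k, j = d$j)   # rows are levels of k, columns are values of j
kj
prop.table(kj, margin = 1)      # proportions within each row
```

Naming the arguments (<code>k =</code>, <code>j =</code>) labels the dimensions of the printed table, which makes the output easier to read.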
=== | ===PC7. Create a bivariate visualization=== | ||
Visualize two variables in the Problem Set #2 dataset using <code>ggplot2</code> and the <code>geom_point()</code> function to produce a scatterplot | Visualize two variables in the Problem Set #2 dataset using <code>ggplot2</code> and the <code>geom_point()</code> function to produce a scatterplot. First, plot <code>x</code> on the x-axis and <code>y</code> on the y-axis. Second, visualize the other variables on other dimensions (e.g., color, shape, and size seem reasonable). If you run into any issues plotting these dimensions, revisit the examples in the tutorial and the ggplot2 documentation and consider that ggplot2 can be very picky about the classes of objects... | ||
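A scatterplot along these lines might be sketched as follows, assuming the <code>ggplot2</code> package is installed. The data frame <code>d</code> and its values are made up, and mapping <code>k</code> to color and <code>j</code> to shape is just one of several reasonable choices.

```r
library(ggplot2)

# Hypothetical data with the same column names and recoded classes
d <- data.frame(
  x = c(1.0, 2.5, 3.1, 4.8, 5.2, 6.0),
  y = c(2.1, 2.9, 4.2, 4.0, 6.3, 5.8),
  j = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  k = factor(c("none", "some", "lots", "all", "some", "lots"),
             levels = c("none", "some", "lots", "all"))
)

# x and y on the axes; the categorical variables on color and shape
p <- ggplot(d, aes(x = x, y = y, color = k, shape = j)) +
  geom_point(size = 3)

print(p)
```

Color and shape expect discrete variables such as factors or logicals, which is one reason the class of each column matters here.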
== Statistical Questions == | == Statistical Questions == | ||
===SQ1 | ===SQ1=== | ||
== Empirical Paper Questions: Emotional contagion in social networks == | == Empirical Paper Questions: Emotional contagion in social networks == |