Statistics and Statistical Programming (Winter 2021)/Problem set 5

For this problem set, the programming challenges focus on some of the more advanced fundamentals of R, including some of the new types of data import, transformation, tidying, and visualization introduced in the most recent R tutorial.


The programming challenges below ask you to perform a series of fairly typical data import, exploration, tidying, and descriptive analysis steps. Once again, you'll work with some "fake" data that Mako created to ensure consistency and illustrate some useful points. The most recent R tutorials and problem set worked solutions contain example code that should help you do everything asked of you here. From this point forward, we will assume that you have become familiar with the more basic skills (e.g., creating your R Markdown script or notebook) and that you have some idea of where to turn for help and more information when you need it. That said, you should always seek whatever help you need at any time, whether online in Discord, from your peers, or from me.


'''Note:''' If you have trouble accessing or importing your dataset, please reach out for help ASAP, as you will only be able to do the other challenges once you've done that one!
=== PC1. Import data from a .csv file===


Revisit your problem set code from [[../Problem set 4]] and recall what group number you were in (it should be an integer from 1 to 20). Hopefully it's recorded in your notebook! If not, generate a new one and make sure it's recorded this time!
Revisit your problem set code from [[../Problem set 4]] and recall what group number you were in (it should be an integer from 1 to 20). Navigate to and import the .csv file in the <code>week_04</code> subdirectory with your number (e.g., <code>group_<output>.csv</code>). Note that it is a .csv file and you'll need to use an appropriate procedure/commands to import it!
 
::'''Recommended sub-challenge:''' Inspect the dataset directly before you import. You might download the .csv file and use spreadsheet software (e.g., Google docs, LibreOffice, Excel, etc.) to do this. I often prefer to look at the first few lines of a new dataset in a "raw" format via the command line or a text editor (e.g., Notepad) so that I can inspect the structure. This can help you figure out how best to import the data into R and clue you in to any immediate data cleanup/tidying steps you'll need to take after import (e.g., do the columns have headers? are numbers/text formatted differently?). I won't ask about this in class, but I do recommend it.
Navigate to and import the .csv file in the <code>datasets/problem_set_5</code> subdirectory in the class Dropbox folder with your number (e.g., <code>group_<output>.csv</code>). Note that it is a .csv file and you'll need to use an appropriate procedure/commands to import it!

:'''Recommended sub-challenge:''' Inspect the dataset directly before you import. You might download the .csv file and use spreadsheet software (e.g., Google docs, LibreOffice, Excel, etc.) to do this. I often prefer to look at the first few lines of a new dataset in a "raw" format via the command line or a text editor (e.g., Notepad) so that I can inspect the structure. This can help you figure out how best to import the data into R and clue you in to any immediate data cleanup/tidying steps you'll need to take after import (e.g., do the columns have headers? are numbers/text formatted differently?). I won't ask about this in class, but I do recommend it for reasons I describe in the tutorial.
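
The inspect-then-import workflow described above might look something like the following minimal sketch. It assumes a comma-separated file with a header row. The file it writes is a made-up stand-in (the column names and values are invented for illustration, and <code>d5</code> is just a hypothetical object name), so point <code>csv_path</code> at your own group's file instead.

```r
# Create a small stand-in .csv so this example runs anywhere.
# (Hypothetical contents; your real file will differ.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,j,k,y",
             "1.5,0,1,2.25",
             "3.2,1,3,-0.75",
             "0.8,1,0,4.10"),
           csv_path)

# Peek at the raw text before importing: is there a header row?
# How are the values separated?
readLines(csv_path, n = 3)

# Import with base R. header = TRUE is already the default for
# read.csv(); it is spelled out here for clarity.
d5 <- read.csv(csv_path, header = TRUE)

# Check what R made of each column (classes, first few rows).
str(d5)
head(d5)
```

If the raw lines show no header row or a different separator, the <code>header</code> and <code>sep</code> arguments of <code>read.csv()</code> are the first things to adjust.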


===PC2. Explore and describe the data===


===PC4. Compare two vectors===
Load your vector from [[../Problem set 4]] again (you might want to give it a new name so it's not called <code>d</code>) and perform the same cleanup steps you did in PC2.5 and PC2.6 last week (recode negative values as missing and log-transform the data). Now, compare the vector <code>x</code> from Problem Set #1 with the first column (<code>x</code>) of the data you imported for this assignment (Problem Set #2, i.e., the current dataset you just imported from a .csv file). They should be similar, but are they ''exactly'' the same? Use R code to show your answer.
Load your vector from [[../Problem set 4]] again (you might want to give it a new name) and perform the same cleanup steps you did in PC2.5 and PC2.6 last week (recode negative values as missing and log-transform the data). Now, compare the vector <code>x</code> from Problem Set #1 with the first column (<code>x</code>) of the data you imported for this assignment (Problem Set #2, i.e., the current dataset you just imported from a .csv file). They should be similar, but are they ''exactly'' the same? Use R code to show your answer.
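
As a hedged illustration of the difference between "similar" and "exactly the same," here is a sketch using two made-up vectors (not the course data). The names <code>x_old</code> and <code>x_new</code> are hypothetical, and the rounding step only imitates one way two copies of the same values could end up differing.

```r
# Made-up stand-ins for the two vectors (not the course data).
x_old <- c(1.234567891, 2.5, NA, 4.5)
x_new <- round(x_old, 2)   # pretend one copy was stored with fewer decimals

# Strict test: TRUE only if the two objects are exactly the same.
identical(x_old, x_new)

# Tolerant test: treats differences smaller than the tolerance as equal.
isTRUE(all.equal(x_old, x_new, tolerance = 0.01))

# Element-wise comparison; comparing with NA yields NA, so count those too.
table(x_old == x_new, useNA = "ifany")

# Size of the largest discrepancy, ignoring missing values.
max(abs(x_old - x_new), na.rm = TRUE)
```

Which of these checks is the right one depends on what you mean by "the same"; the strict and tolerant tests can disagree, as they do here.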


===PC5. Cleanup/tidy your data===
===PC7. Create a bivariate table===
Now that you have some categorical variables to work with, let's go ahead and create a bivariate table so that you can examine the distributions of some of these values. Use the <code>table()</code> command to create a cross-tabulation of the recoded versions of the <code>k</code> variable and the <code>j</code> variable.
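
A minimal sketch of a cross-tabulation with <code>table()</code>, using made-up values rather than the course data. It assumes <code>j</code> has been recoded as a logical and <code>k</code> as a factor; the object name <code>kj_table</code> is hypothetical.

```r
# Made-up stand-ins for the recoded variables (not the course data).
j <- as.logical(c(1, 0, 1, 1, 0, 1))
k <- factor(c(0, 1, 2, 3, 1, 1),
            levels = 0:3,
            labels = c("none", "some", "a lot", "all"))

# Cross-tabulate: rows are the levels of k, columns are the values of j.
kj_table <- table(k, j)
kj_table

# Row proportions (margin = 1) can be easier to compare than raw counts.
prop.table(kj_table, margin = 1)
```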
===PC8. Create a bivariate visualization===
Visualize two variables in the Problem Set #2 dataset using <code>ggplot2</code> and the <code>geom_point()</code> function to produce a scatterplot of <code>x</code> on the x-axis and <code>y</code> on the y-axis. '''Optional bonus:''' Incorporate any of the other variables on other dimensions (e.g., color, shape, and/or size are all good options). If you run into any issues plotting these dimensions, revisit the examples in the tutorial and the ggplot2 documentation and consider that ggplot2 can be very picky about the classes of objects.
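
One way such a scatterplot could be put together is sketched below with simulated stand-in data (not the course data). It assumes the <code>ggplot2</code> package is installed; the data frame name <code>d5</code> and the choice to map <code>j</code> to color are illustrative only.

```r
library(ggplot2)

# Simulated stand-in data (not the course data).
set.seed(42)
d5 <- data.frame(
  x = rexp(50, rate = 0.5),
  y = rnorm(50),
  j = sample(c(TRUE, FALSE), 50, replace = TRUE)
)

# Scatterplot of x (x-axis) against y (y-axis).
# Mapping a logical variable to color gives one color per value.
p <- ggplot(d5, aes(x = x, y = y, color = j)) +
  geom_point()
p
```

If an aesthetic does not behave as expected, checking <code>class()</code> of the variable mapped to it is a good first step, since numeric, logical, and factor variables are treated differently.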


== Statistical Questions ==
===SQ1. Interpret bivariate analyses===


Return to the dataset you imported and worked with in the programming challenges above. Imagine that it comes from a year-long study of bicyclists using a combination of survey and ride-tracking data from Seattle JUMP bikeshare users conducted a few years ago (let's say 2018, just to pick a year). Each row in the data corresponds to a single cyclist/member and the variables correspond to the following measures:  
Return to the dataset you imported and worked with in the programming challenges above. Imagine that it comes from a year-long study of bicyclists using a combination of survey and ride-tracking data from the Divvy bikeshare members in the Chicagoland area conducted a few years ago (let's say 2018, just to pick a year). Each row in the data corresponds to a single Divvy cyclist/member and the variables correspond to the following measures:  
* <code>x</code>: Average daily distance cycled (in miles) measured via bicycle dock check-in/check-out data.   
* <code>j</code>: An indicator (True/False) of whether any rides were recorded between January and March.
* <code>l</code>: An indicator (True/False) of whether the cyclist also uses vehicle rideshare provided by Uber (the company that owns JUMP).
* <code>l</code>: An indicator (True/False) of whether the cyclist also uses vehicle rideshare provided by Lyft (the company that owns Divvy).
* <code>k</code>: A measure of how frequently the cyclist rode in bad weather, with bad weather defined using a standard measure provided by the U.S. NOAA (National Oceanic and Atmospheric Administration) and the categories (none, some, a lot, all) defined in terms of empirical quartiles within the dataset.
* <code>y</code>: A continuous measure of income calculated in tens of thousands of dollars and scaled so that "0" = average income for a JUMP user (i.e., a value of "5" = $50,000 more per year than an average JUMP user).
* <code>y</code>: A continuous measure of income calculated in tens of thousands of dollars and scaled so that "0" = average income for a Divvy member (i.e., a value of "5" = $50,000 more per year than an average Divvy member).


# Return to the conditional means you created in PC6 above. Given the information you now have about the study, how would you interpret them? Does there seem to be any sort of relationship between the two variables?
'''Optional bonus statistical question'''


You did a question about birthdays in the context of one of the textbook exercises for ''OpenIntro'' Chapter 3. Here's an opportunity to apply your knowledge and extend that exercise. Note that you can absolutely use R to help calculate the solutions to both parts of this problem. That said, it's a super famous problem and answers/examples are all over the internet, so if you want to challenge yourself, don't look at them while you're working on it! The only hint I'll give you is that you may find [https://en.wikipedia.org/wiki/Binomial_coefficient binomial coefficients] useful and the <code>choose()</code> function can calculate them for you in R.
''We talked about birthdays in the context of one of the textbook exercises for ''OpenIntro'' Chapter 3. Here's an opportunity to apply your knowledge and extend that exercise. Note that you can absolutely use R to help calculate the solutions to both parts of this problem. That said, it's a super famous problem and answers/examples are all over the internet, so if you want to challenge yourself, don't look at them while you're working on it! The only hint I'll give you is that you may find [https://en.wikipedia.org/wiki/Binomial_coefficient binomial coefficients] useful and the <code>choose()</code> function can calculate them for you in R.''
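
As a small illustration of the hint (not a solution to the problem), <code>choose(n, k)</code> returns the number of ways to pick <code>k</code> items out of <code>n</code> when order does not matter. The sketch below only shows how the function behaves; how to use it in the birthday problem is left to you.

```r
# Number of distinct pairs that can be formed from 25 people.
choose(25, 2)

# The same count from the factorial definition of a binomial coefficient:
# n! / (k! * (n - k)!)
factorial(25) / (factorial(2) * factorial(23))

# choose() is vectorized, so several group sizes can be checked at once.
choose(c(5, 7, 25), 2)
```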


# Imagine that there were 25 people in this class and that I offered you a choice between two bets: Bet #1 is determined by the flip of a fair coin (you choose heads or tails and you win the bet if your choice turns out to be correct). Bet #2 is determined by whether any two members of that class shared a birthday. If a birthday was shared, I win the bet; if no birthdays were shared, you win the bet. Assuming you want the best chance of winning, which bet should you choose?
# The first time I taught this course, there were 25 people in it (including the members of the teaching team). Imagine that I offered you a choice between two bets: Bet #1 is determined by the flip of a fair coin (you choose heads or tails and you win the bet if your choice turns out to be correct). Bet #2 is determined by whether any two members of that previous version of the class shared a birthday. If a birthday was shared, I win the bet; if no birthdays were shared, you win the bet. Assuming you want the best chance of winning, which bet should you choose?
# Now calculate the probability that any two members of our 5-person class share a birthday and compare this probability with the results of SQ2.1 above.
# Now calculate the probability that any two members of our 7-person class share a birthday and compare this probability with the results of SQ2.1 above.