Statistics and Statistical Programming (Fall 2020)/pset2
== Programming Challenges ==

The programming challenges below ask you to perform a series of fairly typical data import, exploration, tidying, and descriptive analysis steps. Once again, you'll work with some "fake" data that Aaron created to ensure consistency and illustrate some useful points. The most recent R tutorials and problem set worked solutions contain example code that should help you do almost everything asked of you here. From this point forward, we will assume that you have become familiar with some of the more basic skills (e.g., creating your R Markdown script or notebook) and that you have some idea of where to turn for help and more information when you need it. That said, you should always seek whatever help you need at any time, whether online, from your peers, or from the teaching team.

''Note: if you have trouble accessing or importing your dataset, please reach out for help ASAP, as you can only do the other challenges once you've completed the import.''

=== PC0. Get started ===

Create and set up the metadata for a new R Markdown script or notebook for this week's problem set (as usual). Make sure to confirm that R has the working directory location that you want.

=== PC1. Import data from a .csv file ===

Revisit your problem set code from [[Statistics_and_Statistical_Programming_(Fall_2020)/pset1|Problem Set #1]] and recall what group number you were in (it should be an integer between 1 and 20). Navigate to the [https://communitydata.science/~ads/teaching/2020/stats/data data repository for the course] and import the .csv file in the <code>week_04</code> subdirectory with your number (e.g., <code>group_<output>.csv</code>). Note that it is a .csv file, so you'll need to use an appropriate procedure or command to import it!

::'''Recommended sub-challenge:''' Inspect the dataset directly before you import it. You might download the .csv file and use spreadsheet software (e.g., Google Sheets, LibreOffice, Excel, etc.) to do this.
::I often prefer to look at the first few lines of a new dataset in a "raw" format via the command line or a text editor (e.g., Notepad) so that I can inspect the structure. This can help you figure out how best to import the data into R and clue you in to any immediate data cleanup/tidying steps you'll need to take after import (e.g., do the columns have headers? are numbers/text formatted differently?). I won't ask about this in class, but I do recommend it.

=== PC2. Explore and describe the data ===

Take appropriate steps to gain a basic understanding of this dataset.
* How many columns and rows are there? What classes/types are the variables/columns?
* What appropriate summary statistics can you provide for each variable (e.g., what are the range, center, and spread of the continuous variables)?
* Generate univariate tables and visualizations (e.g., boxplots or histograms) to get a sense of what the variables look like.

If there are additional steps you'd like to take, feel free to do so.

=== PC3. Use and write user-defined functions ===

Use the example function <code>my.mean()</code>, distributed in the most recent R tutorial materials, to calculate the mean of the variable (column) named <code>x</code> in your dataset. Now, write your own function to calculate the median of <code>x</code>. Be ready to walk us through how your function works!

=== PC4. Compare two vectors ===

Load your vector from [[Statistics_and_Statistical_Programming_(Fall_2020)/pset1|Problem Set #1]] (Week 3) again (you might want to give it a new name) and perform the same cleanup steps you did in PC2.5 and PC2.6 last week (recode negative values as missing and log-transform the data). Now, compare the vector <code>x</code> from Problem Set #1 with the first column (<code>x</code>) of the data you imported for this assignment (Problem Set #2, i.e., the current dataset you just imported from a .csv file). They should be similar, but are they ''exactly'' the same? Use R code to show your answer.

=== PC5. Clean up/tidy your data ===

Once again, some cleanup and recoding is needed for this week's data. It turns out that the variables <code>i</code> and <code>j</code> are really dichotomous "true/false" variables that have been coded as 0 and 1 in this dataset. Recode these columns as <code>logical</code> (i.e., "TRUE" or "FALSE" values). The variable <code>k</code> is really a categorical variable. Recode <code>k</code> as a factor and change the numbers so that they are replaced with the following values or levels: 0="none", 1="some", 2="lots", 3="all".
* Your data file may contain only the values 1, 2, and 3.
The goal is to end up with a factor (so the command <code>class(k)</code> should return the value <code>"factor"</code>) where those text strings are the levels of the factor.

=== PC6. Calculate conditional summary statistics ===

It's common to consider the conditional distributions of a continuous variable within the levels of a second, categorical variable. Please describe the distribution of <code>x</code> within each of the four levels of <code>k</code>. For each level of <code>k</code>, calculate the mean and standard deviation of <code>x</code>.

=== PC7. Create a bivariate table ===

Now that you have some categorical variables to work with, let's go ahead and create a bivariate table so that you can examine the distributions of some of these values. Use the <code>table()</code> command to create a cross-tabulation of the recoded versions of the <code>k</code> variable and the <code>j</code> variable.

=== PC8. Create a bivariate visualization ===

Visualize two variables in the Problem Set #2 dataset using <code>ggplot2</code> and the <code>geom_point()</code> function to produce a scatterplot of <code>x</code> on the x-axis and <code>y</code> on the y-axis.

'''Optional bonus:''' Incorporate any of the other variables on other dimensions (e.g., color, shape, and/or size are all good options).
If you run into any issues plotting these dimensions, revisit the examples in the tutorial and the ggplot2 documentation and consider that ggplot2 can be very picky about the classes of objects.
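
To illustrate the point about classes, here is a minimal sketch that uses a small made-up data frame, not the course data. The object name <code>toy</code>, its columns (<code>a</code>, <code>b</code>, <code>flag</code>, <code>size</code>), and the level labels are all hypothetical and chosen only for this example. The idea is to check a column's class with <code>class()</code> and convert numeric codes to <code>logical</code> or <code>factor</code> before mapping them to discrete aesthetics such as shape or color. ggplot2 will generally refuse to map a continuous (numeric) variable to shape, so the conversion matters. The plotting step only runs if ggplot2 is installed.

```r
# Hypothetical toy data (made-up values; NOT the course dataset).
toy <- data.frame(
  a    = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9),
  b    = c(2.0, 6.1, 4.3, 9.8, 8.2, 5.5),
  flag = c(0, 1, 1, 0, 1, 0),   # numeric codes for a true/false variable
  size = c(1, 2, 3, 1, 2, 3)    # numeric codes for a categorical variable
)

# As imported, both coded columns are numeric, so ggplot2 would treat
# them as continuous values.
class(toy$flag)
class(toy$size)

# Convert the codes to classes that match what the variables really are.
toy$flag <- as.logical(toy$flag)
toy$size <- factor(toy$size, levels = 1:3, labels = c("low", "mid", "high"))

# Confirm the conversion worked before plotting.
class(toy$flag)
class(toy$size)
levels(toy$size)

# Draw the plot only if ggplot2 is available on this machine.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Discrete variables (factor, logical) can be mapped to color and shape.
  p <- ggplot(toy, aes(x = a, y = b, color = size, shape = flag)) +
    geom_point()
  print(p)
}
```

If a plot still misbehaves, running <code>class()</code> or <code>str()</code> on the columns you mapped is usually a quick way to see whether a variable needs to be converted first.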