Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 3
From CommunityData
Please note: if you have trouble loading your dataset ('''PC2''' below), contact Jeremy or me ASAP, since you can only do the other challenges once you've completed that one.

== Programming Challenges ==

:'''PC0.''' Create a new project and RMarkdown script for this week's problem set (as usual).
:'''PC1.''' Revisit your code from last week and recall what group number you were in (it should be an integer between 1 and 20). Navigate to the [https://communitydata.cc/~ads/teaching/2019/stats/data data repository for the course] and download the .csv file in the <code>week_03</code> subdirectory with your group number from PC1 last week associated with it (e.g., <code>group_<output>.csv</code>). Note that it is a .csv file and not an .RData file.
::'''PC1.5''' Open the dataset and take a look at it! You might use spreadsheet software (e.g., Google Docs, LibreOffice, Excel, etc.) to do this. It is also a good idea to open it in a text editor (e.g., Notepad) so you can inspect the structure of the "raw data." Manually inspecting the raw data is common and useful since it can help you figure out how best to read it into R. I won't ask about this in class, but I do recommend it.
:'''PC2.''' Read the CSV file into R using the <code>read.csv()</code> command.
:'''PC3.''' Get to know your data! Do whatever is necessary to summarize the new dataset. How many columns and rows are there? Report appropriate summary statistics for each variable (e.g., what are the ranges, minimums, maximums, means, medians, and standard deviations of the continuous variables?). Plot histograms for each of the variables to get a sense of what they look like.
:'''PC4.''' Use the <code>my.mean()</code> function distributed in this week's R lecture materials to recalculate the mean of the variable (column) named <code>x</code> in your dataset. Write your own function to recalculate the median of <code>x</code>. Be ready to walk us through how your function works!
:'''PC5.''' Load your vector from Week 2 again and perform the same cleanup steps you did in PC6 and PC7 last week (recode negative values as missing and log-transform the data).
:'''PC6.''' Compare the vector from Week 2 with the first column (<code>x</code>) of the Week 3 data frame. They should be similar, but how similar? Write R code to demonstrate or support your answer.
:'''PC7.''' Visualize the Week 3 data using <code>ggplot2</code> and the <code>geom_point()</code> function to produce a scatterplot. First, plot <code>x</code> on the x-axis and <code>y</code> on the y-axis. Second, visualize the other variables on other dimensions (e.g., color, shape, and size seem reasonable). If you run into any issues plotting these dimensions, consider that <code>ggplot2</code> can be very picky about the classes of objects...
:'''PC8.''' Cleaning and recoding data is a very common step when you import data and prepare it for analysis. Some of that is needed here. It turns out that the variables <code>i</code> and <code>j</code> are really dichotomous "true/false" variables that have been coded as 0 and 1 in this dataset. Recode these columns as <code>logical</code> (i.e., "TRUE" or "FALSE" values). The variable <code>k</code> is really a categorical variable. Recode it as a factor and change the numbers into the following levels: 0="none", 1="some", 2="lots", 3="all". The goal is to end up with a factor where those text strings are the levels of the factor.
:'''PC9.''' Now that you have cleaned and recoded your data, summarize those three variables again. Also, go back and regenerate the visualizations from PC7. How have the plots changed (if at all)?
:'''PC10.''' As always, save your work, archive the project (i.e., in a .zip file), and [https://canvas.northwestern.edu/courses/90927/assignments/578012 upload it to Canvas].
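
The kind of recoding described in PC8 can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the data frame <code>toy</code> and its values are made up and are not the course dataset. Only the column names <code>i</code>, <code>j</code>, and <code>k</code> and the level labels come from the problem set. Your own approach may reasonably differ.

```r
# Toy data frame with made-up values (NOT the course dataset).
# Column names i, j, k mirror the problem set; everything else is hypothetical.
toy <- data.frame(
  i = c(0, 1, 1, 0),
  j = c(1, 1, 0, 0),
  k = c(0, 3, 2, 1)
)

# Recode the 0/1 columns as logical: 0 becomes FALSE, 1 becomes TRUE.
toy$i <- as.logical(toy$i)
toy$j <- as.logical(toy$j)

# Recode the numeric codes as a factor.
# `levels` lists the codes found in the data; `labels` gives the text
# that replaces each code, in the same order.
toy$k <- factor(toy$k,
                levels = 0:3,
                labels = c("none", "some", "lots", "all"))

# Inspect the result: column classes first, then per-column summaries.
str(toy)
summary(toy)
```

After recoding, <code>summary()</code> reports counts of TRUE/FALSE values and of each factor level instead of means and quartiles, which is one reason to compare summaries before and after a recode.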
== Statistical Questions (from OpenIntro) ==

'''Exercises from OpenIntro §2'''

: '''Q0.''' Any questions or clarifications from the OpenIntro text or lecture notes?
: '''Q1.''' Exercise 3.4 on triathlon times
: '''Q2.''' Exercise 3.6, which is basically a continuation of 3.4
: '''Q3.''' Exercise 3.18 on evaluating normal approximation
: '''Q4.''' Exercise 3.32 on arachnophobia (spiders are a frequent concern in statistical programming)

== Empirical Paper Questions ==

There is no empirical paper this week.