Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 3

Please note: if you have trouble loading your dataset (PC2 below), contact Jeremy or me ASAP, since you will only be able to do the other challenges once you've completed that one.

Programming Challenges

PC0. Create a new project and RMarkdown script for this week's problem set (as usual).

PC1. Revisit your code from last week and recall what group number you were in (it should be an integer between 1 and 20). Navigate to the data repository for the course and download the .csv file in the week_03 subdirectory that has your group number from PC1 last week associated with it (e.g., group_<output>.csv).

PC1.5 Open the dataset and take a look at it! You might use spreadsheet software (e.g., Google Sheets, LibreOffice, Excel, etc.) to do this, but it is also a good idea to open it in a text editor (e.g., Notepad) so you can inspect the structure of the "raw data." Manually inspecting the raw data is common and useful since it can help you figure out how best to read it into R. I won't ask about this in class, but I do recommend it.

PC2. Read the CSV file into R using the read.csv() command.
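
If read.csv() is new to you, here is a minimal sketch of how it behaves. The file below is a made-up stand-in written to a temporary location so the example runs on its own; the column names x, y, i, j, and k follow the ones mentioned later in this problem set, and the values are invented. Your real file will have a different path and contents.

```r
# Write a tiny stand-in CSV to a temporary file so the example is self-contained.
# (Hypothetical data; the real file comes from the course data repository.)
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y,i,j,k",
             "1.5,2.0,0,1,2",
             "3.2,4.1,1,0,0"), tmp)

# By default read.csv() expects a header row and comma-separated fields.
d <- read.csv(tmp)

str(d)   # column names and the class R assigned to each column
head(d)  # first few rows
```

With your own data, replace tmp with the path to the file you downloaded (relative to your project directory is usually easiest).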

PC3. Get to know your data! Do whatever is necessary to summarize the new dataset. How many columns and rows are there? Report appropriate summary statistics for each variable (e.g., what are the ranges, minimums, maximums, means, medians, and standard deviations of the continuous variables?). Plot histograms for all of the variables to get a sense of what they look like.
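
As a rough starting point, the following sketch shows a few base R functions that cover most of what PC3 asks for. It uses a simulated data frame called toy (an invented name with invented values), so the numbers mean nothing; the point is only which functions you might reach for.

```r
# Simulated stand-in data (hypothetical; not the course dataset).
set.seed(1)
toy <- data.frame(x = rnorm(50),
                  y = rnorm(50),
                  k = sample(0:3, 50, replace = TRUE))

dim(toy)       # number of rows, then number of columns
summary(toy)   # min, quartiles, median, mean, and max per column

# summary() does not report standard deviations, so compute them separately.
sapply(toy, sd, na.rm = TRUE)

# One histogram per column.
for (v in names(toy)) {
  hist(toy[[v]], main = v, xlab = v)
}
```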

PC4. Use the my.mean() function distributed in this week's R lecture materials to recalculate the mean of the variable (column) named "x" in your dataset. Write your own function to re-compute the median of "x". Be ready to walk us through how your function works!
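
If you get stuck on the median, here is one possible approach, offered as a sketch rather than the expected answer. The name my.median is made up for this example and is not part of the lecture materials. It assumes you want missing values ignored, which sort() does by default.

```r
# One possible median function (hypothetical name; other designs are fine too).
my.median <- function(v) {
  v <- sort(v)                 # sort() drops NA values by default
  n <- length(v)
  if (n == 0) {
    return(NA_real_)           # nothing left to summarize
  }
  if (n %% 2 == 1) {
    v[(n + 1) / 2]             # odd length: the single middle value
  } else {
    mean(v[c(n / 2, n / 2 + 1)])  # even length: average the two middle values
  }
}

my.median(c(3, 1, 2))
my.median(c(4, 1, 3, 2))
```

A quick sanity check is to compare the result against the built-in median() with na.rm = TRUE on the same vector.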

PC5. Load your vector from Week 2 again and perform the same cleanup steps you did in PC6 and PC7 last week (recode negative values as missing and log-transform the data).
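
As a reminder of what those cleanup steps can look like, here is a small sketch on an invented vector. It assumes the natural log (the default for log()); if you used a different base last week, use the same one here.

```r
# Invented example vector with negative and missing values.
v <- c(2.5, -1, 10, NA, 0.5, -3)

# Recode negative values as missing. The !is.na() guard keeps
# already-missing entries out of the comparison.
v[!is.na(v) & v < 0] <- NA

# Log-transform; NA values stay NA.
v.log <- log(v)

v
v.log
```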

PC6. Compare the vector from Week 2 with the first column (x) of the Week 3 data frame. They should be similar, but how similar? Write R code to demonstrate or support your answer.
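
There are several ways to ask "how similar?" in R. The sketch below uses two invented vectors, a and b, and assumes for illustration that one is a rounded copy of the other; whether that is what happened in your data is for you to find out.

```r
# Invented vectors: b is a rounded to four decimal places.
a <- c(1.234567, 2.345678, NA, 4.567891)
b <- c(1.2346, 2.3457, NA, 4.5679)

identical(a, b)                 # strict, bit-for-bit comparison
all.equal(a, b)                 # comparison within a small numeric tolerance
max(abs(a - b), na.rm = TRUE)   # largest absolute difference
summary(a - b)                  # distribution of the differences

# Does rounding one vector make them match?
isTRUE(all.equal(round(a, 4), b))
```

Note that all.equal() returns a description of the difference rather than FALSE when the vectors differ, which is why it is wrapped in isTRUE() when a plain TRUE/FALSE is needed.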

PC7. Visualize the Week 3 data using ggplot2 and the geom_point() function to produce a scatterplot. First, plot x on the x-axis and y on the y-axis. Second, visualize i, j, and k on other dimensions (e.g., color, shape, and size seem reasonable). If you run into any issues plotting these dimensions, consider that ggplot2 can be very picky about the classes of objects...
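
For orientation, here is a sketch of a layered scatterplot on simulated data. The data frame toy and its values are invented. The choice of which variable goes on which aesthetic is arbitrary, and wrapping i and j in factor() is just one way to hand ggplot2 a discrete variable where it expects one.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical; not the course dataset).
set.seed(2)
toy <- data.frame(x = rnorm(30),
                  y = rnorm(30),
                  i = sample(0:1, 30, replace = TRUE),
                  j = sample(0:1, 30, replace = TRUE),
                  k = sample(0:3, 30, replace = TRUE))

# x and y set the position; the other variables map to color, shape, and size.
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = factor(i), shape = factor(j), size = k))

print(p)
```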

PC8. A very common step when you import and prepare data for analysis is cleaning and recoding it. Some of that is needed here. It turns out that the variables i and j are really dichotomous "true/false" variables that have been coded as 0 and 1 in this dataset. Recode these columns as logical (i.e., "TRUE" or "FALSE" values). The variable k is really a categorical variable. Recode this as a factor and change the numbers into the following levels: 0="none", 1="some", 2="lots", 3="all". The goal is to end up with a factor where those text strings are the levels of the factor.
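
Here is a sketch of this kind of recoding on a tiny invented data frame. It relies on as.logical() treating 0 as FALSE and any other number as TRUE, and on the levels and labels arguments of factor() being matched by position.

```r
# Tiny invented data frame using the same coding scheme described above.
toy <- data.frame(i = c(0, 1, 1),
                  j = c(1, 0, 1),
                  k = c(0, 3, 2))

# 0/1 to FALSE/TRUE.
toy$i <- as.logical(toy$i)
toy$j <- as.logical(toy$j)

# Numeric codes to labeled factor levels (levels and labels pair up in order).
toy$k <- factor(toy$k,
                levels = 0:3,
                labels = c("none", "some", "lots", "all"))

str(toy)
table(toy$k)
```

Listing all four levels explicitly keeps "some" as a level even though no row in this toy example has k equal to 1.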

PC9. Now that you have cleaned and recoded your data, summarize those three variables again. Also, go back and regenerate the visualizations from PC7. How have the plots changed (if at all)?

PC10. As always, save your work, archive the project (i.e., in a .zip file), and upload it to Canvas.

Statistical Questions (from OpenIntro)

Exercises from OpenIntro §3

Q0. Any questions or clarifications from the OpenIntro text or lecture notes?

Q1. Exercise 3.4 on triathlon times

Q2. Exercise 3.6, which is basically a continuation of 3.4

Q3. Exercise 3.18 on evaluating normal approximation

Q4. Exercise 3.32 on arachnophobia (spiders are a frequent concern in statistical programming)

Empirical Paper Questions

There will be no empirical paper this week.