Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 4

Programming Challenges

This week we are going to do more advanced data wrangling. We'll use two datasets from data.seattle.gov related to online engagement with the city of Seattle's websites around 2015-2016. Both datasets are drawn from Google Analytics. We're going to merge them together in order to analyze the proportion of pageviews from mobile users and determine whether this proportion is going up or down over time.

The datasets are messy and aren't set up well to help us answer the question. Luckily, you know how to use R to solve these problems!

PC1. The two "raw" datasets come from data.seattle.gov and are available in the course data repository as well as from these links: COS-Statistics-Gov-Domains-Only and COS-Statistics-Mobile Sessions. You may want to visit the links to read the codebook for each dataset.

PC2. Load both datasets into R as separate data frames. Assume, for the purposes of this assignment, that the two datasets include pageview data for the same population of websites. Explore the data to get a sense of the structure. What are the columns and rows? Is any data missing? Write code to take a random sample of rows and then look at them! Inspect a few different samples to get more familiar with the data.
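A minimal sketch of the kind of exploration PC2 asks for, using base R. The data frame `d` and its columns `month` and `pageviews` are made-up stand-ins, not the actual columns of either Seattle dataset; check the codebooks for the real names.

```r
# Toy data frame standing in for a loaded dataset (hypothetical columns and values).
set.seed(42)
d <- data.frame(
  month = rep(c("01/01/2015", "02/01/2015", "03/01/2015"), each = 4),
  pageviews = c(120, NA, 87, 45, 300, 12, NA, 64, 90, 33, 71, 208)
)

str(d)                        # column names and types
n_missing <- colSums(is.na(d))  # number of missing values in each column
n_missing

# Draw 5 row indices at random (without replacement) and look at those rows.
sampled <- d[sample(nrow(d), 5), ]
sampled
```

Rerunning the last two lines (without resetting the seed) gives a different sample each time, which is a quick way to see more of the data.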

PC3. Using the gov domains data, create a new data frame where one column is each month (as described in the data) and a second column is the total number of views made to all pages in the dataset over that month.
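One possible way to build a month-by-total table like the one in PC3 is `tapply()`. This is only a sketch on toy data: `views`, `month`, and `pageviews` are hypothetical names, and dropping missing values with `na.rm = TRUE` is a decision you should be ready to justify.

```r
# Toy stand-in for the gov domains data (hypothetical columns and values).
views <- data.frame(
  month = c("01/01/2015", "01/01/2015", "02/01/2015", "02/01/2015", "02/01/2015"),
  pageviews = c(100, 50, 200, NA, 25)
)

# Group pageviews by month and sum within each group, ignoring missing values.
totals <- tapply(views$pageviews, views$month, sum, na.rm = TRUE)

# Turn the named vector into a two-column data frame.
monthly_total <- data.frame(
  month = names(totals),
  total_views = as.numeric(totals)
)
monthly_total
```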

PC4. Using the mobile data, create a new data frame where one column is each month described in the data and the second is a measure (estimate?) of the total number of views made by mobile devices (all platforms) over each month. This will involve at least two steps since total views are not included. You'll need to first use the data there to create a measure of the total views for each line in the dataset.
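The two steps in PC4 might look something like the sketch below. It assumes, purely for illustration, that each row reports a number of sessions and an average number of pages viewed per session; the column names `sessions` and `pages_per_session` are hypothetical, so read the codebook to see what the mobile data actually contain.

```r
# Toy stand-in for the mobile data (hypothetical columns and values).
mobile <- data.frame(
  month = c("01/01/2015", "01/01/2015", "02/01/2015", "02/01/2015"),
  os = c("iOS", "Android", "iOS", "Android"),
  sessions = c(10, 20, 30, 40),
  pages_per_session = c(2.0, 1.5, 3.0, 2.5)
)

# Step 1: estimate total views for each row (assumed: sessions x pages per session).
mobile$est_views <- mobile$sessions * mobile$pages_per_session

# Step 2: sum the row-level estimates within each month, across all platforms.
mobile_totals <- tapply(mobile$est_views, mobile$month, sum, na.rm = TRUE)
monthly_mobile <- data.frame(
  month = names(mobile_totals),
  mobile_views = as.numeric(mobile_totals)
)
monthly_mobile
```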

PC5. Merge your two datasets together into a new dataset with columns for each month, total views (across the gov domain pages) and total mobile views. Make sure that month, in your merged dataset, is a date or datetime object in R. Are there any missing data? Can you tell why?
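A sketch of the merge in PC5 using base R's `merge()`. The two input data frames are toy examples, and the date format string `"%m/%d/%Y"` is an assumption; use whatever format the month column in the real data follows.

```r
# Toy monthly tables (hypothetical values); note that March has no mobile row.
total <- data.frame(
  month = c("01/01/2015", "02/01/2015", "03/01/2015"),
  total_views = c(150, 225, 310)
)
mobile <- data.frame(
  month = c("01/01/2015", "02/01/2015"),
  mobile_views = c(50, 190)
)

# all = TRUE keeps months that appear in only one table, filling the gaps with NA.
merged <- merge(total, mobile, by = "month", all = TRUE)

# Convert the month strings to Date objects (assumed format: month/day/year).
merged$month <- as.Date(merged$month, format = "%m/%d/%Y")
merged <- merged[order(merged$month), ]

# Rows with at least one missing value point to months covered by only one dataset.
merged[!complete.cases(merged), ]
```

Keeping unmatched months with `all = TRUE` makes the missing data visible instead of silently dropping those rows, which is the default behavior of `merge()`.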

PC6. Create a new column in your merged dataset that describes your best estimate of the proportion of total views that come from mobile. Be able to talk about any assumptions/decisions you've made in constructing this measure.
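With a merged table in hand, the proportion in PC6 is a single vectorized division. The data frame below is a toy example with hypothetical column names; months with missing mobile views come out as `NA` rather than zero.

```r
# Toy merged table (hypothetical values).
merged <- data.frame(
  month = as.Date(c("2015-01-01", "2015-02-01", "2015-03-01")),
  total_views = c(200, 400, 500),
  mobile_views = c(50, 120, NA)
)

# Proportion of all views that came from mobile devices, month by month.
merged$prop_mobile <- merged$mobile_views / merged$total_views
merged
```

Note that this ratio is only meaningful if the two datasets really cover the same population of websites, which the assignment asks you to assume.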

PC7. Graph the proportion over time and be ready to describe: (a) your best estimate of the proportion of views from mobile devices to the Seattle City website over time and (b) an indication of whether it's going up or down.
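One way to approach PC7 with base R graphics is sketched below. The proportions are invented for illustration and say nothing about the real Seattle data. Fitting a straight line with `lm()` is just one rough indication of direction, not the only reasonable choice.

```r
# Toy monthly proportions (made-up values, not results from the Seattle data).
d <- data.frame(
  month = seq(as.Date("2015-01-01"), by = "month", length.out = 6),
  prop_mobile = c(0.20, 0.22, 0.21, 0.25, 0.27, 0.30)
)

# Line-and-point plot of the proportion over time.
plot(d$month, d$prop_mobile, type = "b",
     xlab = "Month", ylab = "Proportion of views from mobile")

# A linear fit gives a rough sense of the trend; dates are converted to days.
fit <- lm(prop_mobile ~ as.numeric(month), data = d)
abline(fit, lty = 2)

# A positive slope suggests the proportion is rising; the unit is change per day.
slope_per_day <- unname(coef(fit)[2])
slope_per_day
```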

Statistical Questions

Exercises from OpenIntro §4

SQ0. Any questions or clarifications from the OpenIntro text or lecture notes?

SQ1. Exercise 4.8 on Twitter users and news

SQ2. Exercise 4.10 which is a continuation of 4.8

SQ3. Exercise 4.19 on online communication

SQ4. Exercise 4.32 which is asking you to explain why certain statements about statistical inference are true or false

Empirical Paper

Revisit the paper we read for Week 1 of the course:

Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks. Proceedings of the National Academy of Sciences 111(24):8788–90. [Open Access]

Come to class prepared to discuss your answers to the following questions:

EQ1. Write down, in your own words, the key pairs of null/alternative hypotheses tested in the paper (hint: the four pairs that correspond to the main effects represented in the figure).

EQ2. Describe, in your own words, the main effects estimated in the paper for these four key pairs of hypotheses.

EQ3. The authors report Cohen's d along with their regression estimates of the main effects. Look up the formula for Cohen's d. Discuss the substantive or practical significance of the estimates given the magnitudes of the d values reported.
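For reference, one common textbook form of Cohen's d for two independent groups divides the difference in group means by the pooled standard deviation. The notation here (group means, standard deviations, and sizes indexed 1 and 2) is introduced only for illustration, and the paper may compute d with a different variant, so check what the authors report.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

Cohen's conventional benchmarks treat values of d around 0.2 as small, 0.5 as medium, and 0.8 as large, which gives a starting point for judging the magnitudes reported in the paper.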