Statistics and Statistical Programming (Fall 2020)/pset3

Revision as of 17:47, 29 September 2020 by Aaronshaw



For this problem set, the programming challenges focus on some more advanced data wrangling, visualization, and analysis. We'll use a dataset from The Stanford Open Policing Project (SOPP). The dataset includes information about traffic stops in the state of Illinois between 2012 and 2017. The datasets are pretty "clean," but they include a number of features that may be confusing and/or aren't set up well to help us answer our questions. Luckily, you know how to use R to solve these problems!

Programming Challenges

   PC1. The two "raw" datasets come from data.seattle.gov and are available in the course data repository as well as from these links: COS-Statistics-Gov-Domains-Only and COS-Statistics-Mobile Sessions. You may want to visit the links to read the codebook for each dataset.
   PC2. Load both datasets into R as separate data frames. Assume, for the purposes of this assignment, that the two datasets include pageview data for the same population of websites. Explore the data to get a sense of the structure. What are the columns, rows, missing data, etc.? Write code to take a random sample of rows and then look at them! Consider inspecting a few different samples to get more familiar with the data.
   PC3. Using the gov domains data, create a new data frame where one column is each month (as described in the data) and a second column is the total number of views made to all pages in the dataset over that month.
   PC4. Using the mobile data, create a new data frame where one column is each month described in the data and the second is a measure (estimate?) of the total number of views made by mobile devices (all platforms) over each month. This will involve at least two steps since total views are not included. You'll first need to use the data that are there to create a measure of the total views for each row in the dataset.
   PC5. Merge your two datasets together into a new dataset with columns for each month, total views (across the gov domain pages), and total mobile views. Make sure that month, in your merged dataset, is a date or datetime object in R. Are there any missing data? Can you tell why?
   PC6. Create a new column in your merged dataset that describes your best estimate of the proportion of total views that come from mobile. Be able to talk about any assumptions/decisions you've made in constructing this measure.
   PC7. Graph the proportion over time and be ready to describe: (a) your best estimate of the proportion of views from mobile devices to the Seattle City website over time and (b) an indication of whether it's going up or down.
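
As a starting point, the aggregate-merge-plot workflow in PC2 through PC7 might look roughly like the sketch below. It uses only base R and two tiny made-up data frames rather than the real files. Every column name (month, domain, pageviews, os, sessions, pages_per_session), every value, and the date format are assumptions for illustration, so check the codebooks for the actual names. Estimating mobile views as sessions times pages per session is also just one possible choice, not necessarily the intended one.

```r
# Toy stand-ins for the two datasets. All column names and values are
# hypothetical; the real files may be structured differently.
gov <- data.frame(
  month = c("01/01/2016", "01/01/2016", "02/01/2016", "02/01/2016", "03/01/2016"),
  domain = c("a.gov", "b.gov", "a.gov", "b.gov", "a.gov"),
  pageviews = c(100, 50, 120, 80, 200),
  stringsAsFactors = FALSE
)
mobile <- data.frame(
  month = c("01/01/2016", "01/01/2016", "02/01/2016"),
  os = c("iOS", "Android", "iOS"),
  sessions = c(10, 5, 20),
  pages_per_session = c(2, 4, 3),
  stringsAsFactors = FALSE
)

# PC2: inspect a random sample of rows (rerun to see different rows)
print(gov[sample(nrow(gov), 3), ])

# PC3: total views across all pages for each month
gov_monthly <- aggregate(pageviews ~ month, data = gov, FUN = sum)
names(gov_monthly)[2] <- "total_views"

# PC4: first estimate views per row (assumed: sessions * pages per session),
# then sum those estimates within each month
mobile$est_views <- mobile$sessions * mobile$pages_per_session
mobile_monthly <- aggregate(est_views ~ month, data = mobile, FUN = sum)
names(mobile_monthly)[2] <- "mobile_views"

# PC5: merge on month, keeping months that appear in only one dataset,
# then convert month from character to Date (format assumed to be m/d/Y)
merged <- merge(gov_monthly, mobile_monthly, by = "month", all = TRUE)
merged$month <- as.Date(merged$month, format = "%m/%d/%Y")
merged <- merged[order(merged$month), ]

# PC6: proportion of total views that come from mobile
merged$prop_mobile <- merged$mobile_views / merged$total_views

# PC7: plot the proportion over time (months with NA appear as gaps)
plot(merged$month, merged$prop_mobile, type = "b",
     xlab = "Month", ylab = "Proportion of views from mobile")
```

Using all = TRUE in merge() keeps months that appear in only one of the two data frames, so any gaps in coverage show up as NA values instead of disappearing silently. Here the toy mobile data has no rows for March, so the merged row for that month has NA for both mobile_views and prop_mobile.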