Statistics and Statistical Programming (Fall 2020)/pset3

<small>[[Statistics_and_Statistical_Programming_(Fall_2020)#Week_5_.2810.2F13.2C_10.2F15.29|← Back to Week 5]]</small>


For this problem set, the programming challenges investigate the relationship between vehicle searches and driver attributes (especially race as recorded by the police officers conducting traffic stops) in Illinois. Doing so will involve some more advanced data wrangling, visualization, and analysis. We'll use data from [https://openpolicing.stanford.edu The Stanford Open Policing Project] (SOPP) covering traffic stops in Illinois between 2012 and 2017. The full SOPP dataset for Illinois is about 12 million rows, so I've created a 1% random sample for us to work with here. Overall, the dataset is well-documented and pretty "clean," but it still includes a number of features that may be confusing, weird, or poorly organized for answering the questions I've asked you below. Luckily, you know how to use R to solve these problems...


== Programming Challenges ==


=== PC1. Investigate the provenance of the data ===


Review the project description on the SOPP website, the codebooks provided both for the project as a whole and for the Illinois data specifically, and any ancillary materials that help you get oriented with the data. For the questions below, we'll focus on the following measures recorded for each stop: `date`, `vehicle_year`, `subject_race`, `subject_sex`, and `search_conducted`. Note any questions or issues related to these measures that you notice as you review the information about the project and dataset.


=== PC2. Import, explore, clean ===
As I noted above, the full IL SOPP dataset is over 12 million rows, so I have created a random 1% subset for us to work with in this assignment. That subset lives here. To get started, you'll want to import the data and explore its structure as well as the key variables we'll focus on in this analysis (`date`, `vehicle_year`, `subject_race`, `subject_sex`, and `search_conducted`). What data (if any) are missing? Inspect a random sample of rows to get a sense of the data. You may also want to clean and/or recode some of the key variables. Make sure to explain and justify any data cleanup and/or recoding steps you decide to take.
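To make this concrete, here is a minimal sketch of the import-and-explore steps using the tidyverse. The file name `il_statewide_sample.csv` is a placeholder for wherever you save the subset, and the recoding at the end is just one example of a step you would need to justify:
<syntaxhighlight lang="R">
library(tidyverse)

# Import the 1% sample; the file name is a placeholder -- adjust the path to
# wherever you saved the subset
stops <- read_csv("il_statewide_sample.csv")

# Basic structure: dimensions, column names, and column types
glimpse(stops)

# Inspect a random sample of rows
stops %>% slice_sample(n = 10)

# How much missing data is there in the key variables?
stops %>%
  select(date, vehicle_year, subject_race, subject_sex, search_conducted) %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# One example of a recoding step: store race and sex as factors
stops <- stops %>%
  mutate(
    subject_race = factor(subject_race),
    subject_sex  = factor(subject_sex)
  )
</syntaxhighlight>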
=== PC3. Summarize outcome and predictor variables ===
Calculate and report appropriate summary statistics for the outcome (`search_conducted`) and each of the predictor variables (`date`, `vehicle_year`, `subject_race`, `subject_sex`). Include visual and/or tabular summaries where appropriate. Attempt, when possible, to write efficient/elegant code that avoids unnecessary repetition.
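One possible sketch of these summaries is below. It assumes the `stops` data frame from the PC2 example above, and that `date` was parsed as a Date (readr usually handles ISO-formatted dates automatically):
<syntaxhighlight lang="R">
library(tidyverse)

# Counts and proportions for the outcome and the categorical predictors,
# written as a loop to avoid repeating the same pipeline three times
for (v in c("search_conducted", "subject_race", "subject_sex")) {
  stops %>%
    count(.data[[v]]) %>%
    mutate(prop = n / sum(n)) %>%
    print()
}

# Numeric and date variables
summary(stops$vehicle_year)
summary(stops$date)

# Visual summary: distribution of stops over time (binwidth is in days)
ggplot(stops, aes(x = date)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Date of stop", y = "Number of stops")
</syntaxhighlight>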
=== PC4. Summarize relationships between outcome and predictor variables ===
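One way to get started: compute the search rate conditional on each predictor. A sketch, assuming the `stops` data frame from the PC2 example and a logical `search_conducted` (the mean of a logical vector is the proportion of TRUE values):
<syntaxhighlight lang="R">
library(tidyverse)

# Search rate conditional on race; repeat (or loop) for the other predictors
stops %>%
  group_by(subject_race) %>%
  summarize(
    n_stops = n(),
    search_rate = mean(search_conducted, na.rm = TRUE)
  )

# Base R alternative: row-wise proportions from a two-way table
prop.table(table(stops$subject_race, stops$search_conducted), margin = 1)
</syntaxhighlight>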
=== PC5. Analyze relationships between driver race and vehicle searches over time ===
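A sketch of one approach, under the same assumptions as above: aggregate the search rate by year within each racial category, then plot one trend line per category:
<syntaxhighlight lang="R">
library(tidyverse)
library(lubridate)

# Search rate by year and race; one line per racial category
stops %>%
  mutate(year = year(date)) %>%
  group_by(year, subject_race) %>%
  summarize(
    search_rate = mean(search_conducted, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = year, y = search_rate, color = subject_race)) +
  geom_line() +
  labs(x = "Year", y = "Proportion of stops with a search", color = "Driver race")
</syntaxhighlight>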
=== PC6. Estimate population baselines for relevant racial categories ===
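There are several defensible ways to do this. One sketch uses the tidycensus package to pull American Community Survey estimates for Illinois; it assumes you have a (free) Census API key installed, and note that the ACS race table (B02001) is only an approximate match for SOPP's `subject_race` coding, which treats Hispanic as a separate category:
<syntaxhighlight lang="R">
library(tidyverse)
library(tidycensus)  # requires a free Census API key; see ?census_api_key

# ACS estimates for Illinois; B02001 ("Race") only approximates SOPP's coding
il_pop <- get_acs(
  geography = "state",
  state = "IL",
  variables = c(
    total = "B02001_001",
    white = "B02001_002",
    black = "B02001_003",
    asian = "B02001_005"
  ),
  year = 2017
)

# Convert statewide counts to proportions of the total population
il_pop %>%
  mutate(prop = estimate / estimate[variable == "total"]) %>%
  filter(variable != "total")
</syntaxhighlight>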
== Statistical Questions ==
=== SQ1. Interpret the results of PC3 ===
=== SQ2. Interpret the results of PC4 ===
=== SQ3. Interpret the results of PC5 ===
=== SQ4. Compare and interpret salient results of PC4 and PC6 ===
=== SQ5. Reflect on the limitations of your analysis ===
== OLD ==
PC3. Using the gov domains data, create a new data frame where one column is each month (as described in the data) and a second column is the total number of views made to all pages in the dataset over that month.

PC4. Using the mobile data, create a new data frame where one column is each month described in the data and the second is a measure (estimate?) of the total number of views made by mobile devices (all platforms) over each month. This will involve at least two steps since total views are not included. You'll first need to use the data there to create a measure of the total views for each line in the dataset.

PC5. Merge your two datasets together into a new dataset with columns for each month, total views (across the gov domain pages), and total mobile views. Make sure that month, in your merged dataset, is a date or datetime object in R. Are there any missing data? Can you tell why?

PC6. Create a new column in your merged dataset that describes your best estimate of the proportion of total views that come from mobile. Be able to talk about any assumptions/decisions you've made in constructing this measure.

PC7. Graph the proportion over time and be ready to describe: (a) your best estimate of the proportion of views from mobile devices to the Seattle City website over time and (b) an indication of whether it's going up or down.
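For reference, a rough sketch of the merge/proportion/plot flow in the old PC5–PC7 might look like this. The `gov_monthly` and `mobile_monthly` data frames, their column names, and their values are hypothetical stand-ins for whatever you would have built in the old PC3 and PC4:
<syntaxhighlight lang="R">
library(tidyverse)
library(lubridate)

# Hypothetical stand-ins for the monthly data frames built in old PC3/PC4;
# the values here are illustrative only
gov_monthly <- tibble(
  month = c("2019-01", "2019-02"),
  total_views = c(100000, 90000)
)
mobile_monthly <- tibble(
  month = c("2019-01", "2019-02"),
  mobile_views = c(40000, 42000)
)

# Old PC5: merge on month and make it a real date (assumes "YYYY-MM" strings)
merged <- full_join(gov_monthly, mobile_monthly, by = "month") %>%
  mutate(month = ymd(paste0(month, "-01")))

# Old PC6: best estimate of the proportion of views that come from mobile
merged <- merged %>%
  mutate(prop_mobile = mobile_views / total_views)

# Old PC7: plot the proportion over time
ggplot(merged, aes(x = month, y = prop_mobile)) +
  geom_line() +
  labs(x = "Month", y = "Proportion of views from mobile")
</syntaxhighlight>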