Statistics and Statistical Programming (Fall 2020)/pset3
Revision as of 03:56, 6 October 2020
For this problem set, the programming challenges investigate the relationship between vehicle searches and driver attributes (especially race as recorded by police officers conducting traffic stops) in Illinois. Doing so will involve some more advanced data wrangling, visualization, and analysis. We'll use data from [https://openpolicing.stanford.edu The Stanford Open Policing Project] (SOPP) covering records of traffic stops in Illinois between 2012 and 2017. The full SOPP dataset for Illinois is about 12 million rows, so I've created a 1% random sample for us to work with here. Overall, the dataset is well-documented and pretty "clean," but there are still a number of features that may be confusing, weird, and/or poorly organized for answering the questions I've asked you below. Luckily, you know how to use R to solve these problems...
== Programming Challenges ==
=== PC1. Investigate the provenance of the data ===
Review the project description on the SOPP website, the codebook provided for the project as a whole as well as for the Illinois data specifically, as well as any ancillary materials that help you get oriented with the data. For the questions below we'll focus on the following measures recorded for each stop: `date`, `vehicle_year`, `subject_race`, `subject_sex`, and `search_conducted`. Note any questions or issues you might notice related to these measures as you review the information about the project and dataset.
=== PC2. Import, explore, clean ===
As I noted above, the full IL SOPP dataset is over 12 million rows, so I have created a random 1% subset for us to work with in this assignment. That subset lives here. To get started, you'll want to import the data and explore its structure as well as key variables that we'll be focusing on in this analysis (`date`, `vehicle_year`, `subject_race`, `subject_sex`, and `search_conducted`). What data (if any) is missing? Inspect a random sample of rows to get a sense of the data. You may also want to clean/recode some of the key variables. Make sure to explain and justify any data cleanup and/or recoding steps you decide to take.
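A minimal sketch of these steps, using the tidyverse (the filename below is a placeholder, not the sample's actual name — substitute the path to the 1% subset once you've downloaded it):

```r
library(tidyverse)

# Placeholder filename -- replace with the actual path to the 1% sample
stops <- read_csv("il_stops_sample.csv")

glimpse(stops)  # overall structure: column names, types, first values

key_vars <- c("date", "vehicle_year", "subject_race",
              "subject_sex", "search_conducted")

# Count missing values in each key variable
stops %>%
  summarize(across(all_of(key_vars), ~ sum(is.na(.x))))

# Inspect a random sample of rows
stops %>%
  slice_sample(n = 10)
```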
=== PC3. Summarize outcome and predictor variables ===
Calculate and report appropriate summary statistics for the outcome (`search_conducted`) and each of the predictor variables (`date`, `vehicle_year`, `subject_race`, `subject_sex`). Include visual and/or tabular summaries where appropriate. Attempt, when possible, to write efficient/elegant code that avoids unnecessary repetition.
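One way to avoid repetition (assuming a `stops` data frame imported as in PC2, with `search_conducted` stored as a logical) is to loop over the categorical variables instead of copy-pasting a separate frequency-table call for each:

```r
library(tidyverse)

# One map() call produces a frequency table per categorical variable
categorical_vars <- c("subject_race", "subject_sex", "search_conducted")
map(categorical_vars, ~ count(stops, .data[[.x]], sort = TRUE))

# Numeric/date summaries
stops %>%
  summarize(
    earliest_stop       = min(date, na.rm = TRUE),
    latest_stop         = max(date, na.rm = TRUE),
    median_vehicle_year = median(vehicle_year, na.rm = TRUE),
    prop_searched       = mean(search_conducted, na.rm = TRUE)
  )
```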
=== PC4. Summarize relationships between outcome and predictor variables ===
=== PC5. Analyze relationships between driver race and vehicle searches over time ===
=== PC6. Estimate population baselines for relevant racial categories ===
== Statistical Questions ==
=== SQ1. Interpret the results of PC3 ===
=== SQ2. Interpret the results of PC4 ===
=== SQ3. Interpret the results of PC5 ===
=== SQ4. Compare and interpret salient results of PC4 and PC6 ===
=== SQ5. Reflect on the limitations of your analysis ===
== OLD ==
PC3. Using the gov domains data, create a new data frame where one column is each month (as described in the data) and a second column is the total number of views made to all pages in the dataset over that month.
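A sketch of this aggregation, with a toy stand-in for the gov domains data (the column names `month` and `views` are assumptions, not the dataset's actual names):

```r
library(dplyr)

# Toy stand-in for the gov domains data (assumed column names)
gov <- tibble(
  month = c("2019-07", "2019-07", "2019-08"),
  views = c(100, 50, 80)
)

# One row per month, summing views across all pages in that month
monthly_views <- gov %>%
  group_by(month) %>%
  summarize(total_views = sum(views, na.rm = TRUE))
```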
PC4. Using the mobile data, create a new data frame where one column is each month described in the data and the second is a measure (estimate?) of the total number of views made by mobile devices (all platforms) over each month. This will involve at least two steps since total views are not included. You'll need to first use the data there to create a measure of the total views for each line in the dataset.
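The two steps might look like the following sketch, with a toy stand-in for the mobile data (the per-platform column names are assumptions; the real dataset's columns will differ):

```r
library(dplyr)

# Toy stand-in for the mobile data (assumed column names)
mobile <- tibble(
  month         = c("2019-07", "2019-08"),
  views_ios     = c(30, 40),
  views_android = c(20, 25)
)

monthly_mobile <- mobile %>%
  # Step 1: total views for each line, summed across platform columns
  mutate(row_total = rowSums(across(starts_with("views_")))) %>%
  # Step 2: aggregate those row totals by month
  group_by(month) %>%
  summarize(total_mobile_views = sum(row_total, na.rm = TRUE))
```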
PC5. Merge your two datasets into a new dataset with columns for each month, total views (across the gov domain pages), and total mobile views. Make sure that month, in your merged dataset, is a date or datetime object in R. Are there any missing data? Can you tell why?
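A sketch of the merge, using toy versions of the two monthly data frames (all names and the "YYYY-MM" month format are assumptions):

```r
library(dplyr)
library(lubridate)

# Toy versions of the two monthly data frames (assumed names/values)
monthly_views  <- tibble(month = c("2019-07", "2019-08"),
                         total_views = c(230, 80))
monthly_mobile <- tibble(month = c("2019-08", "2019-09"),
                         total_mobile_views = c(65, 70))

# full_join() keeps months that appear in either dataset
merged <- full_join(monthly_views, monthly_mobile, by = "month") %>%
  mutate(month = ymd(paste0(month, "-01")))  # parse into a Date object

# NAs appear where a month exists in only one of the two datasets
filter(merged, is.na(total_views) | is.na(total_mobile_views))
```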
PC6. Create a new column in your merged dataset that describes your best estimate of the proportion of total views that come from mobile. Be able to talk about any assumptions/decisions you've made in constructing this measure.
PC7. Graph the proportion over time and be ready to describe: (a) your best estimate of the proportion of views from mobile devices to the Seattle City website over time and (b) an indication of whether it's going up or down.
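The proportion (PC6) and the plot (PC7) might be sketched as below, with a toy stand-in for the merged dataset from PC5 (hypothetical names and values). Note the assumption baked into the denominator: mobile views are treated as a subset of total views.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the merged dataset from PC5 (assumed names/values)
merged <- tibble(
  month              = as.Date(c("2019-07-01", "2019-08-01", "2019-09-01")),
  total_views        = c(230, 80, 120),
  total_mobile_views = c(50, 65, 70)
) %>%
  # PC6: proportion of views from mobile, assuming mobile views
  # are a subset of total views
  mutate(prop_mobile = total_mobile_views / total_views)

# PC7: proportion over time, with a linear trend line as a rough
# indication of whether it's going up or down
ggplot(merged, aes(x = month, y = prop_mobile)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Month", y = "Proportion of views from mobile")
```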