Statistics and Statistical Programming (Winter 2021)/Problem set 7

From CommunityData

== Programming Challenges ==


'''Do police in the United States engage in discriminatory behavior on the basis of race and ethnicity?''' For this problem set, you will investigate the relationship between traffic stops, vehicle searches, and driver attributes (especially race as recorded by the police officers conducting the stops). Doing so will involve some more advanced data wrangling, visualization, and analysis. We'll use data from [https://openpolicing.stanford.edu The Stanford Open Policing Project] (SOPP) covering records of traffic stops in Washington state between January 1, 2009 and September 30, 2018. The full SOPP dataset for Washington is about 11 million rows, so I've created a 1% random sample for us to work with here.


Overall, the dataset is well-documented and pretty "clean" (as far as these things go), but there are still a number of features that may be confusing, weird, and/or poorly organized for answering the questions I've asked you below. Thank goodness you know how to use R to address these issues...
Review the project overview on the [https://openpolicing.stanford.edu/ SOPP homepage], the [https://openpolicing.stanford.edu/data/ overview of the data], the [https://github.com/stanford-policylab/opp/blob/master/data_readme.md#description-of-standardized-data description of the standardized data], the [https://github.com/stanford-policylab/opp/blob/master/data_readme.md#statewide-wa codebook/notes for the Washington data] from the [https://github.com/stanford-policylab/opp/blob/master/data_readme.md data_readme.md], as well as any other ancillary materials that you can find that seem likely to help you get oriented with the data.  


For the questions below we'll focus on the following measures recorded for each traffic stop in Washington from 2009 to 2018: <code>date</code>, <code>subject_age</code>, <code>subject_race</code>, <code>subject_sex</code>, and <code>search_conducted</code>.


Record any questions or issues you might notice related to these measures as you review the information about the project and dataset.
=== PC2. Import, explore, clean ===


As I mentioned above, the full WA-SOPP dataset is over 11 million rows, so I have created a random 1% subset for us to work with in this assignment, which is available in [https://www.dropbox.com/home/COM520-shared_files-UW-2021-Q1/datasets/problem_set_7?preview=wa_statewide_2020_04_01-com520_1pct_sample.csv our Dropbox repository here]. It's about 28MB.


To get started, you'll want to import the data and explore its structure as well as the key variables that we'll be focusing on in this analysis (<code>date</code>, <code>subject_age</code>, <code>subject_race</code>, <code>subject_sex</code>, and <code>search_conducted</code>). Inspect a random sample of rows to get a sense of the data. What (if anything) is missing? You may also want to clean/recode some of the key variables. Make sure to explain and justify any data cleanup and/or recoding steps you decide to take.
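
As one possible starting point (a rough sketch, not the expected solution), the import/explore/clean steps might look something like the following in base R. The toy data frame here is made up purely so the code runs on its own; its values are not from the SOPP data. The commented-out <code>read.csv()</code> line assumes the sample file has been downloaded into your working directory under the filename shown in the Dropbox link. The particular recoding choices (dates as <code>Date</code>, race and sex as factors, searches as logical) are assumptions that you should evaluate and justify for yourself.

```r
# Toy stand-in for the real 1% sample; these values are invented for illustration.
# With the downloaded file in your working directory you might instead run:
# stops <- read.csv("wa_statewide_2020_04_01-com520_1pct_sample.csv",
#                   stringsAsFactors = FALSE)
stops <- data.frame(
  date = c("2009-01-01", "2012-06-15", "2018-09-30", NA, "2015-03-02"),
  subject_age = c(34, NA, 51, 22, 19),
  subject_race = c("white", "black", "hispanic", NA, "asian/pacific islander"),
  subject_sex = c("male", "female", "male", "female", NA),
  search_conducted = c("FALSE", "TRUE", "FALSE", NA, "FALSE"),
  stringsAsFactors = FALSE
)

key_vars <- c("date", "subject_age", "subject_race",
              "subject_sex", "search_conducted")

# Look at the structure of the key variables.
str(stops[, key_vars])

# Inspect a random sample of rows (set a seed so the sample is reproducible).
set.seed(520)
print(stops[sample(nrow(stops), 3), key_vars])

# Count missing values in each key variable.
missing_counts <- colSums(is.na(stops[, key_vars]))
print(missing_counts)

# Recode: dates as Date, categorical variables as factors, searches as logical.
# (read.csv() may already parse a TRUE/FALSE column as logical; as.logical()
# is harmless in that case.)
stops$date <- as.Date(stops$date, format = "%Y-%m-%d")
stops$subject_race <- factor(stops$subject_race)
stops$subject_sex <- factor(stops$subject_sex)
stops$search_conducted <- as.logical(stops$search_conducted)

# Summarize the recoded variables, including NA counts.
print(summary(stops[, key_vars]))
```

Whatever approach you take, checking the missing-value counts before and after recoding is a quick way to confirm that a conversion did not silently turn valid values into <code>NA</code>.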