Community Data Science Course (Sprint 2019)/Day 7 Notes

From CommunityData
Revision as of 04:07, 15 May 2019 by Guyrt (talk | contribs)

We will be discussing this data set.


  • One of the most important qualities of the Scientific Revolution was that results were broadly shared, so new results could build on top of existing knowledge.
  • Repeatability is the key to science (even data science): your results are only scientific if they are repeatable by a third party.

Today's Lecture

Let's go end to end on a data question: are there factors that predict injuries and fatalities in automobile accidents?

  • Download data
  • Explore the data: find missing values, identify categorical, numerical, ordinal data fields
  • Transform (filter, project)
  • Analyze

Download

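import os
import requests

url = 'https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv'
response = requests.get(url)

path = os.path.expanduser('~/Desktop/collisions.csv')   # open() does not expand '~', so do it here
filehandle = open(path, 'wb')                            # 'wb' because response.content is bytes
filehandle.write(response.content)                       # content is an attribute, not a function
filehandle.close()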


Opening a file is new. Note that "open" can open a file for reading or for writing. Be careful: opening an existing file for writing erases its contents, and you cannot get them back.
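To see the difference between the modes without risking a real file, here is a small sketch that works in a temporary directory. The file name demo.txt is made up for the example.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is harmed.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')   # hypothetical file name

handle = open(path, 'w')      # 'w' creates the file (or empties an existing one)
handle.write('first version\n')
handle.close()

handle = open(path, 'w')      # opening with 'w' again erases 'first version'
handle.write('second version\n')
handle.close()

handle = open(path, 'a')      # 'a' appends to the end instead of erasing
handle.write('an extra line\n')
handle.close()

handle = open(path, 'r')      # 'r' only reads; the file is left unchanged
contents = handle.read()
handle.close()

print(contents)
```

The printed contents hold the second version and the appended line only; the first version is gone.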

Explore

Open the file in Excel. Which columns are sometimes missing values?

Find a categorical, numerical, and ordinal data field.
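If you would rather count missing values in Python than eyeball them in Excel, something like the sketch below could work. The sample rows and the column names (location_type, person_count, severity) are invented for illustration and are not taken from the collisions file; they stand in for a categorical, a numerical, and an ordinal field.

```python
import csv
import io

# Made-up sample data; the real file has many more columns and rows.
sample = io.StringIO(
    "location_type,person_count,severity\n"
    "Block,2,1\n"
    ",3,2\n"
    "Intersection,,1\n"
    "Block,4,\n"
    ",1,3\n"
)

reader = csv.reader(sample)
header = next(reader)                    # first row holds the column names
missing = [0] * len(header)              # one counter per column

for row in reader:
    for i, value in enumerate(row):
        if value.strip() == '':          # an empty cell counts as missing
            missing[i] += 1

for name, count in zip(header, missing):
    print(name, count)
```

To try this on the real data, replace the StringIO object with an open file handle.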


Read and Transform

In this section, we will read and transform the data.

Code to open a file and print the first column

file_handle = open('sdot_collisions_seattle.csv', 'r')   # open the csv file
for line in file_handle:                                 # loop through the file one line at a time.
    line_clean = line.strip()                            # remove the newline character at end of line
    line_clean_list = line_clean.split(',')              # split the line into parts using split
    print(line_clean_list[0])                            # print the first column of data for this row.
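One caveat about split(','): it also splits on commas that sit inside a quoted field, which shifts every later column over by one. Python's built-in csv module handles quoting. The line below is a made-up example, not a row from the collisions file.

```python
import csv
import io

# A made-up line whose second field contains a comma inside quotes.
line = '1001,"5TH AVE NE, SEATTLE",2\n'

naive = line.strip().split(',')          # splits inside the quotes too
print(len(naive))                        # 4 pieces, one too many

reader = csv.reader(io.StringIO(line))   # csv.reader respects the quotes
parsed = next(reader)
print(len(parsed))                       # 3 fields
print(parsed[1])                         # 5TH AVE NE, SEATTLE
```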


Code to open a file, select a subset of rows and columns, and write to a new file

Figure out what this code does!

(You'll want to open this window in a wide browser)

file_handle = open('sdot_collisions_seattle.csv', 'r')   # open the csv file
header = file_handle.readline()                          # read the header row so the loop skips it
output_handle = open('sdot_collisitions_transformed.csv', 'w')    # NOTE: this will overwrite any existing file with this name
for line in file_handle:                                 # loop through the file one line at a time.
    line_clean = line.strip()                            # remove the newline character at end of line
    line_clean_list = line_clean.split(',')              # split the line into parts using split
    if int(line_clean_list[17]) > 0:                     # If the integer value in column 17 is greater than zero then...
        output_handle.write(line)                        # write that line to the output.
output_handle.close()                                    # Close the output file after the loop.
file_handle.close()                                      # Close the input file as well.
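For comparison, here is one possible variation on the same filter. It is a sketch, not the version used in class: it uses "with" so the files are closed automatically, uses the csv module, copies the header row into the output, and skips blank cells instead of crashing on them. The file names, the sample rows, and the COLUMN position are all invented so the example runs on its own.

```python
import csv
import os
import tempfile

# Build a tiny made-up input file so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'input.csv')
out_path = os.path.join(tmp_dir, 'output.csv')
with open(in_path, 'w', newline='') as f:
    f.write('id,injuries\n1,0\n2,3\n3,\n4,1\n')

COLUMN = 1   # hypothetical position of the numeric column to filter on

with open(in_path, 'r', newline='') as infile, open(out_path, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))            # copy the header row unchanged
    for row in reader:
        value = row[COLUMN].strip()
        if value != '' and int(value) > 0:   # skip blanks instead of crashing
            writer.writerow(row)

# Read the result back to check what was kept.
with open(out_path, 'r', newline='') as f:
    kept = list(csv.reader(f))
print(kept)
```

Only the header and the rows with a positive value in the chosen column end up in the output file.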


Analyze

We will answer together whether accidents involving cyclists or pedestrians are more likely to result in an injury.
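One way to set up this comparison is to compute, for each group, the fraction of collisions that had at least one injury. The sketch below runs on invented rows with invented column names (ped_count, cyclist_count, injuries), so its numbers say nothing about the real answer; the real file's column names need to be looked up first.

```python
import csv
import io

# Invented sample data: one row per collision.
sample = io.StringIO(
    "ped_count,cyclist_count,injuries\n"
    "1,0,1\n"
    "1,0,0\n"
    "0,1,1\n"
    "0,1,1\n"
    "0,0,0\n"
    "1,0,1\n"
    "0,1,0\n"
    "0,1,1\n"
)

def injury_rate(rows, column):
    """Fraction of collisions with column > 0 that had at least one injury."""
    involved = [r for r in rows if int(r[column]) > 0]
    if not involved:                     # avoid dividing by zero
        return None
    injured = [r for r in involved if int(r['injuries']) > 0]
    return len(injured) / len(involved)

rows = list(csv.DictReader(sample))      # each row becomes a dict keyed by column name
ped_rate = injury_rate(rows, 'ped_count')
cyclist_rate = injury_rate(rows, 'cyclist_count')
print(ped_rate, cyclist_rate)
```

A difference between the two rates on a small sample is not proof on its own; how many collisions fall into each group matters too.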