Community Data Science Course (Sprint 2019)/Day 7 Notes
We will be discussing this data set.
- One of the most important qualities of the Scientific Revolution was that results were broadly shared, so new results could build on top of existing knowledge.
- Repeatability is the key to science (even data science): your results are only scientific if they are repeatable by a third party.
Today's Lecture
Let's go end to end on a data question: are there factors that predict injuries and fatalities in automobile accidents?
- Download data
- Explore the data: find missing values, identify categorical, numerical, ordinal data fields
- Transform (filter, project)
- Analyze
Download
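import os
import requests

# Download the collisions data set as a CSV file
url = 'https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv'
response = requests.get(url)

# open() does not expand '~', so build the full path to the Desktop first
path = os.path.expanduser('~/Desktop/collisions.csv')

# response.content is an attribute holding raw bytes, so write in binary mode ('wb')
filehandle = open(path, 'wb')
filehandle.write(response.content)
filehandle.close()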
Opening a file is new. Note that "open" can open a file for reading or for writing. Be careful: opening an existing file for writing erases its contents, and you cannot get them back.
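To see the warning about write mode in action, here is a minimal sketch that works on a throwaway file in a temporary directory, so nothing real is overwritten. The file name scratch.txt and the text written to it are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the course data).
scratch_dir = tempfile.mkdtemp()
scratch_path = os.path.join(scratch_dir, 'scratch.txt')

# First write: 'w' creates the file.
handle = open(scratch_path, 'w')
handle.write('first version\n')
handle.close()

# Second write: 'w' empties the file before writing, so the first line is lost.
handle = open(scratch_path, 'w')
handle.write('second version\n')
handle.close()

# 'a' (append) keeps what is already there and adds to the end.
handle = open(scratch_path, 'a')
handle.write('appended line\n')
handle.close()

# Read the file back to see what survived.
handle = open(scratch_path, 'r')
contents = handle.read()
handle.close()
print(contents)
```

Only the second version and the appended line remain. If you want to add to a file instead of replacing it, open it with 'a' rather than 'w'.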
Explore
Open the file in Excel. What columns seem to be missing sometimes?
Find a categorical, numerical, and ordinal data field.
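Looking for missing values by eye works for a first pass, but it can also be done in Python. The following is a rough sketch using the standard csv module. The sample rows and the column names (ID, WEATHER, INJURIES) are made up for illustration and are not taken from the real file, and count_missing is a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows standing in for the real file; column names are hypothetical.
sample_csv = io.StringIO(
    'ID,WEATHER,INJURIES\n'
    '1,Clear,0\n'
    '2,,1\n'
    '3,Raining,\n'
    '4,,2\n'
)

def count_missing(file_handle):
    """Return a dictionary mapping each column name to its number of empty cells."""
    reader = csv.DictReader(file_handle)
    missing = {}
    for name in reader.fieldnames:
        missing[name] = 0
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # Count cells that are absent, empty, or only whitespace.
            if value is None or value.strip() == '':
                missing[name] += 1
    return missing

missing_counts = count_missing(sample_csv)
print(missing_counts)
```

To try this on the downloaded data, pass a handle from open(...) in place of sample_csv.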
Read and Transform
In this section, we will read and transform the data.
Code to open a file and print the first column
file_handle = open('sdot_collisions_seattle.csv', 'r')  # open the csv file
for line in file_handle:  # loop through the file one line at a time.
    line_clean = line.strip()  # remove the newline character at end of line
    line_clean_list = line_clean.split(',')  # split the line into parts using split
    print(line_clean_list[0])  # print the first column of data for this row.
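One thing to watch for: splitting on commas gives the wrong columns whenever a field itself contains a comma inside quotes, which is common in address or description fields. Here is a small sketch comparing split with Python's built-in csv module. The sample line is made up for illustration and may not match the real file.

```python
import csv
import io

# Made-up line: the second field contains a comma inside quotes.
sample_line = '101,"AURORA AVE N, SEATTLE",Clear\n'

# Plain split cuts the quoted field into two pieces.
split_fields = sample_line.strip().split(',')

# csv.reader understands the quotes and keeps the field whole.
reader = csv.reader(io.StringIO(sample_line))
csv_fields = next(reader)

print(len(split_fields), split_fields)
print(len(csv_fields), csv_fields)
```

The split version reports four fields while csv.reader reports three, so any column number after the quoted field would be off by one with split.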
Code to open a file, select a subset of rows and columns, and write to a new file
Figure out what this code does!
(You'll want to open this window in a wide browser)
file_handle = open('sdot_collisions_seattle.csv', 'r')  # open the csv file
header = file_handle.readline()  # read the header line so the loop starts at the first data row
output_handle = open('sdot_collisitions_transformed.csv', 'w')  # NOTE this will overwrite
for line in file_handle:  # loop through the file one line at a time.
    line_clean = line.strip()  # remove the newline character at end of line
    line_clean_list = line_clean.split(',')  # split the line into parts using split
    if int(line_clean_list[17]) > 0:  # If the integer value in column 17 is greater than zero then...
        output_handle.write(line)  # write that line to the output.
output_handle.close()  # Close the output file after the loop.
file_handle.close()  # Close the input file too.
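The code above keeps every column and does not write the header line to the new file. As a sketch of the "filter, project" idea from the start of the lecture, the version below keeps the header, selects rows by a named column, and writes only the chosen columns. It uses the csv module instead of split, skips rows where the filter value is missing, and runs on made-up sample data. The function name filter_and_project and the column names are hypothetical.

```python
import csv
import io

def filter_and_project(in_handle, out_handle, filter_column, keep_columns):
    """Copy rows where filter_column > 0, writing only the columns in keep_columns."""
    reader = csv.reader(in_handle)
    writer = csv.writer(out_handle, lineterminator='\n')
    header = next(reader)  # first row holds the column names
    filter_index = header.index(filter_column)
    keep_indexes = [header.index(name) for name in keep_columns]
    writer.writerow([header[i] for i in keep_indexes])  # keep the header in the output
    for row in reader:
        value = row[filter_index]
        # Skip rows where the value is missing or not a whole number.
        if not value.strip().isdigit():
            continue
        if int(value) > 0:  # filter: keep only rows with a positive count
            writer.writerow([row[i] for i in keep_indexes])  # project: chosen columns only

# Made-up sample data; the real file has different columns.
sample_input = io.StringIO(
    'ID,WEATHER,INJURIES\n'
    '1,Clear,0\n'
    '2,Raining,1\n'
    '3,Clear,\n'
    '4,Overcast,2\n'
)
sample_output = io.StringIO()
filter_and_project(sample_input, sample_output, 'INJURIES', ['ID', 'INJURIES'])
result = sample_output.getvalue()
print(result)
```

With real files, pass handles from open(...) for reading and writing instead of the io.StringIO objects, and remember that the output file will be overwritten.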