Community Data Science Course (Spring 2016)/Day 4 Notes

We will be discussing this data set.


 * One of the most important qualities of the Scientific Revolution was that results were broadly shared, so new results could build on top of existing knowledge.
 * Repeatability is the key to science (even data science): your results are only scientific if they are repeatable by a third party.

Today's Lecture

Let's go end to end on a data question: are there factors that predict injuries and fatalities in automobile accidents?
 * Download data
 * Explore the data: find missing values, identify categorical, numerical, ordinal data fields
 * Transform (filter, project)
 * Analyze


 * 1) Find data. Let's start at Seattle Data.
 * 2) Brief aside: Socrata
 * 3) Download it.
 * 4) Write exploratory scripts
 * 5) Using open() to open a file in Python.
 * 6) In groups, explore one of these questions by building a histogram with a Python dictionary:
 * 7) What kinds of values occur in  ?
 * 8) What kinds of values occur in  ?
 * 9) What kinds of values occur in  ?
 * 10) What kinds of values occur in  ?
 * 11) What kinds of values occur in  ?
 * 12) What kinds of values occur in  ?
 * 13) (Challenge) Make a histogram of collisions by day in the data. Notice anything odd?
 * 14) Write a transformation script and draw a conclusion. You can work in groups. Example questions to answer:
 * 15) Are incidents involving pedestrians or cyclists more likely to result in fatalities?
 * 16) Are incidents more likely to occur in rainy or wet conditions?
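One possible way to build a histogram with a Python dictionary is sketched below: each distinct value becomes a key, and the count of how often it occurs becomes the value. The list of values is made up for the example; in the exercise it would come from one column of the collision file. The function name build_histogram is an illustrative choice, not something defined elsewhere in the course.

```python
def build_histogram(values):
    """Count how many times each distinct value occurs."""
    histogram = {}
    for value in values:
        # .get() returns 0 the first time a value is seen
        histogram[value] = histogram.get(value, 0) + 1
    return histogram

# Hypothetical values standing in for one column read from the file.
values = ["Dry", "Wet", "Dry", "", "Wet", "Dry"]
histogram = build_histogram(values)

# Print the most common values first; repr() makes empty strings visible.
for value, count in sorted(histogram.items(), key=lambda item: item[1], reverse=True):
    print(repr(value), count)
```

Printing with repr() helps when exploring, since blank or whitespace-only values are otherwise easy to overlook.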

Code to open a file

 file_handle = open('sdot_collisions_seattle.csv', 'r')  # open the csv file
 for line in file_handle:                                # loop through the file one line at a time
     line_clean = line.strip()                           # remove the newline character at the end of the line
     line_clean_list = line_clean.split(',')             # split the line into parts on commas
     print(line_clean_list[0])                           # print the first column of data for this row
 file_handle.close()                                     # close the file after the loop
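One caveat with splitting on commas: if a field is quoted and itself contains a comma, a plain split(',') cuts it into two pieces and shifts every later column. The sketch below compares split(',') with Python's standard csv module on a single invented row; whether the real file contains such fields is an assumption worth checking while exploring.

```python
import csv
import io

# An invented row whose second field contains a comma inside quotes.
line = '101,"5TH AVE, PINE ST",2\n'

naive = line.strip().split(',')               # splits inside the quoted field too
parsed = next(csv.reader(io.StringIO(line)))  # respects the quotes

print(len(naive), naive)
print(len(parsed), parsed)
```

If the two approaches give different column counts on real rows, the csv module is the safer choice for the scripts that follow.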

Code to open a file, select a subset of rows and columns, and write to a new file

 file_handle = open('sdot_collisions_seattle.csv', 'r')  # open the csv file
 header = file_handle.readline()                         # read the header row so the loop starts at the data
 output_handle = open('sdot_collisitions_transformed.csv', 'w')   # NOTE this will overwrite an existing file
 for line in file_handle:                                # loop through the file one line at a time
     line_clean = line.strip()                           # remove the newline character at the end of the line
     line_clean_list = line_clean.split(',')             # split the line into parts on commas
     if int(line_clean_list[8]) > 0:                     # if the integer value in column index 8 is greater than zero then...
         output_handle.write(line)                       # write that line to the output

 output_handle.close()                                   # close the output file after the loop
 file_handle.close()                                     # close the input file as well
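A variation on the transformation script, offered as a sketch rather than the course's own solution: it uses with so both files are closed automatically, uses the csv module for parsing, copies the header into the output, and skips rows where the chosen column is blank or not a number. The function name filter_rows, the sample data, and the temporary file names are all invented for the example.

```python
import csv
import os
import tempfile

def filter_rows(in_path, out_path, column_index):
    """Copy the header plus every row whose value at column_index is > 0."""
    kept = 0
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))          # copy the header row
        for row in reader:
            try:
                value = int(row[column_index])
            except (ValueError, IndexError):   # blank, non-numeric, or short row
                continue
            if value > 0:
                writer.writerow(row)
                kept += 1
    return kept

# Small demonstration on invented data written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'sample.csv')
out_path = os.path.join(tmp_dir, 'sample_filtered.csv')
with open(in_path, 'w', newline='') as f:
    f.write("id,injuries\n1,0\n2,3\n3,\n4,1\n")

kept = filter_rows(in_path, out_path, 1)
print(kept, "rows kept")
```

Keeping the header in the output means the transformed file can be explored with the same scripts as the original.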