Community Data Science Course (Spring 2017)/Day 4 Notes

Revision as of 05:14, 21 April 2017 by Guyrt

We will be discussing this data set.

  • One of the most important qualities of the Scientific Revolution was that results were broadly shared, so new results could build on top of existing knowledge.
  • Repeatability is the key to science (even data science): your results are only scientific if they are repeatable by a third party.

Today's Lecture

Let's go end to end on a data question: are there factors that predict injuries and fatalities in automobile accidents?

  • Download data
  • Explore the data: find missing values, identify categorical, numerical, ordinal data fields
  • Transform (filter, project)
  • Analyze

  1. Find data. Let's start at Seattle Data.
    1. brief aside: Socrata
  2. Download it.
  3. Write exploratory scripts
    1. Using open to open a file in Python.
    2. In groups, explore one of these questions by building a histogram with a Python dictionary:
      1. Build a histogram for COLLISIONTYPE.
      2. Build a histogram for ADDRTYPE.
      3. Build a histogram for JUNCTIONTYPE.
      4. Build a histogram for SDOT_COLDESC.
      5. Build a histogram for WEATHER.
      6. Build a histogram for SEVERITYDESC.
      7. (Challenge) Make a histogram of collisions by day in the data. Notice anything odd?
  4. Write transformation script and make a conclusion. You can work in groups. Example conclusions:
    1. Are incidents involving pedestrians or cyclists more likely to result in fatalities?
    2. Are incidents more likely to occur in rainy or wet conditions?
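
The histogram exercises above can be sketched roughly as follows. This is only a minimal illustration, not the course's reference solution: the helper name build_histogram and the tiny sample rows (including the ROW_ID column) are made up for the example, and it assumes no field contains a comma of its own.

```python
def build_histogram(lines, column_name):
    """Count how often each value appears in one column of comma-separated lines."""
    lines = iter(lines)
    header = next(lines).strip().split(',')        # first line holds the column names
    column_index = header.index(column_name)       # position of the column we want
    histogram = {}                                 # maps value -> number of rows with that value
    for line in lines:
        fields = line.strip().split(',')
        value = fields[column_index]
        histogram[value] = histogram.get(value, 0) + 1   # start at 0 the first time we see a value
    return histogram

# Made-up sample standing in for the real file (values are illustrative only).
sample_lines = [
    "ROW_ID,WEATHER,ADDRTYPE\n",
    "1,Raining,Block\n",
    "2,Clear,Intersection\n",
    "3,Raining,Block\n",
    "4,,Block\n",                                  # a missing WEATHER value shows up as ''
]

weather_histogram = build_histogram(sample_lines, "WEATHER")
for value, count in sorted(weather_histogram.items()):
    print(repr(value), count)
```

Because the function only needs something it can loop over line by line, an open file handle can be passed in place of sample_lines. Printing with repr makes empty (missing) values visible in the output.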


Code to open a file

file_handle = open('sdot_collisions_seattle.csv', 'r')   # open the csv file
for line in file_handle:                                 # loop through the file one line at a time.
    line_clean = line.strip()                            # remove the newline character at end of line
    line_clean_list = line_clean.split(',')              # split the line into parts using split
    print(line_clean_list[0])                            # print the first column of data for this row.
file_handle.close()                                      # close the file once the loop is done.
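
Splitting on ',' works as long as no field contains a comma of its own. If some fields turn out to be quoted text with commas inside, one alternative (a swapped-in technique, not what the notes above use) is Python's built-in csv module. The sketch below uses a made-up in-memory sample, with hypothetical ROW_ID and LOCATION columns, to show the difference.

```python
import csv
import io

# Made-up sample text; the second field of the first data row contains a comma inside quotes.
sample_text = (
    'ROW_ID,LOCATION,WEATHER\n'
    '1,"5TH AVE, SEATTLE",Raining\n'
    '2,PINE ST,Clear\n'
)

# Naive approach: split(',') cuts the quoted field in two.
naive_fields = sample_text.splitlines()[1].split(',')

# csv.reader understands the quotes and keeps the field whole.
reader = csv.reader(io.StringIO(sample_text))
header = next(reader)          # first row is the header
rows = list(reader)            # remaining rows, each a list of strings

print(len(naive_fields), naive_fields)
print(len(rows[0]), rows[0])
```

With a real file, the handle returned by open can be passed to csv.reader in place of io.StringIO(sample_text); the csv documentation recommends opening the file with newline='' in that case.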


Code to open a file, select a subset of rows and columns, and write to a new file

(You'll want to open this window in a wide browser)

file_handle = open('sdot_collisions_seattle.csv', 'r')   # open the csv file
header = file_handle.readline()                          # read the header row so the loop below skips it
output_handle = open('sdot_collisions_transformed.csv', 'w')    # NOTE: this will overwrite the file if it already exists
for line in file_handle:                                 # loop through the file one line at a time.
    line_clean = line.strip()                            # remove the newline character at end of line
    line_clean_list = line_clean.split(',')              # split the line into parts using split
    if int(line_clean_list[8]) > 0:                      # if the integer value in the column at index 8 is greater than zero then...
        output_handle.write(line)                        # write that line to the output.
output_handle.close()                                    # close the output file after the loop.
file_handle.close()                                      # close the input file too.
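
The script above filters rows but keeps every column. A rough sketch of doing both steps, filtering rows and projecting down to a few columns, might look like the following. The function name filter_and_project, the column positions, and the sample rows are all made up for illustration, and the sketch assumes the filter column always holds an integer (an empty value would make int fail).

```python
def filter_and_project(lines, filter_index, keep_indexes):
    """Keep rows whose filter column is > 0, and only the columns listed in keep_indexes."""
    lines = iter(lines)
    header = next(lines).strip().split(',')
    output = [','.join(header[i] for i in keep_indexes)]    # projected header row
    for line in lines:
        fields = line.strip().split(',')
        if int(fields[filter_index]) > 0:                   # filter: keep rows with a positive count
            output.append(','.join(fields[i] for i in keep_indexes))   # project: keep chosen columns
    return output

# Made-up sample standing in for the real file (values are illustrative only).
sample_lines = [
    "ROW_ID,WEATHER,INJURIES\n",
    "1,Raining,0\n",
    "2,Clear,2\n",
    "3,Overcast,1\n",
]

transformed = filter_and_project(sample_lines, filter_index=2, keep_indexes=[1, 2])
for row in transformed:
    print(row)
```

To save the result, each returned row can be written to an output file with a trailing newline, for example output_handle.write(row + '\n').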