Community Data Science Course (Spring 2016)/Day 4 Notes

From CommunityData
Latest revision as of 16:52, 20 April 2017

We will be discussing this data set: https://data.seattle.gov/Transportation/SDOT-Collisions/v7k9-7dn4

  • One of the most important qualities of the Scientific Revolution was that results were broadly shared, so new results could build on top of existing knowledge.
  • Repeatability is the key to science (even data science): your results are only scientific if they are repeatable by a third party.

Today's Lecture

Let's go end to end on a data question: are there factors that predict injuries and fatalities in automobile accidents?

  • Download data
  • Explore the data: find missing values, identify categorical, numerical, ordinal data fields
  • Transform (filter, project)
  • Analyze
  1. Find data. Let's start at Seattle Data (https://data.seattle.gov).
    1. Brief aside: Socrata
  2. Download it.
  3. Write exploratory scripts
    1. Using open to open a file in Python.
    2. In groups, explore one of these questions by building a histogram with a Python dictionary:
      1. What kinds of values occur in COLLISIONTYPE?
      2. What kinds of values occur in ADDRTYPE?
      3. What kinds of values occur in JUNCTIONTYPE?
      4. What kinds of values occur in SDOT_COLDESC?
      5. What kinds of values occur in WEATHER?
      6. What kinds of values occur in SEVERITYDESC?
      7. (Challenge) Make a histogram of collisions by day in the data. Notice anything odd?
  4. Write a transformation script and draw a conclusion. You can work in groups. Example questions to answer:
    1. Are incidents involving pedestrians or cyclists more likely to result in fatalities?
    2. Are incidents more likely to occur in rainy or wet conditions?
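
The "histogram with a Python dictionary" exercise above can be sketched roughly as follows. This is only one possible approach, not the solution used in class. The function name build_histogram and the sample values are made up for illustration; in the exercise, the values would come from one column of the csv file.

```python
def build_histogram(values):
    """Count how many times each value occurs, using a plain dictionary."""
    counts = {}
    for value in values:
        if value in counts:
            counts[value] = counts[value] + 1   # seen before: add one to its count
        else:
            counts[value] = 1                   # first time we see this value
    return counts


# Made-up sample values standing in for one column of the csv file.
# The empty string stands in for a missing value.
sample_values = ['Clear', 'Raining', 'Clear', '', 'Overcast', 'Clear']

value_counts = build_histogram(sample_values)
for value in sorted(value_counts):
    print(repr(value), value_counts[value])
```

Printing with repr makes empty strings visible, which helps when looking for missing values in a column.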


Code to open a file

file_handle = open('sdot_collisions_seattle.csv', 'r')   # open the csv file for reading
for line in file_handle:                                 # loop through the file one line at a time.
    line_clean = line.strip()                            # remove the newline character at end of line
    line_clean_list = line_clean.split(',')              # split the line into parts at each comma
    print(line_clean_list[0])                            # print the first column of data for this row.
file_handle.close()                                      # close the file once the loop is done.
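
One caveat about splitting on commas: if a field itself contains a comma inside quotes, split(',') will cut that field in two and shift every later column. Whether the collisions file has such fields is an assumption worth checking during exploration. The sketch below uses a made-up line to compare split with Python's standard csv module, which respects the quotes.

```python
import csv
import io

# Made-up line in which the second field contains a comma inside quotes.
sample_line = '1,"5TH AVE, NORTH SIDE",Clear\n'

# Splitting on every comma also splits inside the quoted field.
naive_parts = sample_line.strip().split(',')

# csv.reader accepts any iterable of lines and keeps the quoted field whole.
csv_parts = next(csv.reader(io.StringIO(sample_line)))

print(len(naive_parts), naive_parts)
print(len(csv_parts), csv_parts)
```

If the two counts differ for lines in the real file, column indexes computed with split cannot be trusted for those rows.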


Code to open a file, select a subset of rows and columns, and write to a new file

file_handle = open('sdot_collisions_seattle.csv', 'r')   # open the csv file
header = file_handle.readline()                          # read the header row so the loop starts at the first data row
output_handle = open('sdot_collisions_transformed.csv', 'w')    # NOTE this will overwrite any existing file with this name
output_handle.write(header)                              # keep the header row in the output file
for line in file_handle:                                 # loop through the file one line at a time.
    line_clean = line.strip()                            # remove the newline character at end of line
    line_clean_list = line_clean.split(',')              # split the line into parts using split
    if int(line_clean_list[8]) > 0:                      # if the integer value in column 8 is greater than zero then...
        output_handle.write(line)                        # write that line to the output.
output_handle.close()                                    # Close the output file after the loop.
file_handle.close()                                      # Close the input file too.
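
The script above filters rows but keeps every column. A minimal sketch of doing both steps (filter rows, then project onto a few columns) by column name is shown below. It is an illustration under stated assumptions, not the course's reference solution: the function name filter_and_project, the column name PEDCOUNT, and the sample rows are all hypothetical, and in-memory text stands in for the real files.

```python
import csv
import io


def filter_and_project(input_file, output_file, count_column, keep_columns):
    """Copy rows whose count_column is greater than zero, keeping only keep_columns."""
    reader = csv.DictReader(input_file)          # uses the header row for column names
    writer = csv.DictWriter(output_file, fieldnames=keep_columns,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    rows_written = 0
    for row in reader:
        raw = (row[count_column] or '').strip()
        if raw == '':                            # skip rows with a missing value
            continue
        if int(raw) > 0:
            writer.writerow(row)                 # extra columns are dropped by the writer
            rows_written += 1
    return rows_written


# Made-up data; PEDCOUNT is an assumed column name used only for illustration.
sample_input = io.StringIO(
    'ID,WEATHER,PEDCOUNT,SEVERITYDESC\n'
    '1,Clear,0,Property Damage Only Collision\n'
    '2,Raining,1,Injury Collision\n'
    '3,Clear,,Injury Collision\n'
)
sample_output = io.StringIO()

kept = filter_and_project(sample_input, sample_output,
                          'PEDCOUNT', ['WEATHER', 'SEVERITYDESC'])
print(kept)
print(sample_output.getvalue())
```

With real files, the same function can be given handles from open(..., newline=''), which is how the csv module documentation recommends opening csv files.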