Community Data Science Course (Spring 2016)/Day 4 Notes
From CommunityData
Latest revision as of 14:52, 20 April 2017
We will be discussing the SDOT Collisions data set (https://data.seattle.gov/Transportation/SDOT-Collisions/v7k9-7dn4).
- One of the most important qualities of the Scientific Revolution was that results were broadly shared, so new results could build on top of existing knowledge.
- Repeatability is the key to science (even data science): your results are only scientific if they are repeatable by a third party.
Today's Lecture
Let's go end to end on a data question: are there factors that predict injuries and fatalities in automobile accidents?
- Download data
- Explore the data: find missing values, identify categorical, numerical, ordinal data fields
- Transform (filter, project)
- Analyze
- Find data. Let's start at Seattle Data (https://data.seattle.gov).
  - Brief aside: Socrata
- Download it.
- Write exploratory scripts
  - Using open to open a file in Python.
  - In groups, explore one of these questions by building a histogram with a Python dictionary:
    - What kinds of values occur in COLLISSIONTYPE?
    - What kinds of values occur in ADDRTYPE?
    - What kinds of values occur in JUNCTIONTYPE?
    - What kinds of values occur in SDOT_COLDESC?
    - What kinds of values occur in WEATHER?
    - What kinds of values occur in SEVERITYDESC?
    - (Challenge) Make a histogram of collisions by day in the data. Notice anything odd?
- Write a transformation script and draw a conclusion. You can work in groups. Example questions to answer:
  - Are incidents involving pedestrians or cyclists more likely to result in fatalities?
  - Are incidents more likely to occur in rainy or wet conditions?
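The histogram exercise above can be sketched roughly as follows. This is a minimal sketch, not the course's reference solution: the helper name column_histogram and the sample rows are made up for illustration, and it uses a plain split(','), which assumes that no field contains a comma.

```python
def column_histogram(lines, column_name):
    """Count how often each value occurs in one column of comma-separated lines."""
    rows = iter(lines)
    header = next(rows).strip().split(',')   # first line holds the column names
    index = header.index(column_name)        # position of the column we want
    counts = {}                              # dictionary: value -> number of rows
    for line in rows:
        fields = line.strip().split(',')
        value = fields[index]
        if value == '':
            value = '(missing)'              # make empty cells visible in the histogram
        counts[value] = counts.get(value, 0) + 1
    return counts

# Made-up sample rows for illustration only; NOT real SDOT data.
sample_lines = [
    'WEATHER,SEVERITYDESC\n',
    'Raining,Injury Collision\n',
    'Clear,Property Damage Only Collision\n',
    'Raining,Property Damage Only Collision\n',
    ',Injury Collision\n',
]

weather_counts = column_histogram(sample_lines, 'WEATHER')
# Print the most frequent values first.
for value, count in sorted(weather_counts.items(), key=lambda item: item[1], reverse=True):
    print(value, count)
```

To try it on the downloaded data, an open file can be passed in place of the sample list, for example column_histogram(open('sdot_collisions_seattle.csv', 'r'), 'WEATHER').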
Code to open a file
    file_handle = open('sdot_collisions_seattle.csv', 'r')  # open the csv file
    for line in file_handle:  # loop through the file one line at a time.
        line_clean = line.strip()  # remove the newline character at end of line
        line_clean_list = line_clean.split(',')  # split the line into parts using split
        print(line_clean_list[0])  # print the first column of data for this row.
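Splitting on ',' works only while no field contains a comma of its own. CSV files often wrap such fields in quotes, and Python's standard csv module handles that case. The sketch below compares the two approaches on a made-up line; the LOCATION column and its value are hypothetical and not taken from the SDOT data.

```python
import csv
import io

# Made-up sample text for illustration only; NOT real SDOT data.
# LOCATION is a hypothetical column whose value contains a comma inside quotes.
sample_text = 'ADDRTYPE,LOCATION\nIntersection,"5TH AVE, SEATTLE"\n'

# Plain split cuts the quoted field in two.
naive_fields = sample_text.splitlines()[1].split(',')

# csv.reader understands the quotes. io.StringIO stands in for an open file here.
reader = csv.reader(io.StringIO(sample_text))
header = next(reader)                  # first row: the column names
parsed_rows = [row for row in reader]  # remaining rows, each a list of strings

print(len(naive_fields))   # number of pieces produced by split(',')
print(parsed_rows[0])      # fields produced by csv.reader
```

With a real file, the csv documentation recommends opening it with newline='', as in open('sdot_collisions_seattle.csv', 'r', newline=''), and passing that handle to csv.reader.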
Code to open a file, select a subset of rows and columns, and write to a new file
    file_handle = open('sdot_collisions_seattle.csv', 'r')  # open the csv file
    header = file_handle.readline()  # read the header row so the loop below skips it
    output_handle = open('sdot_collisions_transformed.csv', 'w')  # NOTE: this will overwrite any existing file with this name
    for line in file_handle:  # loop through the remaining lines one at a time.
        line_clean = line.strip()  # remove the newline character at end of line
        line_clean_list = line_clean.split(',')  # split the line into parts using split
        if int(line_clean_list[8]) > 0:  # if the integer value in column 8 (counting from 0) is greater than zero then...
            output_handle.write(line)  # write that line to the output.
    output_handle.close()  # close the output file after the loop.
    file_handle.close()  # close the input file as well.
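The script above selects rows but writes every column of each selected row. A possible way to also select columns (project) is sketched below with csv.DictReader and csv.DictWriter from the standard library. The function name filter_and_project, the INJURIES column, and the sample rows are assumptions made for illustration; substitute the real column names from the header of the downloaded file.

```python
import csv
import io

def filter_and_project(in_file, out_file, keep_columns, count_column):
    """Copy rows whose count_column is greater than zero, keeping only keep_columns."""
    reader = csv.DictReader(in_file)     # maps each row to {column name: value}
    writer = csv.DictWriter(out_file, fieldnames=keep_columns)
    writer.writeheader()                 # keep a header row in the output
    rows_written = 0
    for row in reader:
        raw = row[count_column].strip()
        if raw == '':                    # skip rows where the count is missing
            continue
        if int(raw) > 0:                 # filter: select rows
            # project: keep only the requested columns
            writer.writerow({name: row[name] for name in keep_columns})
            rows_written += 1
    return rows_written

# Made-up sample data; INJURIES is a hypothetical column name, NOT confirmed from the data set.
sample_in = io.StringIO(
    'WEATHER,SEVERITYDESC,INJURIES\n'
    'Raining,Injury Collision,2\n'
    'Clear,Property Damage Only Collision,0\n'
    'Overcast,Injury Collision,\n'
)
sample_out = io.StringIO()   # stands in for an output file opened with 'w'
written = filter_and_project(sample_in, sample_out, ['WEATHER', 'SEVERITYDESC'], 'INJURIES')
print(written)
print(sample_out.getvalue())
```

Looking columns up by name rather than by position keeps the script working if the column order in the download changes, and skipping empty cells avoids the ValueError that int('') would raise.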