Community Data Science Course (Spring 2023)/Week 7 coding challenges

This week the coding challenges are limited to playing around with data in Pandas. Although it's likely that you could do some of these in Excel, I want you to do these in Pandas.

Getting started with pandas and Seattle "complaint" data

  1. Go to data.seattle.gov and download the dataset on code complaints and violations. (It is, by the way, also available through an API!) In this case, just go to Export→CSV and download the file. It should be about 80 megabytes.
  2. As is always the case, spend some time poking around the website and reading documentation to get a sense of what kind of data this is, where it is coming from, who generated it, and so on.
  3. Put the CSV file into a directory and create a new Jupyter notebook in the same directory (remember that it is comma-separated, not tab-separated). Load that into Python as a pandas DataFrame.
  4. Show some parts of the DataFrame and make sure things have worked.
  5. Print out the number of rows and the number of columns to get a sense of how much data you're working with.
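
If you want a starting point for steps 3 to 5, here is a minimal sketch. It reads a tiny made-up CSV from memory so that it runs anywhere; the rows and the filename in the comment are assumptions for illustration, and only the column names come from this page. The real export will likely have many more columns.

```python
import io

import pandas as pd

# Tiny made-up stand-in for the downloaded file, so this sketch runs anywhere.
# With the real data you would pass the path of your download instead, e.g.
# pd.read_csv("code_complaints.csv") (hypothetical filename).
sample_csv = io.StringIO(
    "RecordType,OriginalZip,OpenDate,Description\n"
    "Complaint,98112,2023-01-03,Overgrown vegetation\n"
    "Complaint,98101,2023-01-05,Junk storage in yard\n"
    "Case,98112,2023-01-10,Unpermitted construction\n"
)

# read_csv assumes comma-separated values by default.
complaints = pd.read_csv(sample_csv)

# Look at the first few rows to check that the load worked.
print(complaints.head())

# .shape is a (rows, columns) tuple.
n_rows, n_cols = complaints.shape
print(n_rows, "rows and", n_cols, "columns")
```

In a notebook, `complaints.sample(5)` and `complaints.dtypes` are also handy for spot checks.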

You know the type

  1. Take a look at the "RecordType" column which describes the kinds of complaints that come in. What are the types of categories? How many are in each category? Show both with numbers and with a simple visualization (a histogram, perhaps?). For each category, print out the "Description" of several examples. What kinds of things are included?
  2. Build a new second dataset that includes only the "RecordType" and "OriginalZip" columns.
  3. Use this second dataset to filter the dataset down to just rows from your zipcode. If you don't live in Seattle, you can just use my zip code (98112) which covers north Capitol Hill and Montlake or you can pick an area you think is interesting from this map.
    1. Now look at the number and proportion of different types of records in this subset.
    2. Be ready to explain whether the distribution in this zip code is different from the distribution in Seattle overall. If so, how is it different?
    3. Once again, print out the "Description" of several examples from each category. What kinds of things are included?
  4. Use pandas to write out the two-column dataset to TSV (with tabs instead of commas).
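
The steps in this section could be sketched roughly as follows. The rows below are made up, and the sketch assumes "OriginalZip" was read as an integer; in the real file it may come in as a float or a string, so check its type and adjust the comparison accordingly.

```python
import io

import pandas as pd

# Made-up rows standing in for the real export (illustration only).
complaints = pd.DataFrame({
    "RecordType": ["Complaint", "Complaint", "Case", "Complaint", "Case"],
    "OriginalZip": [98112, 98101, 98112, 98112, 98103],
    "Description": ["weeds", "junk in yard", "no permit", "noise", "unsafe deck"],
})

# How many rows fall into each category?
type_counts = complaints["RecordType"].value_counts()
print(type_counts)
# In a notebook, type_counts.plot(kind="bar") draws a simple bar chart
# (this needs matplotlib installed).

# Print a couple of example descriptions per category.
for record_type, group in complaints.groupby("RecordType"):
    print(record_type, group["Description"].head(2).tolist())

# Second dataset with just the two columns of interest.
type_zip = complaints[["RecordType", "OriginalZip"]]

# Keep only rows from one zip code (assumes integer zip codes).
my_zip = type_zip[type_zip["OriginalZip"] == 98112]

# Counts and proportions within that zip code.
zip_counts = my_zip["RecordType"].value_counts()
zip_props = my_zip["RecordType"].value_counts(normalize=True)
print(zip_counts)
print(zip_props)

# Write the two-column dataset as tab-separated text. A StringIO buffer is
# used here so the sketch leaves no file behind; pass a filename instead,
# e.g. type_zip.to_csv("type_zip.tsv", sep="\t", index=False).
buffer = io.StringIO()
type_zip.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
```

Comparing `zip_props` against `complaints["RecordType"].value_counts(normalize=True)` is one way to see how a zip code differs from the city overall.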

It's about time

First, let's return to the full dataset and not the two-column subset.

  1. Create a new time series pandas Series that contains the zip code and that uses the "OpenDate" column as the index. Be sure to check the type of the "OpenDate" column and make sure it's in the pandas datetime format.
  2. Use the .resample() function associated with your pandas time series to compute the number of complaints per week overall, and visualize the result as a time series plot.
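
One possible shape for these two steps, again on made-up rows. This is only a sketch: it assumes the dates are in a format that `pd.to_datetime` can parse without extra hints, which you should verify against the real file.

```python
import pandas as pd

# Made-up rows standing in for the full dataset (illustration only).
complaints = pd.DataFrame({
    "OpenDate": ["2023-01-02", "2023-01-03", "2023-01-04",
                 "2023-01-10", "2023-01-17", "2023-01-18"],
    "OriginalZip": [98112, 98101, 98112, 98103, 98112, 98101],
})

# Dates usually load as plain strings, so convert them to datetimes first.
complaints["OpenDate"] = pd.to_datetime(complaints["OpenDate"])
print(complaints["OpenDate"].dtype)

# A Series of zip codes indexed by the date each record was opened.
zip_by_date = complaints.set_index("OpenDate")["OriginalZip"].sort_index()

# "W" groups the index into weekly bins (ending on Sundays by default);
# .size() counts the rows in each bin, including rows with a missing zip.
weekly_counts = zip_by_date.resample("W").size()
print(weekly_counts)

# In a notebook, weekly_counts.plot() draws the weekly series as a line
# chart (this needs matplotlib installed).
```

Using `.count()` instead of `.size()` would skip records whose zip code is missing, which may or may not be what you want.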

You've got questions, you've got answers

Ask and answer a question not on this list using this data. Be sure to:

  1. Explicitly state the question
  2. Include the pandas code to answer it
  3. Write a sentence or two explaining what you found and interpreting the finding.