Community Data Science Course (Spring 2023)/Week 7 coding challenges

From CommunityData
Revision as of 20:16, 12 May 2023 by Josh

This week the coding challenges are limited to playing around with data in Pandas. Although it's likely that you could do some of these in Excel, I want you to do these in Pandas.

Getting started with pandas and Seattle "complaint" data

  1. Go to data.seattle.gov and download the dataset on code complaints and violations. (It is, by the way, also available in an API!) In this case, just go to Export→CSV and download the file. It should be about 80 megabytes.
  2. As is always the case, spend some time poking around the website and reading documentation to get a sense of what kind of data this is, where it is coming from, who generated it, and so on.
  3. Put the CSV file into a directory and create a new Jupyter notebook in the same directory (remember that it is comma-separated, not tab-separated). Load that into Python as a pandas DataFrame.
  4. Show some parts of the dataframe and make sure your load command worked.
  5. Print out the number of rows and the number of columns to get a sense of how much data you're working with.
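
The loading and inspection steps above might look roughly like the sketch below. It is only an illustration: the three rows are invented toy data standing in for the real download, and the variable names (`toy_csv`, `complaints`) are hypothetical. With the real export, you would pass the path of the CSV file you saved to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded file. These rows are INVENTED for
# illustration and are not taken from the real Seattle dataset.
toy_csv = io.StringIO(
    "RecordType,OriginalZip,OpenDate,Description\n"
    "Complaint,98112,2023-01-02,Overgrown vegetation blocking sidewalk\n"
    "Complaint,98103,2023-01-03,Junk storage in yard\n"
    "Case,98112,2023-01-10,Construction without permit\n"
)

# Comma is the default separator, so no sep= argument is needed for a CSV.
complaints = pd.read_csv(toy_csv)

# Show the first few rows to check that the load worked.
print(complaints.head())

# .shape is a (rows, columns) tuple.
n_rows, n_cols = complaints.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real file, `complaints.dtypes` is also worth a look, since columns such as zip codes and dates may not load with the types you expect.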

You know the type

  1. Take a look at the "RecordType" column which describes the kinds of complaints that come in. What are the types of categories? How many are in each category? Show both with numbers and with a simple visualization (a histogram, perhaps?). For each category, print out the "Description" of several examples. What kinds of things are included?
  2. Build a new dataset that includes only the "RecordType", "OriginalZip", and "Description" columns.
  3. Filter this second dataset down to just the rows from your zip code. If you don't live in Seattle, you can use my zip code (98112), which covers north Capitol Hill and Montlake, or you can pick an area you think is interesting from this map.
    1. Now look at the number and proportion of different types of records in this subset.
    2. Be ready to explain whether the distribution in this zip code is different from the distribution in Seattle overall. If so, how is it different?
    3. Once again, print out the "Description" of several examples from each category. What kinds of things are included?
  4. Use pandas to write out the three-column dataset to TSV (with tabs instead of commas).
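
One possible shape for these steps is sketched below, again on invented toy rows rather than the real data. The column names come from the list above; the record type values, the descriptions, and the variable names are hypothetical. The sketch also assumes the zip codes were parsed as integers; if yours loaded as floats or strings, the comparison value needs to match that type.

```python
import io

import pandas as pd

# INVENTED toy rows standing in for the real dataset.
complaints = pd.DataFrame({
    "RecordType": ["Complaint", "Complaint", "Case", "Complaint", "Case"],
    "OriginalZip": [98112, 98103, 98112, 98112, 98103],
    "Description": [
        "Overgrown vegetation",
        "Junk storage in yard",
        "Construction without permit",
        "Vacant building open to entry",
        "Unpermitted tree removal",
    ],
    "OpenDate": ["2023-01-02", "2023-01-03", "2023-01-10",
                 "2023-01-11", "2023-01-12"],
})

# How many records fall in each category, citywide.
type_counts = complaints["RecordType"].value_counts()
print(type_counts)
# type_counts.plot(kind="bar")  # simple visualization; requires matplotlib

# Keep only the three columns of interest.
subset = complaints[["RecordType", "OriginalZip", "Description"]]

# Filter to one zip code (assumes integer zip codes, see note above).
my_zip = subset[subset["OriginalZip"] == 98112]

# Counts and proportions within that zip code.
zip_counts = my_zip["RecordType"].value_counts()
zip_props = my_zip["RecordType"].value_counts(normalize=True)
print(zip_counts)
print(zip_props)

# A few example descriptions from each category.
for record_type, group in my_zip.groupby("RecordType"):
    print(record_type, group["Description"].head(3).tolist())

# Write tab-separated output. A filename works in place of the buffer;
# an in-memory buffer is used here only to keep the sketch self-contained.
tsv_buffer = io.StringIO()
subset.to_csv(tsv_buffer, sep="\t", index=False)
```

Comparing `zip_props` with `complaints["RecordType"].value_counts(normalize=True)` puts the zip code and the citywide distribution on the same scale.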

It's about time

First, let's return to the full dataset and not the three-column subset.

  1. Create a new time series (use a pandas Series) that contains the zip code and that uses the "OpenDate" column as its index. Be sure to check the type of the "OpenDate" column and make sure it's in the pandas datetime format.
  2. Use the .resample() method of your pandas time series to compute the number of complaints per week overall, and visualize this with a time series plot.
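
A minimal sketch of the two time series steps, once more on invented toy rows, is below. The variable names are hypothetical, and the dates are made up so that they fall in two different calendar weeks.

```python
import pandas as pd

# INVENTED toy rows; only the column names come from the assignment.
complaints = pd.DataFrame({
    "OpenDate": ["2023-01-02", "2023-01-03", "2023-01-10",
                 "2023-01-11", "2023-01-12"],
    "OriginalZip": [98112, 98103, 98112, 98112, 98103],
})

# Dates read from a CSV usually arrive as plain strings; check, then convert.
print(complaints["OpenDate"].dtype)
complaints["OpenDate"] = pd.to_datetime(complaints["OpenDate"])

# A Series of zip codes indexed by the date each record was opened.
zip_by_date = complaints.set_index("OpenDate")["OriginalZip"].sort_index()

# Weekly bins ("W" means weeks ending on Sunday); .size() counts the rows
# in each bin, including rows whose zip code is missing.
weekly_counts = zip_by_date.resample("W").size()
print(weekly_counts)
# weekly_counts.plot()  # time series plot; requires matplotlib
```

Using `.count()` instead of `.size()` would skip records with a missing zip code, so the choice depends on whether you want those records in the weekly totals.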

You've got questions, you've got answers

Ask and answer a question not on this list using this data. Be sure to:

  1. Explicitly state the question
  2. Include the pandas code to answer it
  3. Write a sentence or two explaining what you found and interpret the finding.