Community Data Science Course (Spring 2023)/Week 7 lecture notes

From CommunityData
Revision as of 02:11, 9 May 2023 by Benjamin Mako Hill (talk | contribs) (→‎Dates)

Getting Started

Pandas is a module which gives us access to two very powerful data structures:

  1. Series: a one-dimensional array of data
  2. DataFrame: a way of storing tabular data

Q: When do you use pandas and when do you not?

Getting set up

Importing pandas:

  • import pandas as pd
  • If your data has any numbers, it's very normal to also import a closely related package called numpy: import numpy as np

Quick intro to Series:

  • we can build Series from lists, and they basically act like lists in terms of indexing, etc.
  • we can also give them an index (my_series.index), which makes them a lot like dictionaries (except we can have multiple things with the same key!!)
  • they come with cool built in functions (like .value_counts() and .hist())
  • tons of functions associated with Series: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
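
To make the Series bullets above concrete, here is a minimal sketch. The values and index labels are made up for illustration; only pd.Series, .iloc, .index, and .value_counts() come from pandas itself.

```python
import pandas as pd

# Build a Series from a plain list (values are made up for illustration).
my_series = pd.Series([10, 20, 20, 30])

# Positional indexing and slicing work much like they do on a list.
first_value = my_series.iloc[0]
middle = my_series.iloc[1:3]

# Adding an index makes the Series behave a bit like a dictionary,
# except that the same key ("b" here) can appear more than once.
my_series.index = ["a", "b", "b", "c"]
a_value = my_series["a"]    # a single value
b_values = my_series["b"]   # a Series holding both "b" entries

# Built-in helper: how many times does each value occur?
counts = my_series.value_counts()
print(counts)
```

Looking up a duplicated key returns a Series rather than a single value, which is the main way this differs from a dictionary.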

Quick introduction to DataFrames:

A more common situation, however, is that we import DataFrames from some other tabular data source.

Our first real data frame

If your data is already in TSV, you can import it with something like:

rockband_revs = pd.read_csv("rock_bands_wp_revisions.tsv", delimiter="\t")
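
If you don't have the TSV file handy, the same call can be tried on a small in-memory stand-in. This is only a sketch: the rows below are invented, and the column names (title, size, minor, anon) are the ones these notes mention, not necessarily everything in the real file.

```python
import io

import pandas as pd

# A tiny, made-up stand-in for the real TSV file (tab-separated text).
tsv_text = (
    "title\tsize\tminor\tanon\n"
    "Queen\t1200\tFalse\tTrue\n"
    "Rush\t800\tTrue\tFalse\n"
)

# read_csv accepts a file-like object as well as a filename;
# delimiter="\t" tells it the columns are separated by tabs.
sample_revs = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(sample_revs)
print(sample_revs.dtypes)
```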

There is also a read_json() function, but it's a bit less magic than you might like. We'll come back to it and I'll show you how to build a DataFrame up on your own.

  • df.head()
  • df.tail()
  • df.shape (an attribute, not a function, so no parentheses)
  • we can get one column by treating the DataFrame like a dictionary (or multiple columns by passing in a list of column names!)
  • they also have indexes (although we'll skip that for now)
  • we can get a cell using .iloc[] (notice the square brackets because it's indexing in), which also takes slice notations
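
A quick sketch of the operations in the list above, run on a small made-up DataFrame (the rows are invented; swap in rockband_revs to try it on the real data):

```python
import pandas as pd

# Made-up data standing in for the revisions table.
df = pd.DataFrame({
    "title": ["Queen", "Queen", "Rush", "Rush", "Rush", "Heart"],
    "size": [1200, 1250, 800, 820, 790, 500],
    "anon": [True, False, False, True, False, False],
})

first_rows = df.head(2)          # first two rows
last_rows = df.tail(2)           # last two rows
n_rows, n_cols = df.shape        # shape is a (rows, columns) tuple

sizes = df["size"]               # one column: a Series
subset = df[["title", "size"]]   # list of columns: a DataFrame

cell = df.iloc[0, 1]             # row 0, column 1 (by position)
block = df.iloc[1:3, 0:2]        # slices work on both axes
```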

Every column of a DataFrame is just a Series! This means we can use things like:

  • .value_counts()
  • .hist()
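
For example, a minimal sketch on made-up data. The .hist() call is left as a comment because it needs matplotlib installed to draw anything.

```python
import pandas as pd

# Made-up data; any DataFrame column would work the same way.
df = pd.DataFrame({
    "title": ["Queen", "Queen", "Rush", "Rush", "Rush", "Heart"],
    "size": [1200, 1250, 800, 820, 790, 500],
})

# A column is a Series, so Series methods work on it directly.
title_counts = df["title"].value_counts()
print(title_counts)

# df["size"].hist() would draw a histogram of sizes (requires matplotlib).
```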

Pandas Series can be used in boolean comparisons, which give back a Series of True/False values:

  • we can use booleans to filter!
    • e.g., we can get a list of revisions for a particular title (i.e., subsetting)
  • check the type of minor and anon (wow, pandas is smart!)
    • we can use those as subsets too
    • and we can combine them with & (and), | (or), and ~ (not); Python's own and, or, and not keywords don't work on Series
    • and if we only want a couple of columns, we can select them at the same time
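
Here is a sketch of the filtering steps above on a small made-up table (the rows are invented; the column names follow the ones used in these notes). When a comparison such as revs["size"] > 800 is combined with another condition, wrap each comparison in parentheses, because & and | bind more tightly than comparison operators.

```python
import pandas as pd

# Made-up rows standing in for the revisions data.
revs = pd.DataFrame({
    "title": ["Queen", "Queen", "Rush", "Rush", "Rush", "Heart"],
    "size": [1200, 1250, 800, 820, 790, 500],
    "minor": [False, True, False, True, True, False],
    "anon": [True, False, False, True, False, False],
})

# A comparison gives a boolean Series, which we can use to subset rows.
is_rush = revs["title"] == "Rush"
rush_revs = revs[is_rush]

# Columns that are already boolean can be used as filters directly.
print(revs["minor"].dtype)
anon_minor = revs[revs["anon"] & revs["minor"]]   # both true
not_anon = revs[~revs["anon"]]                    # negation

# .loc filters rows and picks columns in one step.
picked = revs.loc[revs["anon"] | revs["minor"], ["title", "size"]]
```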

Group by

groupby: with aggregations for mean, count, sum

  • rockband_revs.groupby("title")["size"].mean()
  • rockband_revs.groupby("title")["size"].sum()
  • rockband_revs.groupby("title")["size"].count()
  • what's returned in each case is just a Series, indexed by title... and we know how to save that and index into it!

... and we can combine these together into a new dataframe

new_df = pd.DataFrame({'edits' : rockband_revs.groupby("title")["size"].count(),
                       'mean.size' : rockband_revs.groupby("title")["size"].mean()})
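
The same pattern on a small made-up table, so the numbers are easy to check by hand. The last step shows named aggregation as an alternative way to get both columns from one groupby; that is my suggestion rather than something from the lecture, and mean_size is just an illustrative column name.

```python
import pandas as pd

# Made-up rows standing in for rockband_revs.
revs = pd.DataFrame({
    "title": ["Queen", "Queen", "Rush", "Rush", "Rush", "Heart"],
    "size": [1200, 1250, 800, 820, 790, 500],
})

# Each aggregation returns a Series indexed by title.
edits_per_title = revs.groupby("title")["size"].count()
mean_size = revs.groupby("title")["size"].mean()
print(edits_per_title["Rush"])   # index into it by title

# Combine the two Series into a new DataFrame (they share the same index).
summary = pd.DataFrame({"edits": edits_per_title, "mean.size": mean_size})

# Alternative: one groupby with named aggregations.
alt = revs.groupby("title")["size"].agg(edits="count", mean_size="mean")
print(summary)
```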

Dates and Times in Pandas

And pandas has a bunch of magic related to dates:

  • the pd.to_datetime() function

but the real reason to do this is that pandas has a bunch of tools for handling time series, which give you access to a bunch of cool stuff:

  • a pandas time series is just a Series whose index is made of pandas datetimes
  • .resample(), for example by day or week, which bins the data by time period and works a lot like groupby
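
A minimal sketch of the date handling described above. The column name timestamp and all the values are assumptions made for illustration; check what the date column is actually called in your own data.

```python
import pandas as pd

# Made-up revisions with timestamps stored as plain strings.
revs = pd.DataFrame({
    "timestamp": [
        "2023-05-01 10:00:00",
        "2023-05-01 15:30:00",
        "2023-05-02 09:00:00",
        "2023-05-04 12:00:00",
    ],
    "size": [100, 200, 300, 400],
})

# Convert the strings into real pandas datetimes.
revs["timestamp"] = pd.to_datetime(revs["timestamp"])

# A time series is just a Series with a datetime index.
size_ts = revs.set_index("timestamp")["size"]

# Bin by day ("D") or week ("W") and aggregate, much like groupby.
daily_edits = size_ts.resample("D").count()
daily_mean = size_ts.resample("D").mean()
weekly_edits = size_ts.resample("W").count()
print(daily_edits)
```

Days with no edits still show up in the daily result, with a count of 0 (and a mean of NaN), which is one way resample differs from a plain groupby.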

Building up and exporting a pandas DataFrame

  • I typically just build a list of dictionaries in the normal way and then pass it to pd.DataFrame()
  • Then I export with .to_csv(filename, sep="\t")
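
A sketch of that workflow with made-up rows. It writes to an in-memory buffer so it leaves no file behind; passing a filename to .to_csv() works the same way. The index=False argument is my addition: it leaves the row numbers out of the output, so drop it if you want them.

```python
import io

import pandas as pd

# Build up a list of dictionaries, one per row (values are made up).
rows = []
for title, size in [("Queen", 1200), ("Rush", 800)]:
    rows.append({"title": title, "size": size})

# Each dictionary becomes a row; the keys become column names.
out_df = pd.DataFrame(rows)

# Export as tab-separated text.
buffer = io.StringIO()
out_df.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```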