Community Data Science Course (Spring 2023)/Week 7 lecture notes

Getting Started

Pandas is a module which gives us access to two very powerful data structures:

Series: a one-dimensional array of data
DataFrame: a way of storing tabular data

Q: When do you use pandas and when do you not?

Getting set up

Importing pandas:

import pandas as pd
If your data has any numbers, it's very normal to also import a closely related package called numpy: import numpy as np

Quick intro to Series:

we can build Series from lists and they basically act like lists in terms of indexing, slicing, etc.
we can also add an index to them (my_series.index), which makes them a lot like dictionaries (except we can have multiple things with the same key!!)
they come with cool built in functions (like .value_counts() and .hist())
tons of functions associated with Series: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
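
To make the points above concrete, here is a minimal sketch. The values and index labels are made up for illustration; only my_series and .value_counts() come from the notes.

```python
import pandas as pd

# Made-up values, purely for illustration.
my_series = pd.Series([10, 20, 20, 30])

first = my_series[0]            # with the default index, this looks like list indexing
middle = my_series.iloc[1:3]    # .iloc slices by position, just like a list slice

# Replace the default 0..n-1 index with labels; duplicate labels are allowed.
my_series.index = ["a", "b", "b", "c"]
both_b = my_series["b"]         # a Series holding both "b" entries

counts = my_series.value_counts()   # how often each value occurs
print(counts)
```

Note that looking up a duplicated label returns a Series rather than a single value, which is the main way this differs from a dictionary.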

Quick introduction to DataFrames:

a DataFrame is a collection of Series that are all of the same length
we can build them up from a dictionary of lists or a list of dictionaries (I typically prefer the latter but either works)!
tons of functions associated with DataFrames: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
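
A quick sketch of both ways of building a DataFrame by hand. The band names and sizes are invented for illustration.

```python
import pandas as pd

# Made-up rows, purely for illustration.

# A dictionary of lists: one key per column.
from_lists = pd.DataFrame({
    "title": ["Band A", "Band B"],
    "size": [1200, 3400],
})

# A list of dictionaries: one dictionary per row.
from_dicts = pd.DataFrame([
    {"title": "Band A", "size": 1200},
    {"title": "Band B", "size": 3400},
])

# Both approaches produce the same table.
print(from_lists.equals(from_dicts))
```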

A more common situation, however, is that we import them from some other tabular data source.

Our first real data frame

If your data is already in TSV, you can import it with something like:

rockband_revs = pd.read_csv("rock_bands_wp_revisions.tsv", delimiter="\t")

There is also a read_json() function, but it's a bit less magic than you might like. We'll come back to it and I'll show you how to build a DataFrame up on your own.

df.head()
df.tail()
df.shape

we can get one column by treating it like a dictionary (or multiple by passing in a list of columns!)
they also have indexes (although we'll skip that for now)
we can get a cell using .iloc[] (notice the square brackets, because we're indexing in), which also takes slice notation
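
Here is a small sketch of those lookups. It assumes a made-up stand-in for the real data with title, size, and anon columns; the real file has more rows and columns.

```python
import pandas as pd

# A made-up stand-in for the real data, purely for illustration.
df = pd.DataFrame({
    "title": ["Band A", "Band A", "Band B"],
    "size": [1200, 1250, 3400],
    "anon": [False, True, False],
})

n_rows, n_cols = df.shape              # shape is an attribute, so no parentheses
one_column = df["size"]                # a single column comes back as a Series
two_columns = df[["title", "size"]]    # a list of names gives back a DataFrame
cell = df.iloc[0, 1]                   # row 0, column 1, both by position
first_two_rows = df.iloc[0:2]          # slice notation works too
```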

Every column of a DataFrame is just a Series! This means we can use:

.value_counts()
.hist()
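
For example, with a made-up table (the data here is invented for illustration):

```python
import pandas as pd

# Made-up rows, purely for illustration.
df = pd.DataFrame({
    "title": ["Band A", "Band A", "Band B"],
    "size": [1200, 1250, 3400],
})

title_column = df["title"]
print(type(title_column))                       # a pandas Series

edits_per_title = title_column.value_counts()   # number of rows for each title
print(edits_per_title)

# df["size"].hist() would draw a histogram of sizes; it needs matplotlib installed.
```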

Pandas Series can be used in boolean expressions:

we can use booleans to filter!
- e.g., we can get a list of revisions for a particular title (i.e., subsetting)
check the type of minor and anon (wow, pandas is smart!)
- we can use those as subsets too
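- and we can combine them with and, or, and not (written as &, |, and ~ in pandas, because Python's own and/or/not keywords don't work element by element on a Series)!
- and if we only want a couple of columns, we can select them at the same time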
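
A minimal sketch of the filtering steps above, on made-up data. The column names title, size, minor, and anon follow the notes, but the rows are invented.

```python
import pandas as pd

# Made-up rows, purely for illustration.
df = pd.DataFrame({
    "title": ["Band A", "Band A", "Band B", "Band B"],
    "size": [1200, 1250, 3400, 3390],
    "minor": [False, True, False, True],
    "anon": [False, True, True, False],
})

is_band_a = df["title"] == "Band A"    # a boolean Series, one value per row
band_a_revs = df[is_band_a]            # keep only the rows where it is True

minor_revs = df[df["minor"]]           # boolean columns can filter directly

# Element-wise and / or / not are spelled &, | and ~.
minor_and_anon = df[df["minor"] & df["anon"]]
minor_or_anon = df[df["minor"] | df["anon"]]
not_minor = df[~df["minor"]]

# .loc takes a row filter and a list of columns together.
sizes_of_anon = df.loc[df["anon"], ["title", "size"]]
```

When combining comparisons rather than existing boolean columns, wrap each one in parentheses, e.g. (df["size"] > 1000) & (df["title"] == "Band A"), because & binds more tightly than == and >.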

Group by

groupby: with aggregations for mean, count, sum

rockband_revs.groupby("title")["size"].mean()
rockband_revs.groupby("title")["size"].sum()
rockband_revs.groupby("title")["size"].count()
what's returned in each case is just a Series, indexed by title... and we already know how to save that and index into it

... and we can combine these together into a new dataframe

new_df = pd.DataFrame({'edits' : rockband_revs.groupby("title")["size"].count(),
                       'mean.size' : rockband_revs.groupby("title")["size"].mean()})
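
The lines above need the real TSV file to run. As a self-contained sketch of the same pattern, here it is on a made-up table (revs and its rows are invented for illustration):

```python
import pandas as pd

# Made-up rows, purely for illustration.
revs = pd.DataFrame({
    "title": ["Band A", "Band A", "Band B"],
    "size": [1000, 2000, 3400],
})

mean_size = revs.groupby("title")["size"].mean()   # a Series indexed by title
band_a_mean = mean_size["Band A"]                  # index into it by title

# Series that share an index line up row by row in a new DataFrame.
summary = pd.DataFrame({
    "edits": revs.groupby("title")["size"].count(),
    "mean.size": mean_size,
})
print(summary)
```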

Dates and Times in Pandas

And pandas has a bunch of magic related to dates:

the pd.to_datetime() function

but the real reason to do this is that pandas has a bunch of tools for handling time series, which give you access to a bunch of cool stuff:

a pandas time series is just a series with an index that is a pandas date time
.resample(), for example by day or week, which sort of bins the data and works like groupby
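
A minimal sketch of both steps. The column name timestamp and the rows are assumptions made up for illustration; the notes don't say what the real date column is called.

```python
import pandas as pd

# Made-up rows; "timestamp" is a hypothetical column name.
revs = pd.DataFrame({
    "timestamp": ["2023-01-01 10:00", "2023-01-01 15:30", "2023-01-03 09:15"],
    "size": [1000, 2000, 3400],
})

# Turn the strings into real pandas datetimes.
revs["timestamp"] = pd.to_datetime(revs["timestamp"])

# A time series is just a Series whose index is made of datetimes.
size_over_time = revs.set_index("timestamp")["size"]

edits_per_day = size_over_time.resample("D").count()       # one bin per day
mean_size_per_week = size_over_time.resample("W").mean()   # one bin per week
print(edits_per_day)
```

Unlike groupby, resample also creates bins for days with no data, so January 2 shows up here with a count of 0.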

Building up and exporting a pandas DataFrame

I typically just build a list of dictionaries in the normal way and then use pd.DataFrame()
Then I export with .to_csv(filename, sep="\t")
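
Roughly, that workflow looks like the sketch below. The rows are made up, and it writes to an in-memory buffer instead of a real filename so that it runs anywhere; index=False is an optional choice that leaves the row numbers out of the output.

```python
import io
import pandas as pd

# Build up one dictionary per row (made-up data, purely for illustration).
rows = []
for title, size in [("Band A", 1200), ("Band B", 3400)]:
    rows.append({"title": title, "size": size})

out_df = pd.DataFrame(rows)

# Pass a filename instead of the buffer to write a real file.
buffer = io.StringIO()
out_df.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```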