Community Data Science Course (Spring 2023)/Week 7 lecture notes
Getting Started
Pandas is a module which gives us access to two very powerful data structures: Series and DataFrames.
Q: When do you use pandas and when do you not?
Getting set up
Importing pandas:
import pandas as pd
- If your data has any numbers, it's very normal to also import a closely related package called numpy:
import numpy as np
Quick intro to Series:
- we can build Series from lists, and they basically act like lists in terms of indexing, including slicing, etc.
- we can also add an index to them (my_series.index), which makes them a lot like dictionaries (except we can have multiple things with the same key!!)
- they come with cool built-in functions (like .value_counts() and .hist())
- tons of functions associated with Series: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
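The Series points above can be tried out with a minimal sketch like the following. The values and index labels are made up for illustration; only my_series and .value_counts() come from the notes.

```python
import pandas as pd

# Build a Series from a plain list; it gets a default 0, 1, 2, ... index.
my_series = pd.Series(["rock", "pop", "rock", "jazz"])
first_item = my_series[0]     # list-style indexing
first_two = my_series[0:2]    # list-style slicing

# Add our own index. Labels may repeat, unlike dictionary keys.
labeled = pd.Series([10, 20, 30], index=["a", "b", "a"])
both_a = labeled["a"]         # a Series holding both values labeled "a"

# value_counts() tallies how often each value appears.
genre_counts = my_series.value_counts()
```

Looking up a repeated label gives back every matching entry, which is the "multiple things with the same key" behavior.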
Quick introduction to DataFrames:
- a DataFrame is a collection of Series that are all of the same length
- we can build them up from a dictionary of lists or a list of dictionaries (I typically prefer the latter, but either works!)
- tons of functions associated with DataFrames: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
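As a rough sketch of the two construction routes mentioned above, here is the same tiny table built both ways. The band names and sizes are invented placeholders.

```python
import pandas as pd

# Route 1: a dictionary of lists (one key per column).
from_dict = pd.DataFrame({
    "title": ["Band A", "Band B"],
    "size": [1200, 3400],
})

# Route 2: a list of dictionaries (one dictionary per row).
from_rows = pd.DataFrame([
    {"title": "Band A", "size": 1200},
    {"title": "Band B", "size": 3400},
])

# Both routes should produce the same table.
same = from_dict.equals(from_rows)
```

The list-of-dictionaries route tends to fit naturally when rows are collected one at a time in a loop.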
A more common situation, however, is that we import them from some other tabular data source.
Our first real data frame
If your data is already in TSV, you can import it with something like:
rockband_revs = pd.read_csv("rock_bands_wp_revisions.tsv", delimiter="\t")
There is also a read_json() function, but it's a bit less magic than you might like. We'll come back to it and I'll show you how to build a DataFrame up on your own.
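To experiment with read_csv() without the TSV file on hand, one option is to read from an in-memory string instead. This is only a sketch: the rows below are invented, and the column names are a guess at a small subset of the real file (title, size, minor, and anon are the ones these notes mention).

```python
import io

import pandas as pd

# Stand-in for a file on disk: a few invented tab-separated rows.
tsv_text = (
    "title\tsize\tminor\tanon\n"
    "Band A\t1200\tTrue\tFalse\n"
    "Band B\t3400\tFalse\tTrue\n"
)

# read_csv() accepts any file-like object; sep="\t" means the same as delimiter="\t".
sample_revs = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# pandas infers column types while reading (numbers, booleans, text).
column_types = sample_revs.dtypes
```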
df.head()
df.tail()
df.shape  # no parentheses: shape is an attribute, not a method
- we can get one column by treating the DataFrame like a dictionary (or multiple by passing in a list of columns!)
- they also have indexes (although we'll skip that for now)
- we can get a cell using .iloc[] (notice the square brackets, because it's indexing in), which also takes slice notation
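Here is a small sketch of the inspection and selection steps listed above, run on a toy DataFrame. The rows are invented; the column names mirror ones used elsewhere in these notes.

```python
import pandas as pd

df = pd.DataFrame([
    {"title": "Band A", "size": 1200, "minor": False},
    {"title": "Band A", "size": 1250, "minor": True},
    {"title": "Band B", "size": 3400, "minor": False},
])

first_rows = df.head(2)           # first two rows
n_rows, n_cols = df.shape         # shape is a (rows, columns) tuple

sizes = df["size"]                # one column: comes back as a Series
subset = df[["title", "size"]]    # list of columns: comes back as a DataFrame

cell = df.iloc[0, 1]              # row 0, column 1, both by position
block = df.iloc[0:2, 0:2]         # slices work for rows and columns
```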
Every column of a DataFrame is just a Series! This means we can call Series functions like .value_counts() and .hist() on any column.
Pandas Series can be used in boolean expressions (e.g., comparisons), which give back a Series of True/False values:
- we can use booleans to filter!
- e.g., we can get a list of revisions for a particular title (i.e., subsetting)
- check the type of minor and anon (wow, pandas is smart!)
- we can use those as subsets too
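- and we can combine them with and, or, or not! (For Series these are written &, |, and ~ rather than Python's and, or, and not keywords, and each condition needs its own parentheses.)
- and if we only want a couple of columns, we can select them in the same step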
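The filtering ideas in this list could look something like the sketch below. The rows are invented for illustration, and using .loc to pick rows and columns together is one possible way to do the last step, not necessarily the one shown in lecture.

```python
import pandas as pd

revs = pd.DataFrame([
    {"title": "Band A", "size": 1200, "minor": False, "anon": True},
    {"title": "Band A", "size": 1250, "minor": True, "anon": False},
    {"title": "Band B", "size": 3400, "minor": False, "anon": False},
])

# A comparison on a column gives a boolean Series, which can filter rows.
is_band_a = revs["title"] == "Band A"
band_a_revs = revs[is_band_a]

# A column that is already boolean can be used as the filter directly.
minor_revs = revs[revs["minor"]]

# Combine conditions with & (and), | (or), ~ (not); parenthesize comparisons.
band_a_not_minor = revs[(revs["title"] == "Band A") & ~revs["minor"]]
minor_or_anon = revs[revs["minor"] | revs["anon"]]

# Filter rows and keep only a couple of columns in one step.
anon_sizes = revs.loc[revs["anon"], ["title", "size"]]
```

Python's own and, or, and not keywords raise an error on a Series because they expect a single True/False value, which is why the symbol operators are needed.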
Group by
groupby: with aggregations for mean, count, sum
rockband_revs.groupby("title")["size"].mean()
rockband_revs.groupby("title")["size"].sum()
rockband_revs.groupby("title")["size"].count()
- what's returned in each case is just a Series... and we know how to save that and index into it, right?
... and we can combine these together into a new dataframe
new_df = pd.DataFrame({'edits' : rockband_revs.groupby("title")["size"].count(),
'mean.size' : rockband_revs.groupby("title")["size"].mean()})
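To see the groupby pattern run end to end without the real data, here is a sketch on a toy table. The rows are invented; the column names and the 'edits' / 'mean.size' keys follow the code above.

```python
import pandas as pd

revs = pd.DataFrame([
    {"title": "Band A", "size": 1000},
    {"title": "Band A", "size": 2000},
    {"title": "Band B", "size": 3000},
])

# Each aggregation returns a Series indexed by the group key (title).
mean_size = revs.groupby("title")["size"].mean()
band_a_mean = mean_size["Band A"]    # index into it like any other Series

# Series that share an index line up row by row in a new DataFrame.
summary = pd.DataFrame({
    "edits": revs.groupby("title")["size"].count(),
    "mean.size": mean_size,
})
```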
Dates and Times in Pandas
And pandas has a bunch of magic related to dates:
- the pd.to_datetime() function converts a column of date strings into pandas datetimes
But the real reason to convert to datetimes is that pandas has a bunch of tools for handling time series, which give you access to a bunch of cool stuff:
- a pandas time series is just a Series with an index that is a pandas datetime
- .resample(), for example by day or week, which sort of bins the data and works like groupby
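A minimal sketch of the datetime workflow described above follows. The column name "timestamp" and the rows are assumptions made for illustration; the real dataset may name its date column differently.

```python
import pandas as pd

revs = pd.DataFrame({
    "timestamp": [
        "2023-05-01 10:00:00",
        "2023-05-01 18:30:00",
        "2023-05-03 09:15:00",
    ],
    "size": [1000, 1100, 1200],
})

# Convert the strings into real pandas datetimes.
revs["timestamp"] = pd.to_datetime(revs["timestamp"])

# A time series is a Series whose index is made of datetimes.
size_by_time = revs.set_index("timestamp")["size"]

# resample("D") bins by calendar day; count() tallies the rows in each bin.
edits_per_day = size_by_time.resample("D").count()
```

Unlike a plain groupby, resampling also creates bins for days with no rows at all, so gaps show up as zero counts. Swapping "D" for "W" would bin by week instead.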
Building up and exporting a pandas DataFrame
- I typically just build a list of dictionaries in the normal way and then pass it to pd.DataFrame()
- Then I export with .to_csv(filename, sep="\t")
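The build-then-export pattern above might look like this sketch. The rows are invented, and the output goes to an in-memory buffer rather than a real filename so the example leaves no file behind. Passing index=False is an extra choice made here (it leaves the row numbers out of the output) and is not part of the call shown above.

```python
import io

import pandas as pd

# Build up a list of dictionaries, one per row, in the normal way.
rows = []
for title, size in [("Band A", 1200), ("Band B", 3400)]:
    rows.append({"title": title, "size": size})

out_df = pd.DataFrame(rows)

# Export as TSV; a filename string would work in place of the buffer.
buffer = io.StringIO()
out_df.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
```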