Community Data Science Course (Spring 2023)/Week 7 lecture notes
Getting Started
Pandas is a module which gives us access to two very powerful data structures:
- Series: a one-dimensional array of data
- DataFrame: a way of storing tabular data
Q: When do you use pandas and when do you not?
Getting set up
Importing pandas:
import pandas as pd
- If your data has any numbers, it's very normal to also import a closely related package called numpy:
import numpy as np
Quick intro to Series:
- we can build Series from lists and they basically act like lists in terms of indexing, slicing, etc.
- we can also add an index to them (my_series.index), which makes them a lot like dictionaries (except we can have multiple things with the same key!!)
- they come with cool built-in functions (like .value_counts() and .hist())
- tons of functions associated with Series: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
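A minimal sketch of the Series points above. The values and index labels are made up for illustration; only pd.Series, .index, and .value_counts() come from the notes.

```python
import pandas as pd

# Build a Series from a plain list; it gets a default 0..n-1 index.
my_series = pd.Series(["rock", "pop", "rock", "jazz"])
print(my_series[0])      # look up the item labeled 0
print(my_series[1:3])    # slicing works like it does on a list

# Attach our own index. Unlike dictionary keys, labels may repeat.
labeled = pd.Series([10, 20, 30], index=["a", "b", "a"])
print(labeled.index)
print(labeled["a"])      # returns every row labeled "a"

# value_counts() tallies how often each value appears.
counts = my_series.value_counts()
print(counts)
```

.hist() is left out here only because it needs matplotlib installed to draw the plot.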
Quick introduction to DataFrames:
- a DataFrame is a collection of Series that are all of the same length
- we can build them up from a dictionary of lists or a list of dictionaries (I typically prefer the latter but either works)!
- tons of functions associated with DataFrames: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
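Here is a small sketch of both ways of building a DataFrame by hand. The band names and sizes are placeholder values, not real data.

```python
import pandas as pd

# Option 1: a dictionary of lists (one list per column).
from_dict = pd.DataFrame({
    "title": ["Band A", "Band B"],
    "size": [1200, 3400],
})

# Option 2: a list of dictionaries (one dictionary per row).
from_rows = pd.DataFrame([
    {"title": "Band A", "size": 1200},
    {"title": "Band B", "size": 3400},
])

# Both routes should give the same table.
print(from_dict)
print(from_rows)
print(from_dict.equals(from_rows))
```

The list-of-dictionaries form tends to fit well when rows are collected one at a time in a loop.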
A more common situation however is that we import them from some other tabular data source
Our first real data frame
If your data is already in TSV, you can import it with something like:
rockband_revs = pd.read_csv("rock_bands_wp_revisions.tsv", delimiter="\t")
There is also a read_json() function but it's a bit less magic than you might like. We'll come back and I'll show you how to build it up on your own.
df.head()
df.tail()
df.shape  # an attribute, not a method, so no parentheses
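To try these without the real file, here is a sketch that reads a tiny made-up TSV from an in-memory string. The io.StringIO stand-in and the rows are assumptions for illustration; with a real file you would pass the filename instead.

```python
import io
import pandas as pd

# Made-up rows standing in for a real TSV file on disk.
tsv_text = "title\tsize\nBand A\t1200\nBand B\t3400\nBand A\t1500\n"

# read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(df.head(2))   # first two rows
print(df.tail(1))   # last row
print(df.shape)     # (rows, columns); shape is an attribute, so no parentheses
```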
- we can get one column by treating it like a dictionary (or multiple by passing in a list of columns!)
- they also have indexes (although we'll skip that for now)
- we can get a cell using .iloc[] (notice the square brackets because it's indexing in), which also takes slice notation
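A quick sketch of column selection and .iloc[]. The table is a made-up stand-in that borrows the title, size, and anon column names from the lecture data.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Band A", "Band B", "Band A"],
    "size": [1200, 3400, 1500],
    "anon": [False, True, False],
})

# One column name gives back a Series.
sizes = df["size"]

# A list of column names gives back a smaller DataFrame.
subset = df[["title", "size"]]

# .iloc[] works on positions: row 1, column 1.
cell = df.iloc[1, 1]

# Slice notation picks a range of row positions.
first_two = df.iloc[0:2]

print(type(sizes), type(subset))
print(cell)
print(first_two)
```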
Every column of a DataFrame is just a Series! This means we can use:
- .value_counts()
- .hist()
Pandas Series can be used in boolean expressions:
- we can use booleans to filter!
- let's get a title subset
- check the type of minor and anon (wow, pandas is smart!)
- we can use those as subsets too
- and we can combine them with & (and), | (or), and ~ (not), wrapping each comparison in parentheses!
- and if we only want a couple of columns, we can combine that selection with the filter here
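A sketch of boolean filtering on a made-up table with the same column names as the lecture data. Note that pandas combines boolean Series with the element-wise operators &, |, and ~ rather than the Python keywords. Using .loc for the last step is just one way to filter rows and pick columns at once.

```python
import pandas as pd

revs = pd.DataFrame({
    "title": ["Band A", "Band B", "Band A", "Band B"],
    "size": [1200, 3400, 1500, 3300],
    "minor": [True, False, False, True],
    "anon": [False, True, False, False],
})

# A comparison gives a Series of True/False values...
is_band_a = revs["title"] == "Band A"
# ...which we can use to keep only the matching rows.
band_a = revs[is_band_a]

# Columns that hold True/False are stored as booleans.
print(revs["minor"].dtype)

# Combine conditions: & is "and", | is "or", ~ is "not".
# Comparisons need their own parentheses.
minor_not_anon = revs[revs["minor"] & ~revs["anon"]]
big_or_anon = revs[(revs["size"] > 3000) | revs["anon"]]

# Filter rows and keep only a couple of columns in one step.
band_a_sizes = revs.loc[is_band_a, ["title", "size"]]
print(band_a_sizes)
```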
Group by
groupby: with aggregations for mean, count, sum
rockband_revs.groupby("title")["size"].mean()
rockband_revs.groupby("title")["size"].sum()
rockband_revs.groupby("title")["size"].count()
- what's returned in each case is just a Series... we know how to save that and index into it, right?
... and we can combine these together into a new dataframe
new_df = pd.DataFrame({'edits' : rockband_revs.groupby("title")["size"].count(),
'mean.size' : rockband_revs.groupby("title")["size"].mean()})
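The same pattern on a tiny made-up table, so you can see the groupby result being saved, indexed, and combined. The rows are placeholders; the column names follow the lecture data.

```python
import pandas as pd

revs = pd.DataFrame({
    "title": ["Band A", "Band B", "Band A"],
    "size": [1200, 3400, 1500],
})

# The result is a Series whose index is the group key (title).
mean_size = revs.groupby("title")["size"].mean()
print(mean_size["Band A"])   # index in by title

# Several aggregations side by side become the columns of a new DataFrame.
summary = pd.DataFrame({
    "edits": revs.groupby("title")["size"].count(),
    "mean.size": mean_size,
})
print(summary)
```

This works because the two Series share the same index, so pandas lines the rows up by title.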
Dates
And pandas has a bunch of magic related to dates:
- the pd.to_datetime() function converts strings into pandas datetimes
But the real reason to do this is that pandas has a bunch of tools for handling time series, which give you access to a bunch of cool stuff:
- a pandas time series is just a Series with an index that is a pandas datetime
- .resample(), for example by day or week, which sort of bins the data and works like groupby
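A sketch of the date workflow on made-up rows. The "timestamp" column name is an assumption for illustration; the real data may call it something else.

```python
import pandas as pd

revs = pd.DataFrame({
    "timestamp": ["2023-05-01 10:00", "2023-05-01 18:30", "2023-05-03 09:15"],
    "size": [1200, 1500, 3400],
})

# Convert the strings into pandas datetimes.
revs["timestamp"] = pd.to_datetime(revs["timestamp"])

# A time series is a Series whose index holds the datetimes.
ts = revs.set_index("timestamp")["size"]

# Bin by calendar day and count the edits in each bin.
daily_edits = ts.resample("D").count()

# Bin by week and take the mean size in each bin.
weekly_mean = ts.resample("W").mean()

print(daily_edits)
print(weekly_mean)
```

Notice that resampling by day also creates a bin for the day with no edits, which a plain groupby on the date would skip.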
Building up and exporting a pandas DataFrame
- I typically just build a list of dictionaries in the normal way and then use pd.DataFrame()
- Then I export with .to_csv(filename, sep="\t")
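A sketch of the build-and-export workflow. The rows, the temporary directory, and the output filename are all made up for illustration. The index=False argument is my addition so the row numbers are not written as an extra column; leave it off to match the call above exactly.

```python
import os
import tempfile
import pandas as pd

# Build up a list of dictionaries, one per row, in an ordinary loop.
rows = []
for title, size in [("Band A", 1200), ("Band B", 3400)]:
    rows.append({"title": title, "size": size})

# Turn the list into a DataFrame.
out_df = pd.DataFrame(rows)

# Write it out as a TSV (hypothetical filename in a temporary directory).
out_path = os.path.join(tempfile.mkdtemp(), "example_output.tsv")
out_df.to_csv(out_path, sep="\t", index=False)

# Read it back in to check that the export worked.
round_trip = pd.read_csv(out_path, sep="\t")
print(round_trip)
```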