Community Data Science Course (Spring 2023)/Week 7 lecture notes

Getting Started
Pandas is a module which gives us access to two very powerful data structures:


 * 1) Series: a one-dimensional array of data
 * 2) DataFrame: a way of storing tabular data

Q: When do you use pandas and when do you not?

Getting set up
Importing pandas:


 * If your data has any numbers, it's very normal to also import a closely related package called numpy:
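A minimal sketch of the usual imports. The pd and np aliases are a widely used convention, not something these notes spell out, so treat them as an assumption:

```python
import numpy as np
import pandas as pd

# Print the installed versions as a quick check that both imports worked.
print(pd.__version__)
print(np.__version__)
```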

Quick intro to Series:


 * we can build Series from lists, and they basically act like lists in terms of indexing, slicing, etc.
 * we can also add an index to them, which makes them a lot like dictionaries (except we can have multiple things with the same key!)
 * they come with cool built-in functions
 * tons of functions associated with Series: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
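The points above can be sketched roughly as follows. The numbers and labels are made up, and mean() and max() are just two built-in methods picked for illustration; the notes do not say which ones were shown in class:

```python
import pandas as pd

# Build a Series from a plain list (made-up numbers).
s = pd.Series([10, 20, 30, 40])

first = s[0]       # with the default index, label 0 is the first item
middle = s[1:3]    # slicing works like it does on a list

# Add an index; unlike dictionary keys, labels are allowed to repeat.
t = pd.Series([1, 2, 3], index=["a", "b", "a"])
both_a = t["a"]    # a repeated label gives back a Series with every match

# A couple of built-in methods (chosen for illustration).
average = s.mean()
biggest = s.max()

print(middle)
print(both_a)
print(average, biggest)
```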

Quick introduction to DataFrames:


 * a DataFrame is a collection of Series that are all of the same length
 * we can build them up from a dictionary of lists or a list of dictionaries (I typically prefer the latter but either works)!
 * tons of functions associated with DataFrames: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
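Here is a small sketch of both ways of building a DataFrame by hand. The column names (title, edits) and the values are invented for the example:

```python
import pandas as pd

# Option 1: a dictionary of lists, one list per column.
df_from_dict = pd.DataFrame({
    "title": ["Apple", "Banana", "Cherry"],
    "edits": [12, 5, 8],
})

# Option 2: a list of dictionaries, one dictionary per row.
rows = [
    {"title": "Apple", "edits": 12},
    {"title": "Banana", "edits": 5},
    {"title": "Cherry", "edits": 8},
]
df_from_rows = pd.DataFrame(rows)

# Both routes produce the same table.
print(df_from_rows)
print(df_from_dict.equals(df_from_rows))
```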

A more common situation, however, is that we import them from some other tabular data source.

Our first real data frame
If your data is already in TSV, you can import it with something like:
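One way this might look, assuming read_csv with a tab separator is the call in question (the notes do not show the original line). To keep the sketch self-contained it reads from an in-memory string with made-up rows; "revisions.tsv" in the comment is a hypothetical filename:

```python
import io

import pandas as pd

# Stand-in for a TSV file on disk (made-up rows).
tsv_text = "title\tuser\tsize\nApple\talice\t120\nBanana\tbob\t95\n"

# sep="\t" tells read_csv that the columns are separated by tabs.
# With a real file you would pass its path instead, for example
# pd.read_csv("revisions.tsv", sep="\t") with a hypothetical filename.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df.head())
```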

There is also a read_json function, but it's a bit less magic than you might like. We'll come back and I'll show you how to build it up on your own.




 * we can get one column by treating the DataFrame like a dictionary (or multiple columns by passing in a list of column names)!
 * they also have indexes (although we'll skip that for now)
 * we can get a single cell by indexing in (notice the square brackets, because it's indexing rather than a function call), which also takes slice notation
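A rough sketch of these selections on a made-up table. The notes do not name which indexer was used for cells, so .loc (by label) and .iloc (by position) are both shown here as an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Apple", "Apple", "Banana"],
    "user": ["alice", "bob", "alice"],
    "size": [120, 130, 95],
})

one_column = df["title"]             # a single column comes back as a Series
two_columns = df[["title", "size"]]  # a list of names gives a smaller DataFrame

cell_by_label = df.loc[0, "user"]    # row label 0, column "user"
cell_by_position = df.iloc[2, 2]     # third row, third column
first_two_rows = df.iloc[0:2]        # slices work inside the brackets too

print(cell_by_label, cell_by_position)
print(first_two_rows)
```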

Every column of a DataFrame is just a Series! This means:
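In practice, anything we can do with a Series we can do with a column. A small sketch with made-up data; the size_kb column and the particular methods are illustrative choices, not taken from the original notes:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Apple", "Apple", "Banana"],
    "size": [120, 130, 95],
})

sizes = df["size"]
print(type(sizes))                         # a pandas Series

mean_size = sizes.mean()                   # Series methods work on a column
title_counts = df["title"].value_counts()  # how often each title appears
df["size_kb"] = df["size"] / 1000          # arithmetic applies to every row at once

print(df)
```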



Pandas Series can be used in boolean expressions:


 * we can use booleans to filter!
 * e.g., we can get a list of revisions for a particular title (i.e., subsetting)
 * check the type of minor and anon (wow, pandas is smart!)
 * we can use those as subsets too
 * and we can combine conditions with pandas' boolean operators!
 * and if we only want a couple of columns, we can select those in the same step
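The filtering steps above might look something like this. The table is made up (only the minor and anon column names come from the notes), and the &, | and ~ operators are my assumption about which operators were meant:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Apple", "Apple", "Banana", "Cherry"],
    "user": ["alice", "bob", "alice", "carol"],
    "minor": [True, False, False, True],
    "anon": [False, False, True, False],
})

is_apple = df["title"] == "Apple"   # a Series of True/False, one per row
apple_revisions = df[is_apple]      # keep only the rows where it is True

print(df["minor"].dtype)            # the column is stored as booleans

minor_revisions = df[df["minor"]]   # a boolean column can filter directly

# & is "and", | is "or", ~ is "not"; wrap each comparison in parentheses.
apple_and_minor = df[(df["title"] == "Apple") & df["minor"]]
minor_or_anon = df[df["minor"] | df["anon"]]
not_anon = df[~df["anon"]]

# Filter rows and pick columns in one step with .loc.
apple_users = df.loc[is_apple, ["title", "user"]]
print(apple_users)
```

Python's own and, or and not keywords do not work element by element on a Series, which is why the symbol operators are used here.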

Group by
groupby: with aggregations for mean, count, sum


 * what's returned in each case is just a Series... we know how to save that and index into it!
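A sketch of the three aggregations on a made-up table, grouping by a title column and aggregating a size column (both names are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Apple", "Apple", "Banana", "Banana", "Banana"],
    "size": [100, 200, 50, 70, 90],
})

# Each of these returns a Series indexed by title.
mean_size = df.groupby("title")["size"].mean()
n_revisions = df.groupby("title")["size"].count()
total_size = df.groupby("title")["size"].sum()

print(mean_size)
print(mean_size["Banana"])  # index into the result by group name
```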

... and we can combine these together into a new dataframe
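One way to do the combining, as a sketch: pass the grouped Series to pd.DataFrame as a dictionary and let pandas line them up on their shared index. The data and the summary column names are invented, and other approaches (such as pd.concat) would work too:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Apple", "Apple", "Banana", "Banana", "Banana"],
    "size": [100, 200, 50, 70, 90],
})

grouped = df.groupby("title")["size"]

# Each value is a Series indexed by title, so the rows align automatically.
summary = pd.DataFrame({
    "mean_size": grouped.mean(),
    "n_revisions": grouped.count(),
    "total_size": grouped.sum(),
})

print(summary)
```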

Dates and Times in Pandas
And pandas has a bunch of magic related to dates:


 * the  function
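As a sketch of the kind of date handling meant here, assuming pd.to_datetime is the function in question (the notes do not name it). The timestamp column and its values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Apple", "Apple", "Banana"],
    "timestamp": [
        "2023-05-01 10:00:00",
        "2023-05-01 18:30:00",
        "2023-05-03 09:15:00",
    ],
})

print(df["timestamp"].dtype)   # plain strings before conversion
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df["timestamp"].dtype)   # a datetime type after conversion

# The .dt accessor exposes the pieces of each date.
df["day_name"] = df["timestamp"].dt.day_name()
df["hour"] = df["timestamp"].dt.hour

print(df)
```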

But the real reason to do this is that pandas has a lot of tools for handling time series, which give you access to a bunch of cool stuff:


 * a pandas time series is just a series with an index that is a pandas date time
 * binning data by time period, for example by day or week, which works a lot like groupby
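A minimal sketch of both points, assuming resample is the method meant for the binning step (the notes do not name it). The timestamps and sizes are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-05-01 10:00", "2023-05-01 18:30",
        "2023-05-03 09:15", "2023-05-09 12:00",
    ]),
    "size": [100, 200, 50, 70],
})

# Put the datetimes in the index to get a time series.
ts = df.set_index("timestamp")["size"]

per_day = ts.resample("D").count()   # one bin per calendar day
per_week = ts.resample("W").sum()    # one bin per week

print(per_day)
print(per_week)
```

Days with no rows still get a bin (with a count of 0), which is one difference from a plain groupby.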

Building up and exporting a pandas DataFrame

 * I typically just build a list of dictionaries in the normal way and then turn it into a DataFrame
 * Then I export it to a file
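Roughly, that workflow could look like this. The notes do not show the exact calls, so pd.DataFrame(rows) and to_csv are assumptions here, as is the choice of a tab-separated output; "output.tsv" in the comment is a hypothetical filename. The sketch writes to an in-memory buffer so that it runs without touching the disk:

```python
import io

import pandas as pd

# Build up rows one dictionary at a time (made-up data).
rows = []
for title, size in [("Apple", 120), ("Banana", 95)]:
    rows.append({"title": title, "size": size})

df = pd.DataFrame(rows)

# With a real file you would pass a path instead of a buffer, for example
# df.to_csv("output.tsv", sep="\t", index=False) with a hypothetical filename.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)  # index=False leaves out the row labels
tsv_text = buffer.getvalue()

print(tsv_text)
```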