Community Data Science Course (Spring 2023)/Week 5 lecture notes
New concepts for the day:
- Defining functions
- Reading from files
- Breaking projects into multiple notebooks and steps
Stage 0: Coming up with a plan
I want to download page view data for three universities and present the total for each.
I'm going to split the work into three steps:
- collect the data from the web and write the raw JSON "payload" to a file
- read the data from the file and do whatever data extraction, cleaning, counting, etc. is needed; then write a TSV file
- open a TSV file and make a graph
Stage 1: Getting data
I want to build a dataset on how popular something is using the MediaWiki page views API. I went searching first and found two places:
I chose the second option.
The documentation suggested I should set up a unique user-agent. Searching for how to do that brought me to this StackOverflow post: https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python which I followed to set up the headers appropriately.
Between that and the interactive material in the Wikimedia REST API documentation, I was able to construct a URL.
We will build up something like file 1, version 1:
- setting the header
- json.dumps() [mention that I'll skip this until we have an error]
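Here is a rough sketch of what file 1, version 1 might look like. It is not the exact code from class. The URL layout is my reading of the per-article endpoint in the Wikimedia REST API, and the user-agent string, the function names, the article title, and the output filename are all placeholders I made up for illustration. The json.dumps() call is there because f.write() only accepts a string: handing it the dictionary directly raises a TypeError, which is presumably the error we would run into first.

```python
import json

# Assumed layout of the per-article pageviews endpoint (check the docs).
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

# Placeholder user-agent: replace it with something that identifies you.
HEADERS = {"User-Agent": "cdsc-week5-example/0.1 (your_email@example.edu)"}


def build_url(article, start, end, project="en.wikipedia"):
    """Return the URL for daily page views of one article.

    start and end are date strings like "20230101".
    """
    return f"{BASE_URL}/{project}/all-access/user/{article}/daily/{start}/{end}"


def fetch_payload(url):
    """Download the URL and return the JSON payload as a dictionary."""
    # requests is a third-party package; it is imported here so the
    # other helpers still work on a machine without it installed.
    import requests

    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # stop early on a 4xx or 5xx response
    return response.json()


def save_payload(payload, filename):
    """Write the raw payload to a file as JSON text."""
    with open(filename, "w") as f:
        # f.write() needs a string, so convert the dictionary first.
        f.write(json.dumps(payload))


# Example use (needs network access and the requests package):
# url = build_url("University_of_Washington", "20230101", "20230131")
# save_payload(fetch_payload(url), "uw_pageviews.json")
```

Saving the payload untouched means the later steps can be rerun as often as needed without hitting the API again.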
Stage 2: Reading in data
Walk through building file 2, version 1 with a focus on:
- opening files and using f.read(), which reads the whole file in
- outputting days and views
- try to graph... we'll have an error when we try to graph
- write some new code to create better formatted date strings...
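A minimal sketch of file 2, version 1, under a few assumptions: the saved payload has an "items" list, each item has a "timestamp" string like "2023010100" and a "views" number, and the TSV should have "date" and "views" columns. The function names are mine, not anything fixed by the API.

```python
import json


def read_payload(filename):
    """Read a saved JSON payload back into a dictionary."""
    with open(filename) as f:
        text = f.read()  # reads the whole file in as one string
    return json.loads(text)


def format_date(timestamp):
    """Turn a timestamp like "2023010100" into "2023-01-01"."""
    year = timestamp[0:4]
    month = timestamp[4:6]
    day = timestamp[6:8]
    return f"{year}-{month}-{day}"


def write_tsv(payload, filename):
    """Write one line per day: the formatted date, a tab, and the views."""
    with open(filename, "w") as out:
        out.write("date\tviews\n")  # header row
        for item in payload["items"]:
            day = format_date(item["timestamp"])
            views = item["views"]
            out.write(f"{day}\t{views}\n")
```

Slicing the string is the simplest way to get a readable date here. The datetime module would do the same job and also check that the date is valid.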
Stages 3 and 4: Let's extend to multiple things
- let's build a couple of functions. Maybe one for dates? Maybe one for getting_pageview data? Let's refactor the old code to use these.
- let's build in waiting for a second with
- let's count with a dictionary
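One way the pieces above could fit together, as a sketch rather than the finished code: a small function that totals one payload, and a loop that goes over several pages, pauses between them, and stores each total in a dictionary. I am assuming time.sleep() for the pause. The function names and the get_payload argument are hypothetical; get_payload stands in for whatever function downloads or loads the data for one page.

```python
import time


def total_views(payload):
    """Add up the "views" field across every item in one payload."""
    total = 0
    for item in payload["items"]:
        total = total + item["views"]
    return total


def count_views(pages, get_payload, delay=1):
    """Return a dictionary mapping each page title to its total views.

    get_payload is any function that takes a page title and returns
    a payload dictionary. delay is the pause in seconds after each page.
    """
    totals = {}
    for page in pages:
        payload = get_payload(page)
        # .get(page, 0) starts the count at 0 the first time we see a page.
        totals[page] = totals.get(page, 0) + total_views(payload)
        time.sleep(delay)  # wait so we don't hammer the API
    return totals
```

Passing the download function in as an argument makes it easy to try the counting logic on saved files before pointing it at the live API.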