Community Data Science Course (Spring 2023)/Week 5 lecture notes

New concepts for the day:

  • Defining functions
  • import json and json.loads() and json.dumps()
  • Reading from files
  • Breaking projects into multiple notebooks and steps
  • Waiting... time.sleep(1)

Stage 0: Coming up with a plan

I want to download page view data for three universities and present the total number of views for each.

I'm going to split the work into three steps:

  • collect the data from the web and write the raw JSON "payload" to a file
  • read the data from the file and do whatever data extraction, cleaning, counting, etc. is needed; then write a TSV file
  • open the TSV file and make a graph

Stage 1: Getting data

I want to build a dataset on how popular a page is using the MediaWiki page views API. I went searching first and found two places:

I chose the second option.

The documentation suggested I should set up a unique user-agent. Searching for how to do that brought me to this StackOverflow post: https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python which I followed to set up the headers appropriately.

Between that and the interactive material in the Wikimedia REST API documentation, I was able to construct a URL.

We will build up something like file 1, version 1:

  • setting the header
  • json.dumps() [mention that I'll skip this until we have an error]
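
A minimal sketch of what file 1, version 1 might look like. It assumes the option chosen above is the per-article pageviews route of the Wikimedia REST API and that the requests library is installed. The function names, the User-Agent text, and the contact address are placeholders invented for illustration, not part of the original notebook.

```python
import json
from urllib.parse import quote

# Assumption: the per-article pageviews route of the Wikimedia REST API.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_url(article, start, end, project="en.wikipedia"):
    """Build a daily pageviews URL; start and end are YYYYMMDD strings."""
    title = quote(article, safe="")  # escape characters that are unsafe in a URL path
    return f"{BASE_URL}/{project}/all-access/all-agents/{title}/daily/{start}/{end}"


def build_headers(contact):
    """Return request headers with a descriptive User-Agent (placeholder text)."""
    return {"User-Agent": f"week5-lecture-notes ({contact})"}


def download_to_file(article, start, end, filename, contact):
    """Fetch the raw payload from the web and write it to a file as JSON text."""
    # Third-party library; imported here so the helpers above work without it.
    import requests

    response = requests.get(build_url(article, start, end),
                            headers=build_headers(contact))
    response.raise_for_status()
    payload = response.json()  # a Python dict, not a string
    with open(filename, "w") as f:
        # f.write(payload) would raise a TypeError because write() needs a string;
        # json.dumps() turns the dict into JSON text first.
        f.write(json.dumps(payload))
```

A call such as download_to_file("University_of_Washington", "20230401", "20230430", "uw_raw.json", "you@example.org") would then save one page's payload; the filename and dates here are only examples.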

Stage 2: Reading in data

Walk through building file 2, version 1, with a focus on:

  • opening files with open(filename, 'r')
  • f.read() which reads the whole file in
  • json.loads()
  • outputting days and views
  • try to graph... we'll have an error when we try to graph
  • write some new code to create better formatted date strings...
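
The steps above could be sketched roughly as follows. This assumes the saved payload is a dictionary with an "items" list whose entries carry a "timestamp" string like "2023040100" and an integer "views", which is how the Wikimedia pageviews responses are shaped. The function names are hypothetical, and SAMPLE is made-up data used only to show the shape.

```python
import json


def read_payload(filename):
    """Open the file for reading, read all of it, and parse the JSON text."""
    with open(filename, "r") as f:
        raw_text = f.read()  # reads the whole file in as one string
    return json.loads(raw_text)


def format_date(timestamp):
    """Turn a YYYYMMDDHH string such as '2023040100' into '2023-04-01'."""
    return f"{timestamp[0:4]}-{timestamp[4:6]}-{timestamp[6:8]}"


def days_and_views(payload):
    """Return two parallel lists: formatted dates and view counts."""
    days = []
    views = []
    for item in payload["items"]:
        days.append(format_date(item["timestamp"]))
        views.append(item["views"])
    return days, views


# Made-up sample in the assumed payload shape (not real data).
SAMPLE = {"items": [{"timestamp": "2023040100", "views": 10},
                    {"timestamp": "2023040200", "views": 12}]}
```

Raw timestamp strings like "2023040100" are awkward to put on a graph axis, which is one reason to reformat them into dates before plotting.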

Stages 3 and 4: let's extend to multiple things

  • let's build a couple of functions: maybe one for dates, and maybe one for getting pageview data? Then let's refactor the old code to use them
  • let's build in waiting for a second between requests with time.sleep(1)
  • let's count with a dictionary
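
One way these three ideas might fit together, as a sketch rather than the code from the lecture: a small function that sums the views in one payload, and a loop that waits between requests and stores each page's total in a dictionary. The names total_views, count_views, and get_payload are invented here; get_payload stands in for whatever function fetches or reads one page's payload.

```python
import time


def total_views(payload):
    """Add up the views across every item in one payload."""
    total = 0
    for item in payload["items"]:
        total = total + item["views"]
    return total


def count_views(pages, get_payload, pause=1):
    """Build a dictionary mapping each page title to its total views.

    get_payload is any function that takes a page title and returns a
    payload dictionary; pause is the number of seconds to wait between pages.
    """
    totals = {}
    for page in pages:
        totals[page] = total_views(get_payload(page))
        time.sleep(pause)  # be polite: wait before asking for the next page
    return totals
```

Passing the fetching step in as an argument keeps the counting logic separate from the network code, so it can be tried out on data already saved to files.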