Community Data Science Course (Spring 2023)/Week 5 lecture notes

New concepts for the day:
* Defining functions
* <code>import json</code> and <code>json.loads()</code> and <code>json.dumps()</code> (demonstrated just below)
* Reading ''from'' files
* Breaking projects into multiple notebooks and steps
* Waiting... <code>time.sleep(1)</code>
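
A minimal sketch of the JSON round-trip named in the list above (the example dictionary is made up):

<syntaxhighlight lang="python">
import json

# json.dumps() turns a Python object into a JSON-formatted string
record = {"article": "University_of_Washington", "views": 1234}
as_text = json.dumps(record)
print(as_text)           # {"article": "University_of_Washington", "views": 1234}

# json.loads() parses a JSON string back into Python objects
parsed = json.loads(as_text)
print(parsed["views"])   # 1234
</syntaxhighlight>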


== Stage 0: Coming up with a plan ==


I want to download page view data for three universities and present the sum total for each.

I'm going to split the work into three steps:

* collect the data from the web and write the raw JSON "payload" to a file
* read the data from the file and do whatever data extraction, cleaning, counting, etc. is needed; then write a TSV file
* open the TSV file and make a graph


== Stage 1: Getting data ==
I want to build data on how popular something is using the MediaWiki views API. First I went searching and found two places:

I chose the second option.

The documentation suggested I should set up a unique user-agent. Searching how to do that brought me to this [https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python StackOverflow post], which I followed to set up headers appropriately.

Between that and the interactive material in the [https://www.mediawiki.org/wiki/Wikimedia_REST_API Wikimedia REST API] documentation, I was able to construct a URL.

We will '''build up something like file 1, version 1''' (a sketch follows below):
* setting the header
* <code>json.dumps()</code> [mention that I'll skip this until we have an error]
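
A minimal sketch of what file 1, version 1 might look like. The user-agent string, article title, date range, and output filename are all placeholders, and it assumes the per-article endpoint of the pageviews API:

<syntaxhighlight lang="python">
import json
import requests

# a unique user-agent, as the API documentation asks for (contact address is a placeholder)
headers = {'User-Agent': 'CDSC Week 5 example <youremail@example.com>'}

# per-article daily pageviews; article title and date range are placeholders
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/all-agents/University_of_Washington/daily/2023010100/2023033100")

response = requests.get(url, headers=headers)

# write the raw JSON "payload" to a file so the next notebook can pick it up
with open('pageviews_raw.json', 'w') as f:
    f.write(json.dumps(response.json()))
</syntaxhighlight>
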
== Stage 2: Reading in data ==
Walk through building '''file 2, version 1''' (see the sketch after this list) with a focus on:
* opening files with <code>open(filename, 'r')</code>
* <code>f.read()</code>, which reads the whole file in as one string
* <code>json.loads()</code>
* outputting days and views
* trying to graph... we'll hit an error when we try to graph
* writing some new code to create better formatted date strings...
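
A minimal sketch of file 2, version 1, assuming the placeholder payload filename from the Stage 1 sketch and the <code>items</code>/<code>timestamp</code>/<code>views</code> structure the pageviews API returns:

<syntaxhighlight lang="python">
import json

# read the whole raw payload file in as one big string
with open('pageviews_raw.json', 'r') as f:
    raw_data = f.read()

# json.loads() turns the JSON string back into dictionaries and lists
data = json.loads(raw_data)

# write a TSV of days and views, reformatting timestamps like
# "2023010100" into date strings like "2023-01-01" along the way
with open('pageviews.tsv', 'w') as out:
    out.write('day\tviews\n')
    for item in data['items']:
        ts = item['timestamp']
        day = ts[0:4] + '-' + ts[4:6] + '-' + ts[6:8]
        out.write(day + '\t' + str(item['views']) + '\n')
</syntaxhighlight>
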
== Stage 3: Let's extend to multiple things ==
* Let's build a couple of functions: maybe one for dates? maybe one for getting pageview data? Let's refactor the old code to use these (see the sketch below).
* Let's build in waiting for a second with <code>time.sleep(1)</code>.
* Let's count with a dictionary.
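
A sketch of how the refactored pieces might fit together. The three university article titles are placeholders (the notes don't say which three), and the URL assumes the same per-article endpoint as above:

<syntaxhighlight lang="python">
import time
import requests

headers = {'User-Agent': 'CDSC Week 5 example <youremail@example.com>'}  # placeholder contact

def format_day(timestamp):
    # turn a timestamp like "2023010100" into "2023-01-01"
    return timestamp[0:4] + '-' + timestamp[4:6] + '-' + timestamp[6:8]

def get_pageview_data(article):
    # download daily pageview data for one article and return the parsed JSON
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/" + article + "/daily/2023010100/2023033100")
    response = requests.get(url, headers=headers)
    return response.json()

# count total views for each university with a dictionary
total_views = {}
for article in ['University_of_Washington', 'Purdue_University', 'Northwestern_University']:
    data = get_pageview_data(article)
    total_views[article] = sum(item['views'] for item in data['items'])
    time.sleep(1)  # wait a second between requests to be polite to the API

print(total_views)
</syntaxhighlight>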
