Community Data Science Course (Spring 2023)/Week 5 lecture notes

New concepts for the day:
* Defining functions
* <code>import json</code> and <code>json.loads()</code> and <code>json.dumps()</code> (demonstrated just below)
* Reading ''from'' files
* Breaking projects into multiple notebooks and steps
* Waiting... <code>time.sleep(1)</code>
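
A minimal sketch of the JSON round-trip named in the list above (the example dictionary is made up):

<syntaxhighlight lang="python">
import json

# json.dumps() turns a Python object into a JSON-formatted string
record = {"article": "University_of_Washington", "views": 1234}
as_text = json.dumps(record)
print(as_text)           # {"article": "University_of_Washington", "views": 1234}

# json.loads() parses a JSON string back into Python objects
parsed = json.loads(as_text)
print(parsed["views"])   # 1234
</syntaxhighlight>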


== Stage 0: Coming up with a plan ==


I want to download page view data for three universities and present the sum total for each.

I'm going to split the work into three steps:

* collect the data from the web and write the raw JSON "payload" to a file
* read the data from the file and do whatever data extraction, cleaning, counting, etc. is needed; then write a TSV file
* open the TSV file and make a graph


== Stage 1: Getting data ==
I want to build data on how popular something is using the MediaWiki views API. First I went searching and found two places:

I chose the second option.

The documentation suggested I should set up a unique user-agent. Searching how to do that brought me to this [https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python StackOverflow post], which I followed to set up headers appropriately.

Between that and the interactive material in the [https://www.mediawiki.org/wiki/Wikimedia_REST_API Wikimedia REST API] documentation, I was able to construct a URL.

We will '''build up something like file 1, version 1''' (a sketch follows below):
* setting the header
* <code>json.dumps()</code> [mention that I'll skip this until we have an error]
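
A minimal sketch of what file 1, version 1 might look like. The user-agent string, article title, date range, and output filename are all placeholders, and it assumes the per-article endpoint of the pageviews API:

<syntaxhighlight lang="python">
import json
import requests

# a unique user-agent, as the API documentation asks for (contact address is a placeholder)
headers = {'User-Agent': 'CDSC Week 5 example <youremail@example.com>'}

# per-article daily pageviews; article title and date range are placeholders
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/all-agents/University_of_Washington/daily/2023010100/2023033100")

response = requests.get(url, headers=headers)

# write the raw JSON "payload" to a file so the next notebook can pick it up
with open('pageviews_raw.json', 'w') as f:
    f.write(json.dumps(response.json()))
</syntaxhighlight>
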
== Stage 2: Reading in data ==
Walk through building '''file 2, version 1''' (see the sketch after this list) with a focus on:
* opening files with <code>open(filename, 'r')</code>
* <code>f.read()</code>, which reads the whole file in as one string
* <code>json.loads()</code>
* outputting days and views
* trying to graph... we'll hit an error when we try to graph
* writing some new code to create better formatted date strings...
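
A minimal sketch of file 2, version 1, assuming the placeholder payload filename from the Stage 1 sketch and the <code>items</code>/<code>timestamp</code>/<code>views</code> structure the pageviews API returns:

<syntaxhighlight lang="python">
import json

# read the whole raw payload file in as one big string
with open('pageviews_raw.json', 'r') as f:
    raw_data = f.read()

# json.loads() turns the JSON string back into dictionaries and lists
data = json.loads(raw_data)

# write a TSV of days and views, reformatting timestamps like
# "2023010100" into date strings like "2023-01-01" along the way
with open('pageviews.tsv', 'w') as out:
    out.write('day\tviews\n')
    for item in data['items']:
        ts = item['timestamp']
        day = ts[0:4] + '-' + ts[4:6] + '-' + ts[6:8]
        out.write(day + '\t' + str(item['views']) + '\n')
</syntaxhighlight>
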
== Stage 3: Let's extend to multiple things ==
* Let's build a couple of functions: maybe one for dates? maybe one for getting pageview data? Let's refactor the old code to use these (see the sketch below).
* Let's build in waiting for a second with <code>time.sleep(1)</code>.
* Let's count with a dictionary.
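
A sketch of how the refactored pieces might fit together. The three university article titles are placeholders (the notes don't say which three), and the URL assumes the same per-article endpoint as above:

<syntaxhighlight lang="python">
import time
import requests

headers = {'User-Agent': 'CDSC Week 5 example <youremail@example.com>'}  # placeholder contact

def format_day(timestamp):
    # turn a timestamp like "2023010100" into "2023-01-01"
    return timestamp[0:4] + '-' + timestamp[4:6] + '-' + timestamp[6:8]

def get_pageview_data(article):
    # download daily pageview data for one article and return the parsed JSON
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/" + article + "/daily/2023010100/2023033100")
    response = requests.get(url, headers=headers)
    return response.json()

# count total views for each university with a dictionary
total_views = {}
for article in ['University_of_Washington', 'Purdue_University', 'Northwestern_University']:
    data = get_pageview_data(article)
    total_views[article] = sum(item['views'] for item in data['items'])
    time.sleep(1)  # wait a second between requests to be polite to the API

print(total_views)
</syntaxhighlight>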
