Community Data Science Course (Spring 2016)/Day 5 Coding Challenges

Latest revision as of 18:52, 29 April 2016

Get the software

http://mako.cc/teaching/2015/community_data_science/wikipedia-data-examples.zip

Each of the challenges this week will ask you to modify and work with code in the zip file above.

As always, it's not essential that you solve or get through all of these; I'm not grading your answers. That said, being able to work through many of them is a good sign that you have mastered the concepts for the week. It is always fine to collaborate or work together on these problem sets. The only thing I ask is that you do not broadcast answers on Canvas before Sunday at midnight.

Challenges

  1. Save the revision metadata printed in wikipedia1-2.py (i.e., the material already being printed out) to a file called "wikipedia_revisions.tsv".
  2. Print out the revision ids and edit summaries (i.e., comment) of each revision for the article on Python. Hint: modify example 1
    1. (Modified) Print out the id, the parent id, and the content of the revision.
  3. Find out what other data or metadata you can print out for a revision for an article. Hint: this isn't a coding question
  4. Which article is in more categories: Python (programming language) or R (programming language)? Hint: modify question 2 (example 1). You'll want to investigate the titles parameter in the Wikipedia API
  5. Find out how many revisions to the article on "Python (programming language)" were made by user "Peterl". How about "Hfastedge"? Hint: modify example 1-2. You'll want to make sure you get the username from the API
  6. How would you use the API to find out how many revisions/edits the user "Benjamin Mako Hill" has made to Wikipedia? Hint: coming
  7. Can you build a list of all of the articles edited by "Benjamin Mako Hill"? What is the article with the longest title that user Benjamin Mako Hill has edited? Hint: coming
  8. How many edits to the article "Python (programming language)" were made in 2014? Hint: example 1
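
For the first challenge, the file-writing step on its own might look like the minimal sketch below. It is not the code from the zip file: the `revisions` records, their field names, and the `write_tsv` helper are all hypothetical and only illustrate how Python's `csv` module writes tab-separated rows.

```python
import csv
import io

# Hypothetical revision records. The field names and values are made up
# for illustration and are not real output from the Wikipedia API.
revisions = [
    {'revid': 1, 'timestamp': '2014-01-01T00:00:00Z', 'user': 'ExampleUser'},
    {'revid': 2, 'timestamp': '2014-01-02T00:00:00Z', 'user': 'AnotherUser'},
]


def write_tsv(rows, fieldnames, output):
    """Write a header line and one tab-separated line per row to output."""
    writer = csv.DictWriter(output, fieldnames=fieldnames,
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)


# Write into an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_tsv(revisions, ['revid', 'timestamp', 'user'], buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, the same helper could be handed a file opened with `open('wikipedia_revisions.tsv', 'w', newline='', encoding='utf-8')`.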

Helpful script

This script walks through our exploration of a query for categories on a page, which we did right at the end of class on Wednesday.

# Import the requests lib.
import requests

# Set up a query that grabs categories for the python page in json format.
request_dict = {
 'action': 'query',
 'format': 'json',
 'prop': 'categories',
 'titles': 'Python_(programming_language)',
 'clprop': 'timestamp'
}

# Make a call to the Wikipedia API, passing the query as URL parameters.
wp_call = requests.get('https://en.wikipedia.org/w/api.php', params=request_dict)

# Create a dict from json.
response = wp_call.json()

# Let's just print it!
print(response)

# Whoa... big dictionary here. [Question: how did I know it was a dictionary from printing it?]

type(response)
# Out[7]: dict
# Ok, confirmed... it's a dictionary.

# If something is a dictionary then check its keys.
response.keys()
# Out[8]: dict_keys(['continue', 'query'])

# I told you not to worry much about continue, so let's look at query.

# Q1: what type is it?
type(response['query'])
# Out[9]: dict

print(response['query'])
# Whoops... still huge. Let's explore more.

# Ok, so response['query'] is a dict. Which means it has keys!
response['query'].keys()
# Out[10]: dict_keys(['pages', 'normalized'])


response['query']['normalized']
# Out[11]:
# [{'from': 'Python_(programming_language)',
#  'to': 'Python (programming language)'}]

# Ok, so normalized is a small list [HOW DID I KNOW?]. I can pretty much see what it's listing: ways the API rewrote the query.
# In this case, it changed the underscores (_) to spaces.

response['query']['pages']
# Whoa... still huge. Let's explore more.


type(response['query']['pages'])
# Out[13]: dict

# Ok, it's a dict. Let's look at keys!

response['query']['pages'].keys()
# dict_keys(['23862'])

# One key. This is the page id! [WHAT IF YOU CHANGE titles IN THE INPUT TO QUERY TWO PAGES?]
response['query']['pages']['23862']
# Still big, so let's keep going.

response['query']['pages']['23862'].keys()
# Out[16]: dict_keys(['categories', 'pageid', 'ns', 'title'])

# Let's look at each key.
response['query']['pages']['23862']['title']
# Out[17]: 'Python (programming language)'
# That one makes sense...

response['query']['pages']['23862']['ns']
# Out[18]: 0
# I don't know what it is but it doesn't seem useful right now. I'll keep exploring.

response['query']['pages']['23862']['pageid']
# Out[19]: 23862
# This is an int (how did I know?) that apparently corresponds to the key in response['query']['pages']


response['query']['pages']['23862']['categories']
# It's a list [HOW DID I KNOW from the printout?] Still kind of long... let's keep going.

type(response['query']['pages']['23862']['categories'])
# Out[20]: list
# Ok, confirmed it's a list. I got the same info when the printout above started with '['

len(response['query']['pages']['23862']['categories'])
# Out[21]: 10

# Ten categories. The docs say that's the default limit.

response['query']['pages']['23862']['categories'][0]
# Out[22]:
# {'ns': 14,
#  'timestamp': '2016-02-03T16:53:02Z',
#  'title': 'Category:Articles with DMOZ links'}

# Now I've learned something: the elements of categories are DICTs (note the '{', '}' in output or use type)
# I've learned that there are titles in every category. 

# What's next?
# REPEAT THIS EXERCISE but query wikipedia for revisions not categories. Walk through the json output, which is
# composed of lists and dictionaries.
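
The exploration pattern used above (check the type, then look at the keys of a dict or the length of a list) can be wrapped in a small helper. The sketch below is only one way to do it: `describe` is a hypothetical function, and `sample_response` is a hand-built dictionary shaped like the output shown in the script, not real API output. It also loops over the pages instead of hard-coding the page id '23862', which helps when a query returns more than one page.

```python
# Hand-built sample shaped like the response explored above. It is trimmed
# to one category and is NOT real API output.
sample_response = {
    'query': {
        'pages': {
            '23862': {
                'pageid': 23862,
                'ns': 0,
                'title': 'Python (programming language)',
                'categories': [
                    {'ns': 14,
                     'timestamp': '2016-02-03T16:53:02Z',
                     'title': 'Category:Articles with DMOZ links'},
                ],
            },
        },
    },
}


def describe(value):
    """Summarize one level of nested JSON: its type, plus keys or length."""
    if isinstance(value, dict):
        return 'dict with keys ' + str(sorted(value.keys()))
    if isinstance(value, list):
        return 'list of length ' + str(len(value))
    return type(value).__name__ + ': ' + repr(value)


# Walk down one level at a time, just like the interactive session.
print(describe(sample_response))
print(describe(sample_response['query']))

# Loop over every page rather than hard-coding a page id.
for page_id, page in sample_response['query']['pages'].items():
    print(page_id, describe(page))
    print(describe(page['categories']))
```

The same helper should work unchanged on the output of a revisions query, since that output is also built from dictionaries and lists.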