Community Data Science Course (Spring 2016)/Day 5 Coding Challenges

From CommunityData



# Save the revision metadata printed in <code>wikipedia1-2.py</code>  (i.e., the material already being printed out) to a file called "wikipedia_revisions.tsv".
# Print out the revision ids and edit summaries (i.e., <code>comment</code>) of each revision for the article on Python. ''Hint: modify example 1''
## ''modified'' Print out the id, the parent id, and the content of the revision.
# Find out what other data or metadata you can print out for a revision for an article. ''Hint: this isn't a coding question''
# Which article is in more categories? [[:wiki:Python (programming language)|Python (programming language)]] or [[:wiki:R (programming language)|R (programming language)]]?  ''Hint: modify question 2 (example 1). You'll want to investigate the titles key in the wikipedia api''
# Find out how many revisions to the article on "Python (programming language)" were made by user "Peterl"? How about "Hfastedge"? ''Hint: modify example 1-2. You'll want to make sure you get username from the api''
# How would you use the API to find out how many revisions/edits the user "Benjamin Mako Hill" has made to Wikipedia? ''Hint: coming''
# Can you build a list of all of the articles edited by "Benjamin Mako Hill"? What is the article with the longest title that user Benjamin Mako Hill has edited? ''Hint: coming''
# How many edits to the article "Python (programming language)" were made in 2014? ''Hint: example 1''
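Several of the challenges above (counting edits by a user, counting edits in a year) come down to looping over a list of revision dictionaries. Here is a minimal sketch of just that counting step, assuming each revision is a dictionary with <code>user</code> and <code>timestamp</code> keys. The sample revisions and usernames are made up for illustration, and the function names are hypothetical; fetching the real revisions from the API is still up to you.

```python
# A sketch of counting revisions once you already have them as a list of dicts.
# ASSUMPTION: each revision dict has 'user' and 'timestamp' keys. The data
# below is made up for illustration and does not come from Wikipedia.
sample_revisions = [
    {'revid': 1, 'user': 'ExampleUserA', 'timestamp': '2013-12-31T23:59:00Z'},
    {'revid': 2, 'user': 'ExampleUserB', 'timestamp': '2014-01-15T08:30:00Z'},
    {'revid': 3, 'user': 'ExampleUserA', 'timestamp': '2014-06-02T17:45:00Z'},
]

def count_by_user(revisions, username):
    # Count the revisions whose 'user' value matches username exactly.
    count = 0
    for rev in revisions:
        if rev.get('user') == username:
            count = count + 1
    return count

def count_in_year(revisions, year):
    # Timestamps start with the four-digit year, so check the prefix.
    prefix = str(year) + '-'
    count = 0
    for rev in revisions:
        if rev.get('timestamp', '').startswith(prefix):
            count = count + 1
    return count

print(count_by_user(sample_revisions, 'ExampleUserA'))
print(count_in_year(sample_revisions, 2014))
```

Checking the start of the timestamp string is a shortcut that avoids parsing dates; it relies on the timestamps all being in the same year-first format.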


== Helpful script ==
;Here's a much more complicated challenge, but a fun one that you know enough to solve: Check out the game [http://kevan.org/catfishing.php Catfishing], which shows you categories and has you guess an article. Write a version that uses the Wikipedia API. For example, pick 5 articles and write a program that randomly shows the categories for one of those articles and asks you to guess the article. Read the guess with <code>input()</code> and let the user know if they got it right or wrong!
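
One possible way to structure the guessing game is sketched below. It only covers the game logic: the dictionary of categories is hand-written placeholder data (an assumption, not real Wikipedia output), and all of the function names are hypothetical. A full solution would fill the dictionary with categories fetched from the API.

```python
import random

# ASSUMPTION: this dict is hand-written placeholder data. In a real version
# you would fill it in with categories fetched from the Wikipedia API.
categories_by_article = {
    'Example article one': ['Category:Placeholder A', 'Category:Placeholder B'],
    'Example article two': ['Category:Placeholder C'],
}

def pick_article(articles):
    # Randomly choose one article title from the dict's keys.
    return random.choice(list(articles.keys()))

def is_correct(guess, title):
    # Compare the guess to the title, ignoring case and surrounding spaces.
    return guess.strip().lower() == title.strip().lower()

def play_round(articles, ask=input):
    # Show the categories for a random article and check the user's guess.
    title = pick_article(articles)
    for category in articles[title]:
        print(category)
    guess = ask('Which article is this? ')
    if is_correct(guess, title):
        print('You got it right!')
        return True
    print('Wrong! It was: ' + title)
    return False

# To play one round, call: play_round(categories_by_article)
```

Passing <code>input</code> in as the <code>ask</code> argument is a design choice that makes it possible to try the function with a canned answer instead of typing one.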
 
This script walks through our exploration of a query for categories on a page, which we did right at the end of class on Wednesday.
 
<source lang="python">
# Import the requests lib.
import requests
 
# Set up a query that grabs categories for the python page in json format.
request_dict = {
    'action': 'query',
    'format': 'json',
    'prop': 'categories',
    'titles': 'Python_(programming_language)',
    'clprop': 'timestamp'
}
 
# Make a call to the wikipedia api.
wp_call = requests.get('https://en.wikipedia.org/w/api.php', params=request_dict)
 
# Create a dict from json.
response = wp_call.json()
 
# Let's just print it!
print(response)
 
# Woah... big dictionary here. [Question: how did I know it was a dictionary from printing it?]
 
type(response)
# Out[7]: dict
# Ok, confirmed... it's a dictionary.
 
# If something is a dictionary then check its keys.
response.keys()
# Out[8]: dict_keys(['continue', 'query'])
 
# I told you not to worry much about continue, so let's look at query.
 
# Q1: what type is it?
type(response['query'])
# Out[9]: dict
 
print(response['query'])
# Woops.... still huge. Let's explore more.
 
# Ok, so response['query'] is a dict. Which means it has keys!
response['query'].keys()
# Out[10]: dict_keys(['pages', 'normalized'])
 
 
response['query']['normalized']
# Out[11]:
# [{'from': 'Python_(programming_language)',
#  'to': 'Python (programming language)'}]
 
# Ok, so normalized is a small list [HOW DID I KNOW?]. I can pretty much see what it's listing: ways of rewriting the query.
# In this case, it changed underscores to spaces.
 
response['query']['pages']
# Woah... still huge. Let's explore more.
 
 
type(response['query']['pages'])
# Out[13]: dict
 
# Ok, it's a dict. Let's look at keys!
 
response['query']['pages'].keys()
# dict_keys(['23862'])
 
# One key. This is the page id! [WHAT IF YOU CHANGE titles IN THE INPUT TO QUERY TWO PAGES?]
response['query']['pages']['23862']
# Still big, so let's keep going.
 
response['query']['pages']['23862'].keys()
# Out[16]: dict_keys(['categories', 'pageid', 'ns', 'title'])
 
# Let's look at each key.
response['query']['pages']['23862']['title']
# Out[17]: 'Python (programming language)'
# That one makes sense...
 
response['query']['pages']['23862']['ns']
# Out[18]: 0
# I don't know what it is but it doesn't seem useful right now. I'll keep exploring.
 
response['query']['pages']['23862']['pageid']
# Out[19]: 23862
# This is an int (how did I know?) that apparently corresponds to the key in response['query']['pages']
 
 
response['query']['pages']['23862']['categories']
# It's a list [HOW DID I KNOW from the printout?] Still kind of long... let's keep going.
 
type(response['query']['pages']['23862']['categories'])
# Out[20]: list
# Ok, confirmed it's a list. I got the same info when the printout above started with '['
 
len(response['query']['pages']['23862']['categories'])
# Out[21]: 10
 
# Ten categories. The docs say ten is the default limit.
 
response['query']['pages']['23862']['categories'][0]
# Out[22]:
# {'ns': 14,
#  'timestamp': '2016-02-03T16:53:02Z',
#  'title': 'Category:Articles with DMOZ links'}
 
# Now I've learned something: the elements of categories are DICTs (note the '{', '}' in output or use type)
# I've learned that there are titles in every category.
 
# What's next?
# REPEAT THIS EXERCISE but query wikipedia for revisions not categories. Walk through the json output, which is
# composed of lists and dictionaries.
 
 
</source>
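
The walkthrough above hard-codes the page id <code>'23862'</code>. As a follow-up, here is a minimal sketch of looping over whatever pages come back instead, which also works when <code>titles</code> names more than one page. It assumes the response has the same shape as the one explored above. The sample response is a trimmed, hand-written stand-in so the example runs without a network call; the second page, its id, and its categories are placeholders, and the function name is hypothetical.

```python
def category_titles_by_page(response):
    # Build a dict mapping each page title to a list of its category titles.
    result = {}
    for page_id, page in response['query']['pages'].items():
        titles = []
        # A page with no categories has no 'categories' key, so default to [].
        for category in page.get('categories', []):
            titles.append(category['title'])
        result[page['title']] = titles
    return result

# ASSUMPTION: hand-written stand-in for wp_call.json(). The first page mirrors
# the output shown above; the second page is a made-up placeholder.
sample_response = {
    'query': {
        'pages': {
            '23862': {
                'pageid': 23862,
                'ns': 0,
                'title': 'Python (programming language)',
                'categories': [
                    {'ns': 14, 'title': 'Category:Articles with DMOZ links'},
                ],
            },
            '0': {
                'pageid': 0,
                'ns': 0,
                'title': 'Example second page',
                'categories': [
                    {'ns': 14, 'title': 'Category:Placeholder A'},
                    {'ns': 14, 'title': 'Category:Placeholder B'},
                ],
            },
        }
    }
}

for title, categories in category_titles_by_page(sample_response).items():
    print(title, len(categories))
```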