# Save the revision metadata printed in <code>wikipedia1-2.py</code> (i.e., the material already being printed out) to a file called "wikipedia_revisions.tsv".
# Print out the revision ids and edit summaries (i.e., <code>comment</code>) of each revision for the article on Python. ''Hint: modify example 1''
## ''modified'' Print out the id, the parent id, and the content of the revision.
# Find out what other data or metadata you can print out for a revision for an article. ''Hint: this isn't a coding question''
# Which article is in more categories? [[:wiki:Python (programming language)|Python (programming language)]] or [[:wiki:R (programming language)|R (programming language)]]? ''Hint: modify question 2 (example 1). You'll want to investigate the titles key in the Wikipedia API''
# Find out how many revisions to the article on "Python (programming language)" were made by user "Peterl"? How about "Hfastedge"? ''Hint: modify example 1-2. You'll want to make sure you get username from the API''
# How would you use the API to find out how many revisions/edits the user "Benjamin Mako Hill" has made to Wikipedia? ''Hint: coming''
# Can you build a list of all of the articles edited by "Benjamin Mako Hill"? What is the article with the longest title that user Benjamin Mako Hill has edited? ''Hint: coming''
# How many edits to the article "Python (programming language)" were made in 2014? ''Hint: example 1''

;Here's a much more complicated challenge but a fun one that you know enough to solve: Check out the game [http://kevan.org/catfishing.php Catfishing], which shows you categories and has you guess an article. Write a version that uses the Wikipedia API. For example, pick 5 articles and write a program that will randomly show the categories for one of those articles and ask you to guess the article. Read the guess with <code>input()</code> and let the user know if they got it right or wrong!

== Helpful script ==

This script walks through our exploration of a query for categories on a page, which we did right at the end of class on Wednesday.

<source lang="python">
# Import the requests lib.
import requests

# Set up a query that grabs categories for the Python page in JSON format.
request_dict = {
    'action': 'query',
    'format': 'json',
    'prop': 'categories',
    'titles': 'Python_(programming_language)',
    'clprop': 'timestamp'
}

# Make a call to the Wikipedia API.
wp_call = requests.get('https://en.wikipedia.org/w/api.php', params=request_dict)

# Create a dict from JSON.
response = wp_call.json()

# Let's just print it!
print(response)

# Whoa... big dictionary here. [Question: how did I know it was a dictionary from printing it?]

type(response)
# Out[7]: dict
# Ok, confirmed... it's a dictionary.

# If something is a dictionary then check its keys.
response.keys()
# Out[8]: dict_keys(['continue', 'query'])

# I told you not to worry much about continue, so let's look at query.

# Q1: what type is it?
type(response['query'])
# Out[9]: dict

print(response['query'])
# Whoops.... still huge. Let's explore more.

# Ok, so response['query'] is a dict. Which means it has keys!
response['query'].keys()
# Out[10]: dict_keys(['pages', 'normalized'])

response['query']['normalized']
# Out[11]:
# [{'from': 'Python_(programming_language)',
#   'to': 'Python (programming language)'}]

# Ok, so normalized is a small list [HOW DID I KNOW?]. I can pretty much see what it's listing: ways of rewriting the query.
# In this case, it changed the underscores (_) to spaces.

response['query']['pages']
# Whoa... still huge. Let's explore more.

type(response['query']['pages'])
# Out[13]: dict

# Ok, it's a dict. Let's look at keys!

response['query']['pages'].keys()
# dict_keys(['23862'])

# One key. This is the page id! [WHAT IF YOU CHANGE titles IN THE INPUT TO QUERY TWO PAGES?]
response['query']['pages']['23862']
# Still big, so let's keep going.

response['query']['pages']['23862'].keys()
# Out[16]: dict_keys(['categories', 'pageid', 'ns', 'title'])

# Let's look at each key.
response['query']['pages']['23862']['title']
# Out[17]: 'Python (programming language)'
# That one makes sense...

response['query']['pages']['23862']['ns']
# Out[18]: 0
# I don't know what it is but it doesn't seem useful right now. I'll keep exploring.

response['query']['pages']['23862']['pageid']
# Out[19]: 23862
# This is an int (how did I know?) that corresponds to the key in response['query']['pages'],
# although that key itself is a string ('23862').

response['query']['pages']['23862']['categories']
# It's a list [HOW DID I KNOW from the printout?] Still kind of long... let's keep going.

type(response['query']['pages']['23862']['categories'])
# Out[20]: list
# Ok, confirmed it's a list. I got the same info from the printout above, which started with '['.

len(response['query']['pages']['23862']['categories'])
# Out[21]: 10

# Ten categories. The docs say that's the default limit.

response['query']['pages']['23862']['categories'][0]
# Out[22]:
# {'ns': 14,
#  'timestamp': '2016-02-03T16:53:02Z',
#  'title': 'Category:Articles with DMOZ links'}

# Now I've learned something: the elements of categories are DICTs (note the '{', '}' in output or use type)
# I've also learned that every category has a title.

# What's next?
# REPEAT THIS EXERCISE but query Wikipedia for revisions, not categories. Walk through the JSON output, which is
# composed of lists and dictionaries.

</source>
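
The walkthrough above hard-codes the page id <code>'23862'</code> at every step. As a rough sketch of how the same exploration might be wrapped up without knowing the page id in advance, the snippet below loops over whatever pages come back. The function name <code>category_titles_by_page</code> and the sample data are made up for illustration (only the DMOZ category comes from the output above), so treat this as one possible approach rather than the way to do it.

```python
def category_titles_by_page(response):
    """Map each page title in a categories response to a list of its category titles."""
    titles = {}
    # .values() visits every page, so no page id needs to be hard-coded.
    for page in response['query']['pages'].values():
        # Assumption: a page may come back without a 'categories' key,
        # so fall back to an empty list instead of raising a KeyError.
        categories = page.get('categories', [])
        titles[page['title']] = [category['title'] for category in categories]
    return titles


# Hand-written sample shaped like the output explored above. The second page
# and every category except the DMOZ one are invented for illustration.
sample_response = {
    'query': {
        'pages': {
            '23862': {
                'pageid': 23862,
                'ns': 0,
                'title': 'Python (programming language)',
                'categories': [
                    {'ns': 14, 'title': 'Category:Articles with DMOZ links'},
                    {'ns': 14, 'title': 'Category:Example category A'},
                ],
            },
            '99999': {
                'pageid': 99999,
                'ns': 0,
                'title': 'Made-up page',
                'categories': [
                    {'ns': 14, 'title': 'Category:Example category B'},
                ],
            },
        }
    }
}

result = category_titles_by_page(sample_response)
for title, categories in result.items():
    # Print each page title with its category count, then the categories.
    print(title, len(categories))
    for category in categories:
        print('   ', category)
```

To try this on live data, pass it the dictionary returned by <code>wp_call.json()</code> in the script above. Because it only reads <code>response['query']['pages']</code>, it should behave the same whether the query asked for one page or several.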