# Can you build a list of all of the articles edited by "Benjamin Mako Hill"? What is the article with the longest title that user Benjamin Mako Hill has edited? ''Hint: coming''
# How many edits to the article "Python (programming language)" were made in 2014? ''Hint: example 1''

== Helpful script ==

This script walks through our exploration of a query for categories on a page, which we did right at the end of class on Wednesday.

<source lang="python">
# Import the requests library.
import requests

# Set up a query that grabs categories for the Python page in JSON format.
request_dict = {
    'action': 'query',
    'format': 'json',
    'prop': 'categories',
    'titles': 'Python_(programming_language)',
    'clprop': 'timestamp'
}

# Make a call to the Wikipedia API.
wp_call = requests.get('https://en.wikipedia.org/w/api.php', params=request_dict)

# Create a dict from the JSON response.
response = wp_call.json()

# Let's just print it!
print(response)

# Whoa... big dictionary here. [Question: how did I know it was a dictionary from printing it?]

type(response)
# Out[7]: dict
# OK, confirmed... it's a dictionary.

# If something is a dictionary, then check its keys.
response.keys()
# Out[8]: dict_keys(['continue', 'query'])

# I told you not to worry much about continue, so let's look at query.

# Q1: what type is it?
type(response['query'])
# Out[9]: dict

print(response['query'])
# Whoops... still huge. Let's explore more.

# OK, so response['query'] is a dict, which means it has keys!
response['query'].keys()
# Out[10]: dict_keys(['pages', 'normalized'])


response['query']['normalized']
# Out[11]:
# [{'from': 'Python_(programming_language)',
#   'to': 'Python (programming language)'}]

# OK, so normalized is a small list [HOW DID I KNOW?]. I can pretty much see what it's listing: ways the API rewrote the titles in the query.
# In this case, it changed the underscores (_) to spaces.

response['query']['pages']
# Whoa... still huge. Let's explore more.


type(response['query']['pages'])
# Out[13]: dict

# OK, it's a dict. Let's look at the keys!

response['query']['pages'].keys()
# Out[14]: dict_keys(['23862'])

# One key. This is the page id! [WHAT IF YOU CHANGE titles IN THE INPUT TO QUERY TWO PAGES?]
response['query']['pages']['23862']
# Still big, so let's keep going.

response['query']['pages']['23862'].keys()
# Out[16]: dict_keys(['categories', 'pageid', 'ns', 'title'])

# Let's look at each key.
response['query']['pages']['23862']['title']
# Out[17]: 'Python (programming language)'
# That one makes sense...

response['query']['pages']['23862']['ns']
# Out[18]: 0
# I don't know what it is, but it doesn't seem useful right now. I'll keep exploring.

response['query']['pages']['23862']['pageid']
# Out[19]: 23862
# This is an int (how did I know?) that apparently corresponds to the key in response['query']['pages']
# (note that the key itself is the string '23862', not an int).


response['query']['pages']['23862']['categories']
# It's a list [HOW DID I KNOW from the printout?]. Still kind of long... let's keep going.

type(response['query']['pages']['23862']['categories'])
# Out[20]: list
# OK, confirmed it's a list. I got the same info when the printout above started with '['.

len(response['query']['pages']['23862']['categories'])
# Out[21]: 10

# Ten categories. The docs say that's the default limit.

response['query']['pages']['23862']['categories'][0]
# Out[22]:
# {'ns': 14,
#  'timestamp': '2016-02-03T16:53:02Z',
#  'title': 'Category:Articles with DMOZ links'}

# Now I've learned something: the elements of categories are DICTs (note the '{', '}' in the output, or use type).
# I've also learned that each category dict has a 'title' key.

# What's next?
# REPEAT THIS EXERCISE, but query Wikipedia for revisions instead of categories. Walk through the JSON output,
# which is composed of lists and dictionaries.

</source>
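
The walkthrough repeats the same three moves at every level: check the type, list the keys if it is a dict, and look at the first element if it is a list. As a rough sketch, those moves can be wrapped in a small helper. The function name <code>describe</code> and the trimmed <code>sample_response</code> are assumptions made for illustration (the sample only mirrors the printouts shown in the script), so treat this as one possible shortcut rather than the way we did it in class.

```python
def describe(obj, depth=0, max_depth=6):
    """Return a list of lines summarizing a nested structure of dicts and lists."""
    pad = '    ' * depth
    if isinstance(obj, dict):
        # Same move as calling .keys() by hand.
        lines = [f"{pad}dict with keys {list(obj)}"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.append(f"{pad}  key {key!r}:")
                lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, list):
        # Same move as calling len() and then looking at element [0].
        lines = [f"{pad}list of length {len(obj)}"]
        if obj and depth < max_depth:
            lines.append(f"{pad}  element 0:")
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    # Anything else (str, int, ...) is a leaf: show its type and value.
    return [f"{pad}{type(obj).__name__}: {obj!r}"]


# Hypothetical, trimmed-down copy of the response explored in the script.
sample_response = {
    'query': {
        'normalized': [{'from': 'Python_(programming_language)',
                        'to': 'Python (programming language)'}],
        'pages': {
            '23862': {
                'pageid': 23862,
                'ns': 0,
                'title': 'Python (programming language)',
                'categories': [{'ns': 14,
                                'timestamp': '2016-02-03T16:53:02Z',
                                'title': 'Category:Articles with DMOZ links'}],
            }
        }
    }
}

summary = describe(sample_response)
print('\n'.join(summary))
```

Calling <code>describe(response)</code> on a real response should give the same kind of outline, which may help when repeating the exercise for revisions.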
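
The script hardcodes the page id <code>'23862'</code>, which stops working as soon as <code>titles</code> names a different page or more than one page. A minimal sketch of one way around that is to loop over the values of <code>response['query']['pages']</code> instead. The helper name <code>category_titles</code> and the second page in the sample data are hypothetical, made up only to show the shape of a two-page response; the sketch also assumes a page may come back without a <code>'categories'</code> key.

```python
def category_titles(response):
    """Map each page title to the list of category titles found for that page."""
    titles_by_page = {}
    # .values() gives each page dict without needing to know its page id.
    for page in response['query']['pages'].values():
        # Assumption: a page may have no 'categories' key, so default to [].
        categories = page.get('categories', [])
        titles_by_page[page['title']] = [cat['title'] for cat in categories]
    return titles_by_page


# Hypothetical two-page response: the first page mirrors the printouts in the
# script; the second page (id '1', 'Example page') is invented for illustration.
two_page_response = {
    'query': {
        'pages': {
            '23862': {
                'pageid': 23862,
                'ns': 0,
                'title': 'Python (programming language)',
                'categories': [{'ns': 14,
                                'timestamp': '2016-02-03T16:53:02Z',
                                'title': 'Category:Articles with DMOZ links'}],
            },
            '1': {
                'pageid': 1,
                'ns': 0,
                'title': 'Example page',
            },
        }
    }
}

result = category_titles(two_page_response)
for page_title, cats in result.items():
    print(page_title, cats)
```

The same loop-over-values pattern should carry over to other <code>prop</code> queries, since each page dict sits under its own page id key.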