Community Data Science Course (Spring 2019)/Day 5 Notes

Latest revision as of 23:40, 30 April 2019

Downloading data from the internet

API (Application Programming Interface): a structured way for two programs to communicate. Think of it like a contract or a secret handshake.

Examples:

The API for Twitter describes how to read tweets, write tweets, and follow people. See details here: https://dev.twitter.com/
Yelp has an API described here: https://www.yelp.com/developers
Zillow's API: https://www.zillow.com/howto/api/APIOverview.htm

What to look for when looking at an API:

Where is the documentation?
What kinds of information can I request?
How do I request information from this API?
Are there any rate limits or restrictions on use? For instance, Twitter doesn't want you downloading tweets. Zillow forbids storing bulk results. (Why?)
Is there a Python package that will help me? For instance, there is a great Python package for Twitter called tweepy that will simplify access.

Example: We're going to spend today looking at OpenStreetMap's API, called Nominatim.

Structured data and JSON

HTML is the markup language your browser uses to display information.

Do this: Go to http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa (you'll want to copy the whole thing) and view source to see the raw html. See if you can find the structured data embedded in there somewhere. It's there, but it's often difficult to teach computers to find it.

Sometimes, the only way to get data is to extract it from potentially messy HTML. This is called scraping, and Python has a library called BeautifulSoup to help with that.

Often, data providers make it easier for computers to extract information from their API by providing it in a structured format.

Do this: Go to http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa&format=json to see the same query in JSON format.

JSON: JavaScript Object Notation. JSON data looks like Python lists and dictionaries, and we'll see that it's easy to turn it into a Python variable that is a list or dictionary. Here's a sample:

   [
    {
     "place_id":"21583441",
     "licence":"Data © OpenStreetMap contributors, ODbL 1.0. http:\/\/www.openstreetmap.org\/copyright",
     "osm_type":"node",
     "osm_id":"2131716956",
     "boundingbox":[
        "47.6248735",
        "47.6249735",
        "-122.3207478",
        "-122.3206478"
     ],
     "lat":"47.6249235",
     "lon":"-122.3206978",
     "display_name":"The Confectional, 618, Broadway East, Eastlake, Capitol Hill, Seattle, King County, Washington, 98102, United States of America",
     "class":"shop",
     "type":"bakery",
     "importance":0.201,
     "icon":"http:\/\/nominatim.openstreetmap.org\/images\/mapicons\/shopping_bakery.p.20.png"
  }

]

Do this: Copy the JSON output from the query above into the JSON parser at https://jsonformatter.curiousconcept.com/. What is the structure of this document?
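
One way to check the structure from Python is with the standard library's json module. This is a minimal sketch on a trimmed copy of the sample above; the variable names (sample_text, places, first) are only for illustration.

```python
import json

# A trimmed copy of the sample response above, stored as a plain string.
sample_text = """
[
  {
    "place_id": "21583441",
    "osm_type": "node",
    "lat": "47.6249235",
    "lon": "-122.3206978",
    "class": "shop",
    "type": "bakery",
    "importance": 0.201
  }
]
"""

places = json.loads(sample_text)   # the outer JSON array becomes a Python list
first = places[0]                  # each JSON object becomes a Python dict

print(type(places).__name__)       # list
print(type(first).__name__)        # dict
print(first["type"])               # bakery

# lat and lon arrive as strings, so convert them before doing any math.
lat = float(first["lat"])
lon = float(first["lon"])
print(lat, lon)
```

So the document is a list of dictionaries, one dictionary per place. Some values (like the bounding box in the full sample) are themselves lists.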

The Python requests library

We can use the following program to download JSON data from Nominatim:

   import requests  # this imports a new package called requests
   
   response = requests.get('http://nominatim.openstreetmap.org/', {'q': '[bakery] seattle wa', 'format': 'json'})
   print(response.status_code)  # 200 means it worked.
   data = response.json()
   print(type(data))

Here, we make use of the requests library, which has excellent documentation here.
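
To see how the dictionary relates to the URL we typed into the browser, here is a small sketch that builds the query string by hand. It uses urllib.parse.urlencode from the standard library as a stand-in for what requests does internally, so the exact encoding requests produces may differ in small details. No network access is needed to run it.

```python
from urllib.parse import urlencode

base_url = 'http://nominatim.openstreetmap.org/'
params = {'q': '[bakery] seattle wa', 'format': 'json'}

# urlencode turns the dictionary into the part of the URL after the "?".
# Spaces become "+" and the square brackets are percent-encoded.
query_string = urlencode(params)
full_url = base_url + '?' + query_string

print(query_string)   # q=%5Bbakery%5D+seattle+wa&format=json
print(full_url)
```

Each key in the dictionary becomes one name=value pair, and the pairs are joined with "&", which matches the shape of the browser URL.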

Let's break down each line:

import requests imports the library so we can use it.
response = requests.get('http://nominatim.openstreetmap.org/', {'q': '[bakery] seattle wa', 'format': 'json'})

This is the most important line! Here, we "get" information from the web server. Note that we pass the url up to the "?" character as the first argument. Compare the dictionary second argument to the query we did above in our browser. How do they differ? How are they the same?

print(response.status_code) the response is a python object that contains the actual contents of the web page as well as some status information. Here, we're getting the status_code, which tells us whether the call succeeded. 200 is "good", and you will sometimes see 404 for "not found" or 500 for "server error".
print(response.content) this wasn't above, but try it anyway. response.content contains the raw contents of the webpage (in Python 3 this is a bytes object; response.text gives the same contents as a string).
data = response.json() response.json tries to convert the string content to a python list or dictionary if the content was stored in JSON format. (What happens if the content wasn't JSON?)
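
On that last question: when the content isn't JSON, parsing fails with an error instead of returning data. The sketch below uses the standard library's json.loads as a stand-in for response.json(), and parse_json_or_none is a hypothetical helper written for these notes. The exact exception that requests raises can depend on the installed version, so check its documentation.

```python
import json

def parse_json_or_none(text):
    """Return the parsed JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

html_text = '<html><body>Not JSON</body></html>'
json_text = '[{"type": "bakery"}]'

print(parse_json_or_none(html_text))   # None
print(parse_json_or_none(json_text))   # [{'type': 'bakery'}]
```

Checking response.status_code before parsing is a good habit, since an error page is usually HTML rather than JSON.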

Try it!

We'll spend the rest of class working with the Nominatim API. The docs can be found here: http://wiki.openstreetmap.org/wiki/Nominatim

1. Reverse Geocoding: Use the geocoding API to look up a specific latitude and longitude of your choice (try this building!).
2. Craft a query using the search API to find colleges in Seattle. (Hint: you'll want to set bounded=1 and use viewbox). Print the name and location of every college you find.
3. How can you tell that a place returned by the API is in fact a college?
4. Find bakeries near your home. (Hint: look at special phrases documentation)
5. Are there more cafes in Ballard or Capitol Hill? (Hint: investigate 'limit' keyword)
6. Are there more dentists near the University or near Downtown?
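
Several of these exercises come down to looping over the list of results and looking at the fields of each dictionary. Here is a rough sketch of that pattern. The data is made up (hypothetical places shaped like the sample earlier in these notes) and places_of_type is a helper invented for illustration, so swap in the real list you get back from the API.

```python
# Hypothetical results, shaped like the Nominatim sample earlier in these notes.
places = [
    {"display_name": "Example Bakery, Seattle", "class": "shop", "type": "bakery",
     "lat": "47.62", "lon": "-122.32"},
    {"display_name": "Example Cafe, Seattle", "class": "amenity", "type": "cafe",
     "lat": "47.66", "lon": "-122.38"},
    {"display_name": "Another Bakery, Seattle", "class": "shop", "type": "bakery",
     "lat": "47.61", "lon": "-122.33"},
]

def places_of_type(places, wanted_type):
    """Keep only the results whose "type" field matches wanted_type."""
    return [p for p in places if p.get("type") == wanted_type]

bakeries = places_of_type(places, "bakery")
print(len(bakeries))  # 2
for place in bakeries:
    # Print the name and location of each matching place.
    print(place["display_name"], place["lat"], place["lon"])
```

Counting the filtered results for two different areas is one way to approach the "are there more..." questions.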