Community Data Science Course (Spring 2019)/Day 5 Notes
== '''Downloading data from the internet''' ==

'''API (Application Programming Interface)''': a structured way for two programs to communicate. Think of it like a contract or a secret handshake. Examples:

* The Twitter API describes how to read tweets, write tweets, and follow people. See details here: https://dev.twitter.com/
* Yelp has an API described here: https://www.yelp.com/developers
* Zillow's API: https://www.zillow.com/howto/api/APIOverview.htm

What to look for when looking at an API:

# Where is the documentation?
# What kinds of information can I request?
# How do I request information from this API?
# Are there any rate limits or restrictions on use? For instance, Twitter doesn't want you downloading tweets in bulk, and Zillow forbids storing bulk results. (Why?)
# Is there a Python package that will help me? For instance, Twitter has a great Python package called tweepy that simplifies access.

'''Example'''

We're going to spend today looking at Open Street Map's API, called [http://nominatim.openstreetmap.org/ Nominatim].

'''Structured data and JSON'''

* HTML is the markup language your browser uses to display information.

'''Do this:''' Go to <code>http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa</code> (you'll want to copy the whole thing) and view source to see the raw HTML. See if you can find the ''structured data'' embedded in there somewhere. It's there, but it's often difficult to teach computers to find it. Sometimes the only way to get data is to extract it from potentially messy HTML. This is called ''scraping'', and Python has a library called BeautifulSoup to help with that.

Often, data providers make it easier for computers to extract information through their API by providing it in a structured format.

'''Do this:''' Go to <code>http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa&format=json</code> to see the same query in JSON format.

'''JSON''': JavaScript Object Notation.
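To see how JSON maps onto Python values, here's a minimal sketch using the standard-library <code>json</code> module. The string below is hand-written for illustration, shaped like a trimmed-down Nominatim result rather than copied from a real response:

```python
import json  # standard-library JSON parser

# A hand-written string shaped like a (trimmed) Nominatim result.
raw = '''
[
  {
    "display_name": "The Confectional, 618, Broadway East, Seattle",
    "lat": "47.6249235",
    "lon": "-122.3206978",
    "type": "bakery"
  }
]
'''

data = json.loads(raw)        # JSON text -> Python objects
print(type(data))             # the JSON array becomes a Python list
print(type(data[0]))          # each JSON object becomes a Python dict
print(data[0]["type"])        # values are looked up by key, like any dict
print(float(data[0]["lat"]))  # lat/lon arrive as strings; convert before math
```

Notice that the numbers in <code>lat</code> and <code>lon</code> come back as strings, so you have to convert them with <code>float()</code> before doing arithmetic.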
JSON data looks like Python lists and dictionaries, and we'll see that it's easy to turn it into a Python variable that is a list or dictionary. Here's a sample:

 [
   {
     "place_id": "21583441",
     "licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright",
     "osm_type": "node",
     "osm_id": "2131716956",
     "boundingbox": ["47.6248735", "47.6249735", "-122.3207478", "-122.3206478"],
     "lat": "47.6249235",
     "lon": "-122.3206978",
     "display_name": "The Confectional, 618, Broadway East, Eastlake, Capitol Hill, Seattle, King County, Washington, 98102, United States of America",
     "class": "shop",
     "type": "bakery",
     "importance": 0.201,
     "icon": "http://nominatim.openstreetmap.org/images/mapicons/shopping_bakery.p.20.png"
   }
 ]

'''Do this:''' Copy the JSON output from the query above into the JSON parser at https://jsonformatter.curiousconcept.com/. What is the structure of this document?

'''The python requests library'''

We can use the following program to download JSON data from Nominatim:

 import requests  # this imports a new package called requests
 
 response = requests.get('http://nominatim.openstreetmap.org/',
                         {'q': '[bakery] seattle wa', 'format': 'json'})
 print(response.status_code)  # 200 means it worked.
 data = response.json()
 print(type(data))

Here, we make use of the requests library, which has excellent documentation [http://docs.python-requests.org/en/latest/api/ here]. Let's break down each line:

* <code>import requests</code> imports the library so we can use it.
* <code>response = requests.get('http://nominatim.openstreetmap.org/', {'q': '[bakery] seattle wa', 'format': 'json'})</code> This is the most important line! Here, we "get" information from the web server. Note that we pass the URL up to the "?" character as the first argument. Compare the dictionary in the second argument to the query we did above in our browser. How do they differ? How are they the same?
* <code>print(response.status_code)</code> The response is a Python object that contains the actual contents of the web page as well as some status information. Here, we're getting the status_code, which tells us whether the call succeeded. 200 is "good"; you will sometimes see 404 for "not found" or 500 for "server error".
* <code>print(response.content)</code> This wasn't in the program above, but try it anyway. <code>response.content</code> contains the raw contents of the web page as a string.
* <code>data = response.json()</code> <code>response.json()</code> tries to convert the string content to a Python list or dictionary if the content was stored in JSON format. (What happens if the content wasn't JSON?)
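To answer that last question: when the body isn't valid JSON, parsing fails with an exception. In the requests library this surfaces as a <code>ValueError</code> (<code>json.JSONDecodeError</code> is a subclass of it). Here's a sketch using <code>json.loads</code> directly, which is what <code>response.json()</code> relies on, so it runs without any network access; the HTML string is made up to stand in for a server error page:

```python
import json

# If the server sent something other than JSON (say, an HTML error
# page), parsing it fails. json.JSONDecodeError is a subclass of
# ValueError, so catching ValueError covers it.
try:
    json.loads("<html>This is not JSON</html>")
except ValueError as err:
    print("Could not parse as JSON:", err)
```

This is why it's a good habit to check <code>response.status_code</code> before calling <code>response.json()</code>: error pages are usually HTML, not JSON.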