Community Data Science Course (Spring 2023)/Week 4 lecture notes

From CommunityData
Revision as of 20:30, 17 April 2023 by Benjamin Mako Hill

Using APIs to download data from the internet

An API (Application Programming Interface) is a structured way for two programs to communicate. Think of it like a contract or a secret handshake. APIs exist on the Internet and, in a sense, we've already been using some APIs in Python.

An interface typically has two parts:

  • A description of how to request something
  • A description of what one will get in return

Once you understand those two things, you know the API. An API within Python typically includes a set of functions.
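As a small illustration of those two parts, here is a sketch using Python's built-in sorted() function. The choice of sorted() and the word list are only an example and are not part of the original notes.

```python
# "How to request something": call sorted() with a list and,
# optionally, a key function that says what to sort by.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=len)

# "What one will get in return": a new list; the original is unchanged.
print(by_length)  # ['fig', 'pear', 'banana']
print(words)      # ['pear', 'fig', 'banana']
```

Knowing what to pass in and what comes back is all you need to use the function, without knowing how sorting is implemented.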

A web API is quite a lot like functions in Python, but it describes how a program running on your computer can talk to another computer running a website. Basically, it's like a website your programs can visit (you : a website :: your program : a web API).

Examples:

What to look for when looking at an API
  1. Where is the documentation? Are there examples or code samples?
  2. What kinds of information can I request?
  3. How do I request information from this API?
  4. What format does it give me data back in?
  5. Are there any rate limits or restrictions on use? For instance, Twitter doesn't want you downloading tweets. Zillow forbids storing bulk results. (Why?)
  6. Is there a Python package that will help me? For instance, there is a great Python package called tweepy that simplifies access to Twitter's API.

Checklist: How do we use an API to fetch datasets?

Basic idea: your program sends a request, the API sends data back:

  • Where do you direct your request? (i.e., what are the site's API endpoints)
  • How do I write my request? Put together a URL; it will be different for different web APIs.
    • Check the documentation, look for code samples
  • How do you send a request?
    • Often the simplest way is to try it in your browser
    • Python has modules you can use, like requests, to make HTTP requests. The requests library has excellent documentation.
  • What do you get back?
    • Structured data (usually in the JSON format).
      • JSON stands for JavaScript Object Notation. JSON data looks like Python lists and dictionaries, and we'll see that it's easy to turn it into a Python variable that is a list or dictionary.
  • How do you understand (i.e. parse) the data?
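To make the last two points concrete, here is a minimal sketch of parsing JSON with Python's standard json module. The JSON text is made up for illustration and does not come from any real API.

```python
import json

# A made-up JSON document, stored as an ordinary Python string.
sample = '''
{
  "name": "Example Bakery",
  "open": true,
  "rating": 4.5,
  "tags": ["bread", "coffee"],
  "website": null
}
'''

data = json.loads(sample)   # parse the JSON text into Python objects

print(type(data))           # <class 'dict'>
print(data["tags"][0])      # bread
print(data["open"])         # True  (JSON true becomes Python True)
print(data["website"])      # None  (JSON null becomes Python None)
```

Once the text is parsed, everything else is ordinary list and dictionary work.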

How do we write Python programs that make web requests

To use APIs to build a dataset we will need:

  • all our tools from last session: variables, etc [DONE!]
  • the ability to open URLs on the web
  • the ability to create custom URLs
  • the ability to understand (i.e., parse) JSON data that APIs usually give us
  • the ability to save to files [DONE!]
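One way to build a custom URL is sketched below using urlencode from the standard library. The endpoint http://api.example.com/search and the parameter names are hypothetical placeholders, not a real API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
base_url = "http://api.example.com/search"

# The parameters we want to send, as an ordinary dictionary.
params = {"q": "coffee shop", "limit": 5}

# urlencode() turns the dictionary into a query string, escaping
# characters (like spaces) that are not allowed in a URL.
query_string = urlencode(params)
url = base_url + "?" + query_string

print(query_string)  # q=coffee+shop&limit=5
print(url)           # http://api.example.com/search?q=coffee+shop&limit=5
```

Building the URL from a dictionary rather than by hand means you never have to remember how to escape spaces or punctuation yourself.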

Our first API: Bored API

  • First of all, let's check out this page: http://www.boredapi.com/
    • Let's click through the about page and the documentation and try stuff on their web interface
      • Try the random endpoint
      • The output is JSON
  • JSON
    • HTML versus JSON
    • The good news is that JSON is (almost!) the same as Python! Just lists, strings, dictionaries, integers, floats, etc. (e.g., what type is key?)
    • Most APIs will return JSON directly
    • It's often helpful to format JSON to understand it
  • Passing parameters to an API
    • let's go back and look at the documentation
    • let's request things based on a specific number of participants
    • let's try to request things based on a price range (I'll give you all 3-4 minutes to try)
  • Making requests in Python
    • import requests
    • response = requests.get(URL, params={})
    • print(response.status_code)
    • data = response.json(); now we can check type and poke around in it
  • e.g., let's work through a quick example
    • Let's put it into a Python program to print out one activity for 1 through 5 people!
    • Let's add the type of activity to what we print out
    • Let's add another parameter (maybe a price range?)
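The first two steps of that example could be sketched roughly as follows. The notes do not give these details, so they are assumptions to check against the Bored API documentation: the endpoint path /api/activity, the participants parameter name, and the "activity" and "type" keys in the response. The service also may not always be reachable.

```python
import requests

# Assumption: the path below is not given in the notes; check the
# documentation linked from http://www.boredapi.com/ for the real endpoint.
BORED_URL = "http://www.boredapi.com/api/activity"


def describe(data):
    """Format one parsed response (a dict) as a printable line.

    Assumes the response has "activity" and "type" keys.
    """
    return "{} [{}]".format(data["activity"], data["type"])


def fetch_activity(participants):
    """Ask the API for one activity for the given number of people."""
    # Assumption: "participants" is the parameter name the API expects.
    response = requests.get(BORED_URL, params={"participants": participants})
    response.raise_for_status()  # stop with an error on 404, 500, etc.
    return response.json()


def print_activities():
    """Print one activity each for 1 through 5 people."""
    for n in range(1, 6):
        print(n, "people:", describe(fetch_activity(n)))
```

Calling print_activities() makes five requests, one per group size. Adding another parameter means adding another key to the params dictionary, using the key names from the API documentation.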

Introducing the OSM Nominatim API

We're going to spend today looking at OpenStreetMap's API, called Nominatim.

A simple request:

import requests  # this imports a new package called requests

response = requests.get('http://nominatim.openstreetmap.org/', {'q': '[bakery] seattle wa', 'format': 'json'})
print(response.status_code)  # 200 means it worked.
data = response.json()
print(type(data))

Do this: Go to http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa&format=json to see the same query in JSON format.


[
  {
    "place_id":"21583441",
    "licence":"Data © OpenStreetMap contributors, ODbL 1.0. http:\/\/www.openstreetmap.org\/copyright",
    "osm_type":"node",
    "osm_id":"2131716956",
    "boundingbox":[
      "47.6248735",
      "47.6249735",
      "-122.3207478",
      "-122.3206478"
    ],
    "lat":"47.6249235",
    "lon":"-122.3206978",
    "display_name":"The Confectional, 618, Broadway East, Eastlake, Capitol Hill, Seattle, King County, Washington, 98102, United States of America",
    "class":"shop",
    "type":"bakery",
    "importance":0.201,
    "icon":"http:\/\/nominatim.openstreetmap.org\/images\/mapicons\/shopping_bakery.p.20.png"
  }
]


Let's break down each line:

  • response = requests.get('http://nominatim.openstreetmap.org/', {'q': '[bakery] seattle wa', 'format': 'json'}) This is the most important line! Here, we "get" information from the web server. Note that we pass the URL up to the "?" character as the first argument. Compare the dictionary passed as the second argument to the query we did above in our browser. How do they differ? How are they the same?

  • print(response.status_code) the response is a Python object that contains the actual contents of the web page as well as some status information. Here, we're getting the status_code, which tells us whether the call succeeded. 200 is "good", and you will sometimes see 404 for "not found" or 500 for "server error".
  • print(response.content) this wasn't above, but try it anyway. response.content contains the raw contents of the webpage as bytes; response.text gives the same contents decoded as a string.
  • data = response.json() response.json() tries to convert the content to a Python list or dictionary if the content was stored in JSON format. (What happens if the content wasn't JSON?)
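The sketch below shows how to poke around in the parsed data and hints at an answer to the last question. To run without a network connection it uses the standard json module on plain strings in place of a live response. The helper name parse_places is made up for illustration, and the record is trimmed from the sample output above. As far as I know, response.json() on non-JSON content also raises an exception that is a kind of ValueError, but check the requests documentation for your version.

```python
import json


def parse_places(text):
    """Parse JSON text into Python objects, or return None if it isn't JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# A trimmed copy of the record from the sample output above.
good_text = (
    '[{"lat": "47.6249235", "lon": "-122.3206978", "type": "bakery", '
    '"display_name": "The Confectional, 618, Broadway East, Eastlake, '
    'Capitol Hill, Seattle, King County, Washington, 98102, '
    'United States of America"}]'
)
# What a server might send back when something went wrong (made up).
bad_text = "<html><body>Not found</body></html>"

places = parse_places(good_text)
for place in places:
    name = place["display_name"].split(",")[0]  # text before the first comma
    lat = float(place["lat"])                   # coordinates arrive as strings
    lon = float(place["lon"])
    print(name, place["type"], lat, lon)

print(parse_places(bad_text))  # None
```

Note that the result is a list of dictionaries, and that the latitude and longitude are strings until you convert them.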

FAQ

What if there's no API?
Sometimes, the only way to get data is to extract it from potentially messy HTML. This is called scraping, and Python has a library called BeautifulSoup to help with that.