Community Data Science Course (Spring 2019)/Day 5 Notes
== '''Downloading data from the internet''' ==

'''API (Application Programming Interface)''': a structured way for two programs to communicate. Think of it like a contract or a secret handshake. Examples:

* The Twitter API describes how to read tweets, write tweets, and follow people. See details here: https://dev.twitter.com/
* Yelp has an API described here: https://www.yelp.com/developers
* Zillow's API: https://www.zillow.com/howto/api/APIOverview.htm

What to look for when looking at an API:

# Where is the documentation?
# What kinds of information can I request?
# How do I request information from this API?
# Are there any rate limits or restrictions on use? For instance, Twitter doesn't want you downloading tweets in bulk, and Zillow forbids storing bulk results. (Why?)
# Is there a Python package that will help me? For instance, Twitter has a great Python package called tweepy that simplifies access.

'''Example'''

We're going to spend today looking at Open Street Map's API, called [http://nominatim.openstreetmap.org/ Nominatim].

'''Structured data and JSON'''

* HTML is the markup language your browser uses to display information.

'''Do this:''' Go to <code>http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa</code> (you'll want to copy the whole thing) and view source to see the raw HTML. See if you can find the ''structured data'' embedded in there somewhere. It's there, but it's often difficult to teach computers to find it. Sometimes the only way to get data is to extract it from potentially messy HTML. This is called ''scraping'', and Python has a library called BeautifulSoup to help with that.

Often, data providers make it easier for computers to extract information through their API by providing it in a structured format.

'''Do this:''' Go to <code>http://nominatim.openstreetmap.org/?q=[bakery]+seattle+wa&format=json</code> to see the same query in JSON format.

'''JSON''': JavaScript Object Notation.
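To see how JSON maps onto Python values, here's a minimal sketch using the standard-library <code>json</code> module. The string below is hand-written for illustration, shaped like a trimmed-down Nominatim result rather than copied from a real response:

```python
import json  # standard-library JSON parser

# A hand-written string shaped like a (trimmed) Nominatim result.
raw = '''
[
  {
    "display_name": "The Confectional, 618, Broadway East, Seattle",
    "lat": "47.6249235",
    "lon": "-122.3206978",
    "type": "bakery"
  }
]
'''

data = json.loads(raw)        # JSON text -> Python objects
print(type(data))             # the JSON array becomes a Python list
print(type(data[0]))          # each JSON object becomes a Python dict
print(data[0]["type"])        # values are looked up by key, like any dict
print(float(data[0]["lat"]))  # lat/lon arrive as strings; convert before math
```

Notice that the numbers in <code>lat</code> and <code>lon</code> come back as strings, so you have to convert them with <code>float()</code> before doing arithmetic.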
JSON data looks like Python lists and dictionaries, and we'll see that it's easy to turn it into a Python variable that is a list or dictionary. Here's a sample:

 [
   {
     "place_id": "21583441",
     "licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright",
     "osm_type": "node",
     "osm_id": "2131716956",
     "boundingbox": ["47.6248735", "47.6249735", "-122.3207478", "-122.3206478"],
     "lat": "47.6249235",
     "lon": "-122.3206978",
     "display_name": "The Confectional, 618, Broadway East, Eastlake, Capitol Hill, Seattle, King County, Washington, 98102, United States of America",
     "class": "shop",
     "type": "bakery",
     "importance": 0.201,
     "icon": "http://nominatim.openstreetmap.org/images/mapicons/shopping_bakery.p.20.png"
   }
 ]

'''Do this:''' Copy the JSON output from the query above into the JSON parser at https://jsonformatter.curiousconcept.com/. What is the structure of this document?

'''The python requests library'''

We can use the following program to download JSON data from Nominatim:

 import requests  # this imports a new package called requests
 
 response = requests.get('http://nominatim.openstreetmap.org/',
                         {'q': '[bakery] seattle wa', 'format': 'json'})
 print(response.status_code)  # 200 means it worked.
 data = response.json()
 print(type(data))

Here, we make use of the requests library, which has excellent documentation [http://docs.python-requests.org/en/latest/api/ here]. Let's break down each line:

* <code>import requests</code> imports the library so we can use it.
* <code>response = requests.get('http://nominatim.openstreetmap.org/', {'q': '[bakery] seattle wa', 'format': 'json'})</code> This is the most important line! Here, we "get" information from the web server. Note that we pass the URL up to the "?" character as the first argument. Compare the dictionary in the second argument to the query we did above in our browser. How do they differ? How are they the same?
* <code>print(response.status_code)</code> The response is a Python object that contains the actual contents of the web page as well as some status information. Here, we're getting the status_code, which tells us whether the call succeeded. 200 is "good"; you will sometimes see 404 for "not found" or 500 for "server error".
* <code>print(response.content)</code> This wasn't in the program above, but try it anyway. <code>response.content</code> contains the raw contents of the web page as a string.
* <code>data = response.json()</code> <code>response.json()</code> tries to convert the string content to a Python list or dictionary if the content was stored in JSON format. (What happens if the content wasn't JSON?)
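To answer that last question: when the body isn't valid JSON, parsing fails with an exception. In the requests library this surfaces as a <code>ValueError</code> (<code>json.JSONDecodeError</code> is a subclass of it). Here's a sketch using <code>json.loads</code> directly, which is what <code>response.json()</code> relies on, so it runs without any network access; the HTML string is made up to stand in for a server error page:

```python
import json

# If the server sent something other than JSON (say, an HTML error
# page), parsing it fails. json.JSONDecodeError is a subclass of
# ValueError, so catching ValueError covers it.
try:
    json.loads("<html>This is not JSON</html>")
except ValueError as err:
    print("Could not parse as JSON:", err)
```

This is why it's a good habit to check <code>response.status_code</code> before calling <code>response.json()</code>: error pages are usually HTML, not JSON.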