Community Data Science Course (Spring 2023)/Week 5 coding challenges
There's actually nothing to download this time, so you can simply start with a fresh Jupyter notebook! Be sure to give it a nice descriptive name, as always.
Although there's nothing to download, you will likely want to look at the following resources when working through the first half of these challenges:
- Community Data Science Course (Spring 2023)/Week 5 lecture notes
- Notebooks from the Week 5 lecture including:
- Week 5 lecture notebook part 1 - Data Collection
- Week 5 lecture notebook part 2 - Data Processing
- Week 5 lecture notebook (prebaked): a combination of the notebooks above, with versions of the code that I wrote as notes for myself before the class.
- The Week 5 lecture video
#1 Wikipedia Page View API
- Identify a famous person who has been famous for at least a few years and that you have some personal interest in. Use the Wikimedia API to collect page view data from the English Wikipedia article on that person. Now use that data to generate a time-series visualization and include a link to it in your notebook.
- Identify 2 other language editions of Wikipedia that have articles on that person. Collect page view data on the article in those other languages and create a single visualization that shows how the dynamics are similar and/or different. (Note: My approach involved creating a TSV file with multiple columns.)
- Collect page view data on the articles about Marvel Comics and DC Comics in English Wikipedia. (If you'd rather replace these examples with some other comparison of popular rivals, that's just as good!)
- Which has more total page views in 2022?
- Can you draw a visualization in a spreadsheet that shows this? (Again, provide a link.)
- Were there any years when 2022's more popular page was instead the less popular of the two? How many and which ones?
- Were there any months when this reversal of relative popularity occurred? How many and which ones?
- How about any days? How many?
- I've made this file available, which includes a list of more than 100 Wikipedia articles about alternative rock bands from Washington state that I built from this category in Wikipedia.[*] It's a .jsonl file. Download the file (click "raw" and then save the file onto your drive). Now read it in, and request monthly page view data for all of the articles. If you need some help with loading it in, I've included some sample code at the bottom of this page.
- Once you've done this, sum up all of the page views from all of the pages and print out a TSV file with these total numbers.
- You know the routine by now! Now, make a time series graph of these numbers and include a link in your notebook.
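If you get stuck on the mechanics of the page view requests, here is a minimal sketch of one way to do it, not the only way and not necessarily how we did it in lecture. It assumes the Wikimedia REST per-article page view endpoint, and it uses Python's built-in urllib so that it runs without installing anything (requests works just as well). The function names, the User-Agent string, the default dates, and the choice of the "user" agent type are all my own assumptions for illustration.

```python
import csv
import io
import json
import urllib.request
from collections import defaultdict
from urllib.parse import quote

BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
# Placeholder identification string; replace it with your own contact details.
HEADERS = {"User-Agent": "cdsc-week5-notebook (your-email@example.com)"}


def pageviews_url(article, project="en.wikipedia", granularity="monthly",
                  start="20220101", end="20221231"):
    """Build the per-article page view URL for one article title."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE_URL}/{project}/all-access/user/{title}/{granularity}/{start}/{end}"


def fetch_items(article, **kwargs):
    """Request page views for one article and return the list under 'items'."""
    request = urllib.request.Request(pageviews_url(article, **kwargs), headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["items"]


def total_views_by_timestamp(item_lists):
    """Sum the 'views' field across many articles, keyed by timestamp."""
    totals = defaultdict(int)
    for items in item_lists:
        for item in items:
            totals[item["timestamp"]] += item["views"]
    return dict(sorted(totals.items()))


def write_tsv(totals, out_file):
    """Write one header row, then one timestamp/views row per line."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(["timestamp", "views"])
    for timestamp, views in totals.items():
        writer.writerow([timestamp, views])
```

To use it, you would call fetch_items() once per article title, collect the results in a list, pass that list to total_views_by_timestamp(), and hand the result to write_tsv() along with a file you opened for writing. Changing project to another language edition (for example "es.wikipedia") or granularity to "daily" covers the other questions above.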
#2 Starting on your projects
If you are planning on collecting data from Reddit, please look into using the Pushshift API instead of the default Reddit API. The Pushshift API is not as up-to-date but it is targeted toward data scientists, not app-makers, and is likely much better suited to our needs in the class. That said, take a look at both!
In this section, you will take your first steps towards working with your project API. Many of these questions will not involve code, so just mark down your answers in cells in your notebook.
One very useful trick is to convert cells into "markdown" mode. You can do this in the menu with Cell→Cell Type→Markdown, or you can just press m when the cell is selected but not being edited (press Esc first if you are editing, to switch out of edit mode). Pressing y turns it back into code. Markdown is just normal text, but if you want to do fancier stuff like links or formatting you can look at this Markdown Cheat Sheet.
Feel free to document any findings you think might be useful as you continue to work on your project; you might thank yourself later!
- Identify an API you will (or might!) want to use for your project.
- Find documentation for that API and include links in your notebook.
- What are the API endpoints you plan to use? What are the parameters you will need to use at that endpoint?
- Is there a Python module that helps make contact with the API? (See if you can find example code on how to use it.)
- If so, download it, install it, and import it into your notebook.
- Does the API require authentication? Does it need to be approved?
- If so, sign up for a developer account and get your keys. (Do this early because it often takes time for these accounts to be approved.)
- Does the API list rate limits? Does it make any requests about how you should use it?
- Make a single API call, either directly using requests or using the Python module you found. It doesn't matter for what. The goal is just that you can get something back.
- IMPORTANT: If you have included any API keys in your notebook, make a copy of your notebook and delete the cell where you include the keys before you upload the copy. We'll show you some tricks for hiding this information going forward.
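In the meantime, one common way to keep keys out of a notebook entirely is to read them from an environment variable. This is only a sketch of that general idea and may not be the trick we end up using in class. The function name and the variable name MY_API_KEY are hypothetical; use whatever name fits your API.

```python
import os


def load_api_key(name="MY_API_KEY"):
    """Return the API key stored in the environment variable called `name`."""
    key = os.environ.get(name)
    if not key:
        # Fail loudly so a missing key is noticed before any request is made.
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return key
```

The variable has to be set before you start Jupyter (or set in a cell that you delete before sharing), so the key itself never appears in the uploaded notebook.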
[*] You will probably not be shocked to hear that I collected this data from an API! The code I used to grab that data from the PetScan API is in this Jupyter notebook on GitHub.
If you just want to read in the file, remember that it's just a JSONL file, so you can modify the code from the lecture and it should work (e.g., something with open() and the .readlines() method associated with file variables).
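Here is a minimal sketch of that approach. It makes no assumptions about which keys each line contains, so print the first item and look at it before you do anything else. The function name is my own, and the file name in the usage comment is a placeholder for whatever you called the file when you saved it.

```python
import json


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as jsonl_file:
        return [json.loads(line) for line in jsonl_file.readlines() if line.strip()]


# Example usage (placeholder file name):
# bands = read_jsonl("bands.jsonl")
# print(bands[0])
```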