Community Data Science Course (Spring 2023)/Week 6 coding challenges

This week there are three sets of questions. To answer the first two, you'll want to work closely from the notebooks I talked through in class during the Community Data Science Course (Spring 2023)/Week 6 lecture. The good news is that answering these will mostly involve modifying or adding small amounts of code (maybe even code we've written in previous assignments!) to those notebooks.

Feel free to use spreadsheets for any part of this that you can, but be sure to share links to the spreadsheets in the same way you've been doing.

Also, split your work up into multiple notebooks but name them all in a clear and consistent way. For example, these might be good names that keep things organized:

mako-01-mediawiki_1.ipynb
mako-02-mediawiki_2.ipynb
mako-03-yelp_1.ipynb
mako-05-final_project.ipynb

This approach will also be very helpful when you upload your assignment to github!

#1 MediaWiki API

Identify a movie, television show, video game, or other media property that has both (a) 5 or more related articles on Wikipedia and (b) 5 or more articles on the same topic on a Fandom.com website. Any large entertainment franchise will definitely work, but feel free to get creative! For example, you might choose 5 Wikipedia articles about the anime Naruto and 5 articles (pages) from the naruto.fandom.com site.

  1. First, modify the code from the first set of notebooks I used in the Community Data Science Course (Spring 2023)/Week 6 lecture to download data (and metadata) about revisions to the 5 articles you chose from Wikipedia (a minimal sketch of one way to make these revision queries appears after this list). Be ready to share:
    1. (i) what proportion of those edits were made by users without accounts
    2. (ii) what proportion of those edits were marked as "minor", and
    3. (iii) make and share a visualization of the total number of edits across those 5 articles over time (I didn't do this in class, but the TSV file I made would allow this).
  2. Now grab revision/edit data for the comparable set of 5 articles you chose from the Fandom.com wiki you identified. (Hint: Your Wikipedia work will give you lots of clues here: for example, the Fandom API endpoint for The Wire is https://thewire.fandom.com/api.php and, as I said in class, the Fandom API is the same as the Wikipedia API.) Produce answers to the same three questions (i, ii, and iii) above, but using this dataset.
  3. Finally, choose either your Wikipedia or Fandom dataset as the data source for a visualization that shows how each of those articles has grown in length (as measured in characters or "bytes") over time. (Hint: you'll need to return "size" as one of the revision properties (rvprop) if you are not doing it already.)
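
If you're not sure where to start, here is a minimal sketch of the kind of revision query the first two items describe. It is not the exact code from the lecture notebooks: the endpoint and article titles are placeholders you would swap for your own choices, and the same function works for your Fandom wiki once you change the endpoint.

  import requests

  # Placeholder values: swap in your own article titles and, for part 2,
  # the api.php endpoint of the Fandom wiki you chose.
  ENDPOINT = "https://en.wikipedia.org/w/api.php"   # e.g. "https://thewire.fandom.com/api.php"
  TITLES = ["The Wire", "Omar Little"]              # your 5 articles go here

  def get_revisions(title, endpoint=ENDPOINT):
      """Download metadata for every revision of one article, following API continuation."""
      revisions = []
      params = {
          "action": "query",
          "prop": "revisions",
          "titles": title,
          # "size" is the revision property you need for the article-growth visualization
          "rvprop": "ids|timestamp|user|size|flags",
          "rvlimit": 500,
          "format": "json",
          "formatversion": 2,
      }
      while True:
          data = requests.get(endpoint, params=params).json()
          revisions.extend(data["query"]["pages"][0].get("revisions", []))
          if "continue" not in data:        # no more batches of revisions to fetch
              break
          params.update(data["continue"])   # carries rvcontinue into the next request
      return revisions

  for title in TITLES:
      revs = get_revisions(title)
      anon = sum(1 for r in revs if r.get("anon"))    # edits from users without accounts
      minor = sum(1 for r in revs if r.get("minor"))  # edits marked as "minor"
      print(title, len(revs), round(anon / len(revs), 3), round(minor / len(revs), 3))

From there, the timestamp and size values attached to each revision are what you would load into a spreadsheet, TSV file, or pandas DataFrame to build the edits-over-time and article-growth visualizations.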

#2 Yelp API

  1. Get set up on the Yelp Fusion API (https://fusion.yelp.com/). I've put some details on how to do that on the Yelp Authentication setup page, which will likely be very useful!
  2. Install the yelpapi module, which has both a page on the Python Package Index (PyPI) website (https://pypi.org/project/yelpapi/) and a Github page with some documentation (https://github.com/lanl/yelpapi). As I said in class, you can do this either by (a) opening a terminal on your system and running pip install yelpapi or (b) running %pip install yelpapi in a cell of your Jupyter notebook. Reach out on Teams or in open lab sessions if you run into trouble.
  3. Create a new .py file (e.g., I called mine yelp_authentication.py) in the same directory as your Yelp notebooks and add your API key to it. Then use the import command to use your API key in a notebook without having the key itself visible in the notebook! (A minimal sketch of this pattern appears after this list.)
  4. Once you've done this, use your Yelp data collection notebook to grab a list of 50 businesses of any kind (your choice!) in any city (again, your choice!) using Yelp and the yelpapi module. This should be easy if you modify the notebook from the Community Data Science Course (Spring 2023)/Week 6 lecture.
  5. Once you have done this, add some code so that you save the "raw" JSON output to a .json or .jsonl file (whichever is appropriate).
  6. Now create a second notebook that opens up that file, reads the data, and outputs a TSV file with the name of the business, the average rating, and at least three other pieces of metadata that are available in the Yelp API.
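
In case it helps, here is a minimal sketch of what steps 3 through 5 might look like. It assumes (hypothetically) that your key file is named yelp_authentication.py and defines a variable called API_KEY; the search term, location, and file names are placeholders too.

  import json
  from yelpapi import YelpAPI

  # The key lives in yelp_authentication.py (a single line like: API_KEY = "...")
  # so the notebook itself never shows the key.
  from yelp_authentication import API_KEY

  yelp = YelpAPI(API_KEY)

  # One business search call can return up to 50 results at a time.
  results = yelp.search_query(term="coffee", location="Seattle, WA", limit=50)

  # Save the "raw" JSON output: one business per line makes a tidy .jsonl file.
  with open("yelp_businesses.jsonl", "w", encoding="utf-8") as outfile:
      for business in results["businesses"]:
          outfile.write(json.dumps(business) + "\n")

The second notebook (step 6) then only needs to read that file back and write a TSV. The columns below are just examples of fields that appear in Yelp business search results; pick whichever pieces of metadata you like.

  import csv
  import json

  with open("yelp_businesses.jsonl", "r", encoding="utf-8") as infile:
      businesses = [json.loads(line) for line in infile]

  with open("yelp_businesses.tsv", "w", encoding="utf-8", newline="") as outfile:
      writer = csv.writer(outfile, delimiter="\t")
      writer.writerow(["name", "rating", "review_count", "price", "city"])
      for b in businesses:
          writer.writerow([
              b["name"],
              b.get("rating"),
              b.get("review_count"),
              b.get("price", ""),                    # not every business reports a price level
              b.get("location", {}).get("city", ""),
          ])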

#3 Progress on your final project

Answer these questions using markdown cells in a notebook. Check out the notes I left about this in the Community Data Science Course (Spring 2023)/Week 5 coding challenges (the third paragraph) or think back to Kaylea's comments during the Week 6 assignment recap video.

Let us know:

  1. What is your proposed unit of analysis? In other words, if/when you end up building something like a spreadsheet, what are rows going to represent?
  2. What specific measures associated with each unit do you want to collect? In other words, what are the columns in the spreadsheet going to be?
  3. Tell us what you've learned about the API:
    1. Are you going to be able to get the data you want with one API call or many? If more than one, how many?
    2. If it's more than one call, how will you know when you have collected all your data?
  4. Make one API call and save the output to your disk in either a .json or .jsonl file (a minimal sketch appears after this list). Be sure to share the code you used to do this. Be sure not to include any API keys in your notebook!
  5. How big is the JSON file that you saved on your disk (i.e., in bytes or kilobytes)? If it is not your full dataset, what is your estimate for how much larger the full dataset will be? How big will the total dataset be? Is that a problem?
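
For items 4 and 5, a rough sketch of saving one call's output and checking the size of the resulting file might look like this (the response dictionary below is just a stand-in for whatever your API actually returns):

  import json
  import os

  # Stand-in for the parsed JSON returned by your one API call.
  response = {"example": "replace this with the result of your API call"}

  with open("final_project_sample.json", "w", encoding="utf-8") as outfile:
      json.dump(response, outfile)

  # Report the size of the saved file in bytes and kilobytes.
  size_bytes = os.path.getsize("final_project_sample.json")
  print(f"{size_bytes} bytes ({size_bytes / 1024:.1f} KB)")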