Editing Human Centered Data Science (Fall 2019)/Assignments

From CommunityData

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then publish the changes below to finish undoing the edit.

Latest revision Your text
Line 88: Line 88:
[[File:En-wikipedia_traffic_200801-201709_thompson.png|300px|thumb|Your assignment is to create a graph that looks a lot like this one, starting from scratch, and following best practices for reproducible research.]]
[[File:En-wikipedia_traffic_200801-201709_thompson.png|300px|thumb|Your assignment is to create a graph that looks a lot like this one, starting from scratch, and following best practices for reproducible research.]]


The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through August 30 2019. All analysis should be performed in a single Jupyter notebook and all data, documentation, and code should be published in a single GitHub repository.
The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through September 30 2018. All analysis should be performed in a single Jupyter notebook and all data, documentation, and code should be published in a single GitHub repository.


The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.
The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.
Line 95: Line 95:


==== Step 0: Read about reproducibility ====
==== Step 0: Read about reproducibility ====
Review Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research'' University of California Press, 2018.  
Read Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research'' University of California Press, 2018.  


==== Step 1: Data acquisition ====
==== Step 1: Data acquisition ====
In order to measure Wikipedia traffic from 2008-2019, you will need to collect data from two different API endpoints, the Legacy Pagecounts API and the Pageviews API.
In order to measure Wikipedia traffic from 2008-2018, you will need to collect data from two different API endpoints, the Legacy Pagecounts API and the Pageviews API.


# The '''Legacy Pagecounts API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts documentation], [https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end endpoint]) provides access to desktop and mobile traffic data from December 2007 through July 2016.
# The '''Legacy Pagecounts API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts documentation], [https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end endpoint]) provides access to desktop and mobile traffic data from December 2007 through July 2016.
#The '''Pageviews API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews documentation], [https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end endpoint]) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.
#The '''Pageviews API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews documentation], [https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end endpoint]) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.


For each API, you will need to collect data ''for all months where data is available'' and then save the raw results into 5 separate JSON source data files (one file per API query type) before continuing to step 2.
For each API, you will need to collect data ''for all months where data is avaiable'' and then save the raw results into 5 separate JSON source data files (one file per API query type) before continuing to step 2.


To get you started, you can refer to this example Notebook that contains sample code for API calls ([http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb view the notebook], [http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb?format=raw download the notebook]). This sample code is [https://creativecommons.org/share-your-work/public-domain/cc0/ licensed CC0] so feel free to re-use any of the code in that notebook without attribution.
To get you started, you can refer to this example Notebook that contains sample code for API calls ([http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb view the notebook], [http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb?format=raw download the notebook]). This sample code is [https://creativecommons.org/share-your-work/public-domain/cc0/ licensed CC0] so feel free to re-use any of the code in that notebook without attribution.
Line 111: Line 111:


For example, your filename for monthly page views on desktop should be:
For example, your filename for monthly page views on desktop should be:
  pagecounts_desktop-site_200712-201908.json
  pagecounts_desktop-site_200712-201809.json


'''Important notes:'''
'''Important notes:'''
Line 167: Line 167:
<!-- Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the API does not provide monthly traffic data going back to 2010. -->
<!-- Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the API does not provide monthly traffic data going back to 2010. -->


In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a legend and a title. You must also generate a .png or .jpeg formatted image of your final graph.  
In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a key and a title. You must also generate a .png or .jpeg formatted image of your final graph.  


If possible please graph the data in Python or R, in your notebook, rather than using an external application.  
You should graph the data in Python or R, in your notebook.  


<!-- If you decide to use Google Sheet or some other open, public data visualization platform to build your graph, link to it in the README, and make sure sharing settings allow anyone who clicks on the link to view the graph and download the data! -->
<!-- If you decide to use Google Sheet or some other open, public data visualization platform to build your graph, link to it in the README, and make sure sharing settings allow anyone who clicks on the link to view the graph and download the data! -->


==== Step 4: Documentation ====
==== Step 4: Documentation ====
Follow best practices for documenting your project, as outlined in the lecture slides and in Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research''.  
Follow best practices for documenting your project, as outlined in the Week 3 slides and in Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research''.  


Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.
Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.
Line 191: Line 191:
#Create the data-512-a1 repository on GitHub w/ your code and data.
#Create the data-512-a1 repository on GitHub w/ your code and data.
#Complete and add your README and LICENSE file.
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1319253/assignments/4937082
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1244514/assignments/4376106


==== Required deliverables ====
==== Required deliverables ====
Please note that all contributions to CommunityData are considered to be released under the Attribution-Share Alike 3.0 Unported (see CommunityData:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!

To protect the wiki against automated edit spam, we kindly ask you to solve the following CAPTCHA:

Cancel Editing help (opens in new window)