
=== A1: Data curation ===
[[File:En-wikipedia_traffic_200801-201709_thompson.png|300px|thumb|Your assignment is to create a graph that looks a lot like this one, starting from scratch, and following best practices for reproducible research.]]

The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1, 2008 through September 30, 2018. All analysis should be performed in a single Jupyter notebook, and all data, documentation, and code should be published in a single GitHub repository.

The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.

For this assignment, you will combine data about Wikipedia page traffic from two different [https://www.mediawiki.org/wiki/REST_API Wikimedia REST API] endpoints into a single dataset, perform some simple data processing steps on the data, and then analyze that data.

==== Step 0: Read about reproducibility ====
Read Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research'' (University of California Press, 2018).

==== Step 1: Data acquisition ====
To measure Wikipedia traffic from 2008 through 2018, you will need to collect data from two different API endpoints: the Legacy Pagecounts API and the Pageviews API.

# The '''Legacy Pagecounts API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts documentation], [https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end endpoint]) provides access to desktop and mobile traffic data from December 2007 through July 2016.
#The '''Pageviews API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews documentation], [https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end endpoint]) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.

For each API, you will need to collect data ''for all months where data is available'' and then save the raw results into five separate JSON source data files (one file per API query type) before continuing to Step 2.

To get you started, you can refer to this example Notebook that contains sample code for API calls ([http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb view the notebook], [http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb?format=raw download the notebook]). This sample code is [https://creativecommons.org/share-your-work/public-domain/cc0/ licensed CC0] so feel free to re-use any of the code in that notebook without attribution.
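
Independently of the example notebook, the following is a minimal sketch of how the five query URLs could be built and requested using only the Python standard library. The URL templates follow the two endpoint pages linked above. The helper names (<tt>build_url</tt>, <tt>build_urls</tt>, <tt>fetch_json</tt>), the <tt>User-Agent</tt> string, and the choice of five access types are illustrative assumptions, not a required implementation. Check the API documentation for the exact meaning of the start and end timestamps.

```python
import json
import urllib.request

# URL templates taken from the two endpoint pages linked above.
LEGACY_URL = ("https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/"
              "aggregate/{project}/{access}/{granularity}/{start}/{end}")
PAGEVIEWS_URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
                 "aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}")

# Assumed set of five query types: one saved JSON file per entry.
QUERY_TYPES = [
    (LEGACY_URL, {"access": "desktop-site"}),
    (LEGACY_URL, {"access": "mobile-site"}),
    (PAGEVIEWS_URL, {"access": "desktop", "agent": "user"}),
    (PAGEVIEWS_URL, {"access": "mobile-app", "agent": "user"}),
    (PAGEVIEWS_URL, {"access": "mobile-web", "agent": "user"}),
]

# Hypothetical contact string; replace it with your own details.
REQUEST_HEADERS = {"User-Agent": "data-512-a1 (your_email@example.edu)"}


def build_url(template, **params):
    """Fill one endpoint template with its path parameters."""
    return template.format(**params)


def build_urls(start, end, project="en.wikipedia.org", granularity="monthly"):
    """Return one URL per query type; timestamps use the YYYYMMDDHH format."""
    return [
        build_url(template, project=project, granularity=granularity,
                  start=start, end=end, **extra)
        for template, extra in QUERY_TYPES
    ]


def fetch_json(url, headers=REQUEST_HEADERS):
    """Request a URL and return the decoded JSON body as a Python dict."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling <tt>fetch_json(url)</tt> on each result of <tt>build_urls(...)</tt> would give one raw response per query type. Note that only the Pageviews template has an <tt>agent</tt> parameter, which is why <tt>agent=user</tt> appears only in the last three entries.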

Your JSON-formatted source data files must contain the complete and un-edited output of your API queries. The naming convention for the source data files is:
 apiname_accesstype_firstmonth-lastmonth.json

For example, your filename for monthly page views on desktop should be:
 pagecounts_desktop-site_200712-201809.json
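
As a rough sketch of the naming convention, the helpers below build a source file name and write an API response to disk without altering its content. The function names and the <tt>directory</tt> argument are hypothetical, and the response is assumed to be a Python dict already parsed from the API's JSON output.

```python
import json
from pathlib import Path


def source_filename(apiname, accesstype, firstmonth, lastmonth):
    """Build a name of the form apiname_accesstype_firstmonth-lastmonth.json."""
    return f"{apiname}_{accesstype}_{firstmonth}-{lastmonth}.json"


def save_raw_response(response, directory, apiname, accesstype,
                      firstmonth, lastmonth):
    """Write the parsed API response to disk unchanged and return the path."""
    path = Path(directory) / source_filename(
        apiname, accesstype, firstmonth, lastmonth)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(response, handle)
    return path
```

For instance, <tt>source_filename("pagecounts", "desktop-site", "200712", "201809")</tt> reproduces the example file name given above.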

'''Important notes:'''
# As much as possible, we're interested in ''organic'' (user) traffic, as opposed to traffic by web crawlers or spiders. The Pageviews API (but not the Legacy Pagecounts API) allows you to filter by <tt>agent=user</tt>. You should do that.
# There is about one year of overlapping traffic data between the two APIs (July 2015 through July 2016). You need to gather, and later graph, data from both APIs for this period of time.

==== Step 2: Data processing ====
You will need to perform a series of processing steps on these data files to prepare them for analysis. These steps must be followed exactly. At the end of this step, you will have a single CSV-formatted data file that can be used in your analysis (Step 3) with no significant additional processing.

* For data collected from the Pageviews API, combine the monthly values for <tt>mobile-app</tt> and <tt>mobile-web</tt> to create a total mobile traffic count for each month.
* For all data, separate the value of <tt>timestamp</tt> into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH).

Combine all data into a single CSV file with the following headers:

{|class="wikitable"
|-
! Column
!Value
|-
|year
|YYYY
|-
| month
|MM
|-
| pagecount_all_views
|num_views
|-
| pagecount_desktop_views
|num_views
|-
|pagecount_mobile_views
|num_views
|-
|pageview_all_views
|num_views
|-
|pageview_desktop_views
|num_views
|-
|pageview_mobile_views
|num_views
|}

If no traffic data exists for a given access method (e.g. <tt>desktop-site, mobile-app</tt>) in a given month, the value for that column and month should be listed as 0. For example, all values of <tt>pagecount_mobile_views</tt> for months before October 2014 should be 0, because mobile traffic data is not available before that month.
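
The processing steps above can be sketched roughly as follows. This sketch assumes that each raw response has an <tt>items</tt> list whose entries carry a <tt>timestamp</tt> in YYYYMMDDHH form plus a value field (believed to be <tt>count</tt> for the Legacy Pagecounts API and <tt>views</tt> for the Pageviews API; verify this against your saved JSON). It also assumes that the two "all" columns can be computed as desktop plus mobile. All function names are illustrative.

```python
import csv

CSV_COLUMNS = [
    "year", "month",
    "pagecount_all_views", "pagecount_desktop_views", "pagecount_mobile_views",
    "pageview_all_views", "pageview_desktop_views", "pageview_mobile_views",
]


def monthly_totals(response, value_key):
    """Map 'YYYYMM' to the value in one raw response, dropping DDHH."""
    totals = {}
    for item in response["items"]:
        month = item["timestamp"][:6]
        totals[month] = totals.get(month, 0) + item[value_key]
    return totals


def combine_mobile(app, web):
    """Add mobile-app and mobile-web values month by month."""
    return {m: app.get(m, 0) + web.get(m, 0) for m in set(app) | set(web)}


def build_rows(pagecount_desktop, pagecount_mobile,
               pageview_desktop, pageview_mobile):
    """Build one row per month, using 0 where a series has no data."""
    months = sorted(set(pagecount_desktop) | set(pagecount_mobile)
                    | set(pageview_desktop) | set(pageview_mobile))
    rows = []
    for month in months:
        pc_desktop = pagecount_desktop.get(month, 0)
        pc_mobile = pagecount_mobile.get(month, 0)
        pv_desktop = pageview_desktop.get(month, 0)
        pv_mobile = pageview_mobile.get(month, 0)
        rows.append({
            "year": month[:4],
            "month": month[4:6],
            # Assumption: "all" traffic is the sum of desktop and mobile.
            "pagecount_all_views": pc_desktop + pc_mobile,
            "pagecount_desktop_views": pc_desktop,
            "pagecount_mobile_views": pc_mobile,
            "pageview_all_views": pv_desktop + pv_mobile,
            "pageview_desktop_views": pv_desktop,
            "pageview_mobile_views": pv_mobile,
        })
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file with the required header order."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=CSV_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Keying every series by a YYYYMM string keeps the join across the two APIs simple: a month that is missing from one series falls back to 0 through <tt>dict.get</tt>.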

The final data file should be named: 
 en-wikipedia_traffic_200712-201809.csv

==== Step 3: Analysis ====
<!-- [[File:PlotPageviewsEN_overlap.png|200px|thumb|A sample visualization of pageview traffic data.]] -->
For this assignment, the "analysis" will be fairly straightforward: you will visualize the dataset you have created as a time series graph. 

Your visualization will track three traffic metrics: mobile traffic, desktop traffic, and all traffic (mobile + desktop).

<!-- Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the API does not provide monthly traffic data going back to 2010. -->

In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a key and a title. You must also generate a .png or .jpeg formatted image of your final graph. 

You should graph the data in Python or R, in your notebook. 
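
One possible way to draw the time series in Python is sketched below. It assumes that matplotlib is installed and that the CSV file uses the column names from Step 2. The function names, labels, colors, and the choice to hide months with a value of 0 (so that missing data is not drawn as real traffic) are illustrative decisions, not requirements.

```python
import csv
import datetime
import math

# (column name, legend label, matplotlib format string); styling is arbitrary.
SERIES = [
    ("pagecount_all_views", "Pagecounts: all", "k--"),
    ("pagecount_desktop_views", "Pagecounts: desktop", "g--"),
    ("pagecount_mobile_views", "Pagecounts: mobile", "b--"),
    ("pageview_all_views", "Pageviews: all", "k-"),
    ("pageview_desktop_views", "Pageviews: desktop", "g-"),
    ("pageview_mobile_views", "Pageviews: mobile", "b-"),
]


def load_series(csv_path, column):
    """Return (dates, values) for one column; zeros become NaN (not drawn)."""
    dates, values = [], []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            dates.append(datetime.date(int(row["year"]), int(row["month"]), 1))
            views = int(row[column])
            values.append(views if views > 0 else math.nan)
    return dates, values


def plot_traffic(csv_path, image_path):
    """Plot all six series in millions of views and save the figure."""
    # Imported here so load_series can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 5))
    for column, label, style in SERIES:
        dates, values = load_series(csv_path, column)
        ax.plot(dates, [v / 1e6 for v in values], style, label=label)
    ax.set_xlabel("Month")
    ax.set_ylabel("Page views (millions)")
    ax.set_title("Monthly page views on English Wikipedia")
    ax.legend()
    fig.savefig(image_path, dpi=150)
    plt.close(fig)
```

In a notebook, a call such as <tt>plot_traffic("en-wikipedia_traffic_200712-201809.csv", "traffic.png")</tt> would write the image file; the image file name here is only a placeholder.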

<!-- If you decide to use Google Sheet or some other open, public data visualization platform to build your graph, link to it in the README, and make sure sharing settings allow anyone who clicks on the link to view the graph and download the data! -->

==== Step 4: Documentation ====
Follow best practices for documenting your project, as outlined in the Week 3 slides and in Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research''. 

Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.

At minimum, your Jupyter Notebook should:
* Provide a short, clear description of every step in the acquisition, processing, and analysis of your data ''in full Markdown sentences'' (not just inline comments or docstrings)

At minimum, your README file should:
* Describe the goal of the project.
* List the license of the source data and a link to the Wikimedia Foundation REST API terms of use: https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions
* Link to all relevant API documentation
* Describe the values of all fields in your final data file.
* List any known issues or special considerations with the data that would be useful for another researcher to know. For example, you should note that data from the Pageviews API excludes spiders/crawlers, while data from the Legacy Pagecounts API does not.

==== Submission instructions ====
#Create the <tt>data-512-a1</tt> repository on GitHub with your code and data.
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1244514/assignments/4376106

==== Required deliverables ====
A directory in your GitHub repository called <tt>data-512-a1</tt> that contains the following files:
:# 5 source data files in JSON format that follow the specified naming convention.
:# 1 final data file in CSV format that follows the specified naming convention.
:# 1 Jupyter notebook named <tt>hcds-a1-data-curation</tt> that contains all code as well as information necessary to understand each programming step.
:# 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources.
:# 1 LICENSE file that contains an [https://opensource.org/licenses/MIT MIT LICENSE] for your code.
:# 1 .png or .jpeg image of your visualization.
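
Before submitting, a quick self-check along the following lines may help confirm that nothing is missing. This is only a sketch: the function name is hypothetical, the <tt>.ipynb</tt> extension on the notebook name is an assumption, and the check looks at file names only, not at file contents.

```python
from pathlib import Path


def missing_deliverables(directory):
    """Return a list of human-readable problems; empty if all files exist."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    problems = []

    json_count = sum(name.endswith(".json") for name in names)
    if json_count != 5:
        problems.append(f"expected 5 JSON source files, found {json_count}")
    if "en-wikipedia_traffic_200712-201809.csv" not in names:
        problems.append("final CSV data file not found")
    # Assumption: the notebook is saved with the usual .ipynb extension.
    if "hcds-a1-data-curation.ipynb" not in names:
        problems.append("Jupyter notebook not found")
    if not any(name in ("README.md", "README.txt") for name in names):
        problems.append("README file (.txt or .md) not found")
    if not any(name.startswith("LICENSE") for name in names):
        problems.append("LICENSE file not found")
    if not any(name.lower().endswith((".png", ".jpeg")) for name in names):
        problems.append("visualization image (.png or .jpeg) not found")
    return problems
```

An empty list from <tt>missing_deliverables("data-512-a1")</tt> suggests that all six kinds of file are present under the expected names.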

==== Helpful tips ====
* Read all instructions carefully before you begin
* Read all API documentation carefully before you begin
* Experiment with queries in the sandbox of the technical documentation for each API to familiarize yourself with the schema and the data
* Ask questions on Slack if you're unsure about anything
* When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"