HCDS (Fall 2017)/Assignments



Assignments comprise weekly in-class activities, weekly reading reflections, written assignments, and programming/data analysis assignments. Weekly in-class reading groups will discuss the assigned readings from the course, and students are expected to have read the material in advance. In-class activities each week are posted to Canvas and may require time outside of class to complete.

Unless otherwise noted, all assignments are due before 5pm on the day of the following week's class.

Unless otherwise noted, all assignments are individual assignments.

Assignment timeline

 * Assignments due every week
 * In-class activities - 2 points (weekly): In-class activity output posted to Canvas (group or individual)
 * Reading reflections - 2 points (weekly): Reading reflections posted to Canvas (individual)


 * Scheduled assignments
 * A1 - 5 points (due Week 4): Data curation (programming/analysis)
 * A2 - 10 points (due Week 5): Sources of bias in data (programming/analysis)
 * A3 - 10 points (due Week 7): Final project plan (written)
 * A4 - 10 points (due Week 9): Crowdwork self-ethnography (written)
 * A5 - 10 points (due Week 11): Final project presentation (oral, written)
 * A6 - 15 points (due by 11:59pm on Sunday, December 10): Final project report (programming/analysis, written)


Weekly in-class activities
Love it or hate it, teamwork is an integral part of data science practice (and work in general). During each class session, you will be asked to participate in one or more group activities. These activities may involve reading discussions, group brainstorming activities, collaborative coding or data analysis, working together on designs, or offering peer support.

In each class session, one in-class activity will have a graded deliverable that is due the next day. The sum of these deliverables constitutes your participation grade for the course. The deliverable is intended to be something that you complete, and ideally turn in, during class, but in rare cases it may involve some work after class. It could be as simple as a picture of a design sketch you made, or notes from a group brainstorm. When you and your group complete the assigned activity, follow the instructions below to submit the activity and get full credit.


 * Instructions
 * 1) Do the in-class activity
 * 2) Choose a group member to submit the deliverable
 * 3) Submit the deliverable via Canvas, in the format specified by the instructor, within 24 hours of class
 * 4) Make sure to list the full names of all group members in the Canvas post

Late deliverables will never be accepted, and everyone in the group will lose points. So make sure you choose someone reliable to turn the assignment in!

Weekly reading reflections
This course will introduce you to cutting-edge research and opinion from major thinkers in the domain of human-centered data science. By reading and writing about this material, you will have an opportunity to explore the complex intersections of technology, methodology, ethics, and social thought that characterize this budding field of research and practice. As a participant in the course, you are responsible for intellectually engaging with all assigned readings and developing an understanding of the ideas discussed in them.

This assignment is designed to encourage you to reflect on these readings (or in some cases, viewings or listenings) and make connections during our class discussions. To this end, you will be responsible for posting reading reflections every week of the quarter (except for week 1).

There will generally be multiple readings assigned each week. You are responsible for reading all of them. However, you only need to write a reflection on one reading per week. Unless your instructor specifies otherwise, you can choose which reading you would like to write your reflection about.

These reflections are meant to be brief but meaningful. Follow the instructions below, demonstrate that you engaged with the material, and turn the reflection in on time, and you will receive full credit. Late reading reflections will never be accepted.


 * Instructions
 * 1) Read all assigned readings.
 * 2) Select a reading to reflect on.
 * 3) In at least 2-3 full sentences, answer the question "How does this reading inform your understanding of human-centered data science?"
 * 4) Using full sentences, list at least 1 question that this reading raised in your mind.
 * 5) Post your reflection to Canvas before the next class session.

You are encouraged, but not required, to make connections between different readings (from the current week, or previous weeks) in your reflections.

Scheduled assignments
This section provides basic descriptions of all scheduled course assignments (everything you are graded on except for weekly in-class activities and reading reflections). The instructor will make specific rubrics and requirements for each of these assignments available on Canvas the day the homework is assigned.

A1: Data curation
The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from July 1 2008 through September 30 2017.

The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.

For this assignment, you will combine data about Wikipedia traffic from two different Wikimedia REST API endpoints into a single dataset, perform some simple data processing steps on the data, and then analyze that data.

Step 1: Data acquisition
In order to measure Wikipedia traffic from 2008-2017, you will need to collect data from two different API endpoints, the Pagecounts API and the Pageviews API.


 * 1) The legacy Pagecounts API (documentation, endpoint) provides access to desktop and mobile traffic data from January 2008 through July 2016.
 * 2) The Pageviews API (documentation, endpoint) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through September 2017.

You will need to collect data for all months from both APIs in a Jupyter Notebook and then save the raw results into 5 separate JSON source data files (one file per API query) before continuing to step 2.
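A minimal sketch of this acquisition step is shown here. It assumes the URL templates below match the two endpoints (they are written from general knowledge of the Wikimedia REST API, so check them against the linked documentation), and the helper names `build_url`, `fetch_json`, and `save_raw`, the contact string, and the start/end timestamps are all hypothetical.

```python
import json
import urllib.request

# Assumed URL templates for the two endpoints; verify them against the
# API documentation before relying on them.
PAGECOUNTS_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
    "{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def build_url(template, **params):
    """Fill one of the URL templates with the query parameters."""
    return template.format(**params)


def fetch_json(url, contact="your-email@example.org"):
    """Request a URL and return the decoded JSON body (needs network access)."""
    # The contact string is a placeholder sent as the User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": contact})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def save_raw(data, path):
    """Write the API response to a JSON file without changing its contents."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)


# Illustrative query: desktop traffic from the Pageviews API, user agents only.
# The timestamps use the YYYYMMDDHH form and are example values.
example_url = build_url(
    PAGEVIEWS_URL,
    project="en.wikipedia.org",
    access="desktop",
    agent="user",
    granularity="monthly",
    start="2015070100",
    end="2017100100",
)
print(example_url)
```

In a notebook you would call `fetch_json(example_url)` once per query and pass each result straight to `save_raw`, so that the saved files hold the API output before any processing.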

Your JSON-formatted source data file must contain the complete and un-edited output of your API queries. The naming convention for the source data files is: apiname_accesstype_firstmonth-lastmonth.json

For example, your filename for monthly page views on desktop should be: pagecounts_desktop-site_200801-201607.json
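One way to keep file names consistent is to build them from their parts. The helper name `source_filename` below is hypothetical; the pattern and the example values come from the naming convention described here.

```python
def source_filename(api_name, access_type, first_month, last_month):
    """Build a name of the form apiname_accesstype_firstmonth-lastmonth.json."""
    return f"{api_name}_{access_type}_{first_month}-{last_month}.json"


# Months are given as YYYYMM strings.
print(source_filename("pagecounts", "desktop-site", "200801", "201607"))
```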

Important notes:
 * 1) As much as possible, we're interested in organic (user) traffic, as opposed to traffic by web crawlers or spiders. The Pageviews API (but not the Pagecounts API) allows you to filter by agent=user. You should do that.
 * 2) There is a ~13 month period in which both APIs provide traffic data. You need to gather, and later graph, data from both APIs for this period of time.
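To see which months fall in the overlap, you can enumerate the months each API covers and intersect them. This is a small sketch that uses the date ranges stated for the two APIs; the helper name `month_range` is hypothetical.

```python
def month_range(first, last):
    """Return every month from first to last inclusive, as YYYYMM strings."""
    year, month = int(first[:4]), int(first[4:])
    months = []
    while f"{year:04d}{month:02d}" <= last:
        months.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


pagecounts_months = month_range("200801", "201607")  # legacy Pagecounts API
pageviews_months = month_range("201507", "201709")   # Pageviews API

# Months for which both APIs provide data.
overlap = sorted(set(pagecounts_months) & set(pageviews_months))
print(len(overlap), overlap[0], overlap[-1])
```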

Step 2: Data processing
You will need to perform a series of processing steps on these data files, and you must follow them exactly, in order to prepare the data for analysis. At the end of this step, you will have a single CSV-formatted data file that can be used in your analysis (step 3) with no significant additional processing.

Combine all data into a single CSV file, applying the following processing steps:
 * For data collected from the Pageviews API, combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month.
 * For all data, separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH).
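The two steps above can be sketched as follows. The sketch assumes each API record is a dictionary with a `timestamp` field in YYYYMMDDHH form and a numeric `views` field; check the field names against your own source files. The records and numbers are made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict


def split_timestamp(timestamp):
    """Split a YYYYMMDDHH timestamp into (YYYY, MM), discarding DDHH."""
    return timestamp[:4], timestamp[4:6]


def monthly_totals(items, value_key):
    """Sum the value_key field per (year, month) across all given records."""
    totals = defaultdict(int)
    for item in items:
        totals[split_timestamp(item["timestamp"])] += item[value_key]
    return dict(totals)


# Hypothetical records; real ones come from your JSON source data files.
mobile_app = [
    {"timestamp": "2015070100", "views": 100},
    {"timestamp": "2015080100", "views": 120},
]
mobile_web = [
    {"timestamp": "2015070100", "views": 900},
    {"timestamp": "2015080100", "views": 950},
]

# Total mobile traffic per month = mobile-app + mobile-web.
pageview_mobile = monthly_totals(mobile_app + mobile_web, "views")
print(pageview_mobile)
```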

For all months with 0 pageviews for a given access method (e.g. desktop-site, mobile-app), the value for that (column, month) should be listed as 0. So, for example, all values of pagecount_mobile_views for months before October 2014 should be 0, because mobile traffic data is not available before that month.

The final data file should be named: en-wikipedia_traffic_200801-201709.csv
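A minimal sketch of writing the combined file, filling in 0 wherever a month has no data for a column. Only `pagecount_mobile_views` is a column name taken from this page; the other column names, the helper name `write_traffic_csv`, and the sample numbers are hypothetical.

```python
import csv
import io

# Hypothetical column list; use the headers required by the assignment.
FIELDNAMES = ["year", "month", "pagecount_desktop_views", "pagecount_mobile_views"]


def write_traffic_csv(out, months, series):
    """Write one row per (year, month), using 0 for any missing value."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for year, month in months:
        row = {"year": year, "month": month}
        for column in FIELDNAMES[2:]:
            row[column] = series.get(column, {}).get((year, month), 0)
        writer.writerow(row)


# Made-up values: mobile data is missing for September 2014.
months = [("2014", "09"), ("2014", "10")]
series = {
    "pagecount_desktop_views": {("2014", "09"): 500, ("2014", "10"): 520},
    "pagecount_mobile_views": {("2014", "10"): 300},
}

buffer = io.StringIO()
write_traffic_csv(buffer, months, series)
csv_text = buffer.getvalue()
print(csv_text)
```

To write the real file, pass a file opened with `open("en-wikipedia_traffic_200801-201709.csv", "w", newline="")` instead of the in-memory buffer.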

Step 3: Analysis
For this assignment, the "analysis" will be fairly straightforward: you will visualize the dataset you have created as a time series graph.

Your visualization will track three traffic metrics: mobile traffic, desktop traffic, and all traffic (mobile + desktop).

Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the API does not provide monthly traffic data going back to 2010.

In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a key and a title. You must also generate a .png or .jpeg formatted image of your final graph.
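If you graph in Python, the three metrics can be prepared and plotted roughly as follows. This is a sketch, not a required approach: it assumes matplotlib is installed, and the row keys (`year`, `month`, `desktop`, `mobile`), the function names, and the label text are hypothetical.

```python
def build_series(rows):
    """Return month labels plus desktop, mobile, and combined traffic lists."""
    labels = [f"{row['year']}-{row['month']}" for row in rows]
    desktop = [row["desktop"] for row in rows]
    mobile = [row["mobile"] for row in rows]
    total = [d + m for d, m in zip(desktop, mobile)]  # all = mobile + desktop
    return labels, desktop, mobile, total


def plot_traffic(rows, path):
    """Draw the three series as a time series graph and save it as an image."""
    # Imported here so build_series can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    labels, desktop, mobile, total = build_series(rows)
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(labels, desktop, label="Desktop traffic")
    ax.plot(labels, mobile, label="Mobile traffic")
    ax.plot(labels, total, label="All traffic (mobile + desktop)")
    ax.set_title("Monthly page traffic on English Wikipedia")
    ax.set_xlabel("Month")
    ax.set_ylabel("Page views")
    ax.legend()  # the key
    fig.savefig(path)  # the extension (.png or .jpeg) sets the image format
    plt.close(fig)


# Made-up rows for illustration only.
sample_rows = [
    {"year": "2015", "month": "07", "desktop": 500, "mobile": 300},
    {"year": "2015", "month": "08", "desktop": 480, "mobile": 340},
]
print(build_series(sample_rows))
```

With the full dataset the month labels will crowd the x-axis, so you will likely want to thin out or rotate the tick labels.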

You may choose to graph the data in Python, in your notebook. If you decide to use Google Sheets or some other open, public data visualization platform to build your graph, link to it in the README, and make sure sharing settings allow anyone who clicks on the link to view the graph and download the data!

Step 4: Documentation
Follow best practices for documenting your project, as outlined in the Week 3 slides (LINK). Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.

At minimum, your Jupyter Notebook should:
 * Provide a short, clear description of every step in the acquisition, processing, and analysis of your data in full Markdown sentences (not just inline comments or docstrings)

At minimum, your README file should:
 * Describe the goal of the project.
 * List the license of the source data and a link to the Wikimedia Foundation terms of use (LINK)
 * Link to all relevant API documentation
 * Describe the values of all fields in your final data file.
 * List any known issues or special considerations with the data that would be useful for another researcher to know. For example, you should describe that data from the Pageviews API excludes spiders/crawlers, while data from the Pagecounts API does not.

Submission instructions

 * 1) Complete your notebook and datasets in Jupyter Hub.
 * 2) Download the data-512-a1 directory from Jupyter Hub.
 * 3) Create the data-512-a1 repository on GitHub with your code and data.
 * 4) Complete and add your README and LICENSE file.
 * 5) Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1174178/assignments/3876066

Required deliverables
A directory in your GitHub repository called data-512-a1 that contains the following files:
 * 5 source data files in JSON format that follow the specified naming convention.
 * 1 final data file in CSV format that follows the specified naming convention.
 * 1 Jupyter notebook named hcds-a1-data-curation that contains all code as well as information necessary to understand each programming step.
 * 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources.
 * 1 LICENSE file that contains an MIT License for your code.
 * 1 .png or .jpeg image of your visualization.
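Before submitting, you could run a quick check of the directory contents against the list above. This sketch is only a convenience and not part of the grading: the function name `check_deliverables` is hypothetical, and it assumes the notebook uses the standard .ipynb extension and that the license file is named LICENSE.

```python
import os


def check_deliverables(names):
    """Return a list of problems found in a list of file names."""
    parts = [os.path.splitext(name) for name in names]
    extensions = [ext.lower() for _, ext in parts]
    stems = [stem for stem, _ in parts]
    problems = []
    if extensions.count(".json") != 5:
        problems.append("expected 5 JSON source data files")
    if extensions.count(".csv") != 1:
        problems.append("expected 1 CSV final data file")
    if ("hcds-a1-data-curation", ".ipynb") not in parts:
        problems.append("expected a notebook named hcds-a1-data-curation")
    if not any(s == "README" and e in (".txt", ".md") for s, e in parts):
        problems.append("expected a README file in .txt or .md format")
    if "LICENSE" not in stems:
        problems.append("expected a LICENSE file")
    if not any(ext in (".png", ".jpeg") for ext in extensions):
        problems.append("expected a .png or .jpeg image of the visualization")
    return problems


# An empty directory listing fails every check.
print(check_deliverables([]))
```

To check your own work, call `check_deliverables(os.listdir("data-512-a1"))`; an empty list means every required file was found. The check does not verify that the JSON and CSV files follow the naming convention.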

Helpful tips

 * Read all instructions carefully before you begin
 * Read all API documentation carefully before you begin
 * Experiment with queries in the sandbox of the technical documentation for each API to familiarize yourself with the schema and the data
 * Ask questions on Slack if you're unsure about anything
 * When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"

A2: Bias in data
For this assignment, you will perform an analysis of the most popular articles on Wikipedia. There are several ways to define what "most popular" might mean in this case. Your job is to compare and contrast several of the possible definitions by exploring the data in different ways, and then create a definition, and make a case for why you think this definition is a good one.

A3: Final project plan
For this assignment, you will write up a study plan for your final class project. The plan will cover a variety of details about your final project, including what data you will use, what you will do with the data (e.g. statistical analysis, train a model), what results you expect or intend, and most importantly, why your project is interesting or important (and to whom, besides yourself).

A4: Crowdwork self-ethnography
For this assignment, you will go undercover as a member of the Amazon Mechanical Turk community. You will perform assigned tasks, participate (or lurk) in Turker discussion forums, and write an ethnographic account of your experience as a human-in-the-loop of data science.

A5: Final project presentation
For this assignment, you will give an in-class presentation of your final project. The goal of this assignment is to demonstrate that you are able to effectively communicate your research questions, methods, conclusions, and implications to your target audience.

A6: Final project report
For this assignment, you will publish the complete code, data, and analysis of your final research project. The goal is to demonstrate that you can incorporate all of the human-centered design considerations you learned in this course and create research artifacts that are understandable, impactful, and reproducible.