<noinclude>
<div style="font-family:Rockwell,'Courier Bold',Courier,Georgia,'Times New Roman',Times,serif; min-width:10em;">
<div style="float:left; width:100%; margin-right:2%;">
{{Link/Graphic/Main/2
|highlight color= 27666b
|color=460c40
|link=
|image=
|text-align=left
|top font-size= 1.1em
|top color=FFF
|line color=FFF
|top text=This page is a work in progress.
|bottom font-size= 1em
|bottom color= FFF
|bottom text=
|line= none
}}</div></div>
</noinclude>


__FORCETOC__
=== Assignment timeline ===
;Assignments due every week
* '''In-class activities - 2 points''' (weekly): In-class activity output posted to Canvas (group or individual)
* '''Reading reflections - 2 points''' (weekly): Reading reflections posted to Canvas (individual)




;Scheduled assignments
* '''A1 - 5 points''' (due Week 4): Data curation (programming/analysis)
* '''A2 - 10 points''' (due Week 6): Sources of bias in data (programming/analysis)
* '''A3 - 10 points''' (due Week 7): Final project plan (written)
* '''A4 - 10 points''' (due Week 9): Crowdwork self-ethnography (written)
;Instructions
# Do the in-class activity
# Submit the deliverable via Canvas, in the format specified by the instructor within 24 hours of class
# If it is a group assignment:
:*Choose one group member to submit the deliverable for the whole group
:*'''Make sure to list the full names of all group members in the Canvas post'''


Late deliverables will never be accepted, and everyone in the group will lose points. So make sure you choose someone reliable to turn the assignment in!


=== A1: Data curation ===
The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1, 2008 through September 30, 2017.
For this assignment, you will assemble a set of Wikipedia data drawn from several sources, perform a specific set of transformations on the data, check for inconsistencies or errors in the data, and publish the dataset and the code used to create it according to the best practices for open data science research that you have learned in class.
 
The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.
 
Specifically, you will combine Wikipedia traffic data from two different [https://www.mediawiki.org/wiki/REST_API Wikimedia REST API] endpoints into a single dataset, perform some simple data processing steps on that data, and then analyze it.
 
==== Step 1: Data acquisition ====
In order to measure Wikipedia traffic from 2008 through 2017, you will need to collect data from two different API endpoints: the legacy Pagecounts API and the Pageviews API.
 
# The legacy '''Pagecounts API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts documentation], [https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end endpoint]) provides access to desktop and mobile traffic data from January 2008 through July 2016.
#The '''Pageviews API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews documentation], [https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end endpoint]) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through September 2017.
 
You will need to collect data ''for all months'' from both APIs in a Jupyter Notebook and then save the raw results into 5 separate JSON source data files (one file per API query) before continuing to step 2.
 
Your JSON-formatted source data files must contain the complete and un-edited output of your API queries. The naming convention for the source data files is:
apiname_accesstype_firstmonth-lastmonth.json
 
For example, your filename for monthly desktop traffic from the legacy Pagecounts API should be:
pagecounts_desktop-site_200801-201607.json
 
'''Important notes:'''
# As much as possible, we're interested in ''organic'' (user) traffic, as opposed to traffic from web crawlers or spiders. The Pageviews API (but not the Pagecounts API) allows you to filter by <tt>agent=user</tt>. You should do that.
# There is a roughly 13-month period in which both APIs provide traffic data. You need to gather, and later graph, data from both APIs for this period of time.
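
Below is a minimal sketch of one such API query in Python using the <tt>requests</tt> library. The endpoint templates come from the API documentation linked above; the parameter values, header contents, start/end timestamps, and output filename are illustrative assumptions that you should check against the documentation and the naming convention described above.

<syntaxhighlight lang="python">
import json
import requests

# Endpoint templates from the API documentation linked above. The legacy
# Pagecounts endpoint uses the same path structure, but takes different
# 'access' values (desktop-site, mobile-site) and has no 'agent' segment.
PAGEVIEWS_ENDPOINT = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
                      '{project}/{access}/{agent}/{granularity}/{start}/{end}')
PAGECOUNTS_ENDPOINT = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
                       '{project}/{access}/{granularity}/{start}/{end}')

# Placeholder contact info (recommended so Wikimedia can identify your queries).
HEADERS = {'User-Agent': 'https://github.com/your_github_username',
           'From': 'your_uw_email@uw.edu'}

def api_call(endpoint, parameters):
    """Query one endpoint and return the parsed JSON response."""
    return requests.get(endpoint.format(**parameters), headers=HEADERS).json()

# Example query: monthly desktop pageviews by users only (agent=user filters
# out most spider/crawler traffic). The start/end values are assumptions;
# check the documentation for the exact bounds covering July 2015 - September 2017.
params = {'project': 'en.wikipedia.org',
          'access': 'desktop',
          'agent': 'user',
          'granularity': 'monthly',
          'start': '2015070100',
          'end': '2017100100'}
monthly_desktop_views = api_call(PAGEVIEWS_ENDPOINT, params)

# Save the complete, un-edited API response as one of your 5 JSON source files.
with open('pageviews_desktop_201507-201709.json', 'w') as outfile:
    json.dump(monthly_desktop_views, outfile)
</syntaxhighlight>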
 
==== Step 2: Data processing ====
You will need to perform a series of processing steps on these data files in order to prepare them for analysis. Follow these steps exactly. At the end of this step, you will have a single CSV-formatted data file that can be used in your analysis (step 3) with no significant additional processing.
 
* For data collected from the Pageviews API, combine the monthly values for <tt>mobile-app</tt> and <tt>mobile-web</tt> to create a total mobile traffic count for each month.
* For all data, separate the value of <tt>timestamp</tt> into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH).
Combine all data into a single CSV file with the following headers:
 
{|class="wikitable"
|-
! Column
!Value
|-
|year
|YYYY
|-
| month
|MM
|-
| pagecount_all_views
|num_views
|-
| pagecount_desktop_views
|num_views
|-
|pagecount_mobile_views
|num_views
|-
|pageview_all_views
|num_views
|-
|pageview_desktop_views
|num_views
|-
|pageview_mobile_views
|num_views
|}
 
For any month in which a given access method (e.g. <tt>desktop-site</tt>, <tt>mobile-app</tt>) has no traffic data, the value for that (column, month) should be listed as 0. For example, all values of <tt>pagecount_mobile_views</tt> for months before October 2014 should be 0, because mobile traffic data is not available before that month.
 
The final data file should be named:
en-wikipedia_traffic_200801-201709.csv
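
The following is one possible processing sketch in Python. It assumes the five JSON source files from step 1 (the filenames below follow the naming convention above; adjust them to match your own files). It combines mobile app and mobile web counts, splits the timestamps into year and month, and writes the combined CSV.

<syntaxhighlight lang="python">
import csv
import json

def load_monthly_counts(filename):
    """Return {(year, month): count} for one raw API response file."""
    with open(filename) as f:
        items = json.load(f)['items']
    # The Pageviews API reports 'views'; the legacy Pagecounts API reports 'count'.
    # Timestamps look like '2015070100': YYYY = [:4], MM = [4:6]; DDHH is discarded.
    return {(item['timestamp'][:4], item['timestamp'][4:6]):
            item.get('views', item.get('count', 0)) for item in items}

pc_desktop = load_monthly_counts('pagecounts_desktop-site_200801-201607.json')
pc_mobile = load_monthly_counts('pagecounts_mobile-site_200801-201607.json')
pv_desktop = load_monthly_counts('pageviews_desktop_201507-201709.json')
pv_mobile_app = load_monthly_counts('pageviews_mobile-app_201507-201709.json')
pv_mobile_web = load_monthly_counts('pageviews_mobile-web_201507-201709.json')

all_months = sorted(set(pc_desktop) | set(pc_mobile) | set(pv_desktop)
                    | set(pv_mobile_app) | set(pv_mobile_web))

rows = []
for year, month in all_months:
    # Months missing from a given file get 0, per the rule above.
    pageview_mobile = (pv_mobile_app.get((year, month), 0)
                       + pv_mobile_web.get((year, month), 0))
    row = {'year': year,
           'month': month,
           'pagecount_desktop_views': pc_desktop.get((year, month), 0),
           'pagecount_mobile_views': pc_mobile.get((year, month), 0),
           'pageview_desktop_views': pv_desktop.get((year, month), 0),
           'pageview_mobile_views': pageview_mobile}
    row['pagecount_all_views'] = (row['pagecount_desktop_views']
                                  + row['pagecount_mobile_views'])
    row['pageview_all_views'] = (row['pageview_desktop_views']
                                 + row['pageview_mobile_views'])
    rows.append(row)

fieldnames = ['year', 'month', 'pagecount_all_views', 'pagecount_desktop_views',
              'pagecount_mobile_views', 'pageview_all_views',
              'pageview_desktop_views', 'pageview_mobile_views']
with open('en-wikipedia_traffic_200801-201709.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
</syntaxhighlight>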
 
==== Step 3: Analysis ====
[[File:PlotPageviewsEN_overlap.png|200px|thumb|A sample visualization of pageview traffic data.]]
For this assignment, the "analysis" will be fairly straightforward: you will visualize the dataset you have created as a time series graph.
 
Your visualization will track three traffic metrics: mobile traffic, desktop traffic, and all traffic (mobile + desktop).
 
Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the APIs do not provide monthly mobile traffic data before that month.
 
In order to complete the analysis correctly and receive full credit, your graph will need to use a scale that makes the data easy to read; all units, axes, and values should be clearly labeled; and the graph should include a legend and a title. You must also generate a .png or .jpeg formatted image of your final graph.
 
You may choose to graph the data in Python, in your notebook. If you decide to use Google Sheets or some other open, public data visualization platform to build your graph, link to it in the README, and make sure the sharing settings allow anyone who clicks the link to view the graph and download the data!
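
If you plot in your notebook, a minimal sketch using pandas and matplotlib might look like the following. It assumes the final CSV from step 2; months with a value of 0 for a metric are masked so they are not drawn as zero traffic.

<syntaxhighlight lang="python">
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('en-wikipedia_traffic_200801-201709.csv')
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))

metrics = ['pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views',
           'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']

fig, ax = plt.subplots(figsize=(12, 6))
for column in metrics:
    # Mask zeros so months with no data for this metric are not drawn as zero traffic.
    ax.plot(df['date'], df[column].mask(df[column] == 0), label=column)

ax.set_title('Monthly traffic on English Wikipedia, January 2008 - September 2017')
ax.set_xlabel('Month')
ax.set_ylabel('Page views / page counts per month')
ax.legend()
fig.savefig('en-wikipedia_traffic_200801-201709.png')
</syntaxhighlight>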
 
==== Step 4: Documentation ====
Follow best practices for documenting your project, as outlined in the Week 3 slides (LINK). Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.
 
At minimum, your Jupyter Notebook should:
* Provide a short, clear description of every step in the acquisition, processing, and analysis of your data ''in full Markdown sentences'' (not just inline comments or docstrings)
 
At minimum, your README file should:
* Describe the goal of the project.
* List the license of the source data and a link to the Wikimedia Foundation terms of use (LINK)
* Link to all relevant API documentation
* Describe the values of all fields in your final data file.
* List any known issues or special considerations with the data that would be useful for another researcher to know. For example, you should describe that data from the Pageview API excludes spiders/crawlers, while data from the Pagecounts API does not.
 
==== Submission instructions ====
#Complete your Notebook and datasets in Jupyter Hub.
#Download the data-512-a1 directory from Jupyter Hub.
#Create the data-512-a1 repository on GitHub w/ your code and data.
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1174178/assignments/3876066
 
==== Required deliverables ====
A directory in your GitHub repository called <tt>data-512-a1</tt> that contains the following files:
:# 5 source data files in JSON format that follow the specified naming convention.
:# 1 final data file in CSV format that follows the specified naming convention.
:# 1 Jupyter notebook named <tt>hcds-a1-data-curation</tt> that contains all code as well as information necessary to understand each programming step.
:# 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources.
:# 1 LICENSE file that contains an [https://opensource.org/licenses/MIT MIT LICENSE] for your code.
:# 1 .png or .jpeg image of your visualization.
 
==== Helpful tips ====
* Read all instructions carefully before you begin
* Read all API documentation carefully before you begin
* Experiment with queries in the sandbox of the technical documentation for each API to familiarize yourself with the schema and the data
* Ask questions on Slack if you're unsure about anything
* When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"


=== A2: Bias in data ===
The goal of this assignment is to explore the concept of 'bias' through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
You are expected to perform an analysis of how the ''coverage'' of politicians on Wikipedia and the ''quality'' of articles about politicians vary between countries. Your analysis will consist of a series of tables that show:
# the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
# the countries with the highest and lowest proportion of high quality articles about politicians.
You are also expected to write a short reflection on the project that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.
==== Getting the article and population data ====
The first step is getting the data, which lives in several different places. The Wikipedia dataset can be found [https://figshare.com/articles/Untitled_Item/5513449 on Figshare]. Read through the documentation for this repository, then download and unzip it.
The population data is on the [http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14 Population Reference Bureau website]. Download this data as a CSV file (hint: look for the 'Microsoft Excel' icon in the upper right).
==== Getting article quality predictions ====
Now you need to get the predicted quality scores for each article in the Wikipedia dataset. For this step, we're using a Wikimedia API endpoint for a machine learning system called [https://www.mediawiki.org/wiki/ORES ORES] ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:
# FA - Featured article
# GA - Good article
# B - B-class article
# C - C-class article
# Start - Start-class article
# Stub - Stub-class article
For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades English Wikipedia]. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any article you send it.
The ORES API is configured fairly similarly to the pageviews API we used last assignment; documentation can be found [https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model here]. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10". The sample iPython notebook for this assignment provides an example of a correctly-structured API query that you can use to understand how to gather your data, and also to examine the query output.
In order to get article predictions for each article in the Wikipedia dataset, you will need to read <tt>page_data.csv</tt> into Python (or R), and then read through the dataset line by line, using the value of the <tt>last_edit</tt> column in the API query. If you're working in Python, the [https://docs.python.org/3/library/csv.html CSV module] will help with this.
When you query the API, you will notice that ORES returns a <tt>prediction</tt> value that contains the name of one category, as well as <tt>probability</tt> values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for <tt>prediction</tt>. We'll talk more about what the other values mean in class next week.
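Below is a minimal sketch of scoring one revision at a time with the <tt>wp10</tt> model. The endpoint path follows the ORES documentation linked above; the response structure, the <tt>page_data.csv</tt> column name, and the header contents are assumptions you should verify against the sample notebook and the actual data. (The API also accepts batches of revision IDs, which is much faster for a dataset of this size.)

<syntaxhighlight lang="python">
import csv
import requests

# Placeholder contact info so the ORES team can identify your requests.
HEADERS = {'User-Agent': 'https://github.com/your_github_username',
           'From': 'your_uw_email@uw.edu'}

ORES_ENDPOINT = 'https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/wp10'

def get_quality_prediction(rev_id):
    """Return the predicted quality class for one revision, or None if it cannot be scored."""
    response = requests.get(ORES_ENDPOINT.format(rev_id=rev_id), headers=HEADERS).json()
    try:
        return response['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
    except KeyError:
        # ORES returns an error entry (no 'score' key) for revisions it cannot score,
        # e.g. deleted revisions; record those as missing.
        return None

predictions = {}
with open('page_data.csv', encoding='utf-8') as infile:
    for row in csv.DictReader(infile):
        rev_id = row['last_edit']   # adjust if the header in page_data.csv is named differently
        predictions[rev_id] = get_quality_prediction(rev_id)
</syntaxhighlight>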
==== Combining the datasets ====
Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both datasets have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which ''cannot'' be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa. You will need to remove the rows that do not have matching data.
Consolidate the remaining data into a single CSV file which looks something like this:
{|class="wikitable"
|-
! Column
|-
|country
|-
|article_name
|-
|revision_id
|-
|article_quality
|-
|population
|}
Note: <tt>revision_id</tt> here is the same thing as <tt>last_edit</tt>, which you used to get scores from the ORES API.
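One way to do the merge with pandas is sketched below. The filenames and some column names here are hypothetical: the sketch assumes you have saved the article data with an added <tt>article_quality</tt> column holding the ORES predictions, and that you have cleaned the PRB download into a simple two-column file of country names and populations.

<syntaxhighlight lang="python">
import pandas as pd

# Hypothetical filenames; substitute the names of your own intermediate files.
articles = pd.read_csv('page_data_with_quality.csv')    # page, country, last_edit, article_quality
population = pd.read_csv('population_by_country.csv')   # country, population

# An inner join keeps only rows whose country names appear in both datasets,
# which drops the entries that cannot be matched.
merged = articles.merge(population, on='country', how='inner')

# Rename columns to match the required output schema (adjust the left-hand
# names to whatever your intermediate files actually use).
merged = merged.rename(columns={'page': 'article_name', 'last_edit': 'revision_id'})
merged = merged[['country', 'article_name', 'revision_id', 'article_quality', 'population']]
merged.to_csv('hcds-a2-final-data.csv', index=False)    # hypothetical output filename
</syntaxhighlight>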
==== Analysis ====
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
Examples:
* if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be .1%.
* if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
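A sketch of how these proportions could be computed with pandas is shown below, assuming the merged file from the previous step (the filename is hypothetical and the population column is assumed to be numeric). Sorting the resulting frame and taking <tt>.head(10)</tt> gives the tables described in the next section.

<syntaxhighlight lang="python">
import pandas as pd

data = pd.read_csv('hcds-a2-final-data.csv')   # hypothetical filename from the merge step

per_country = data.groupby('country').agg(
    population=('population', 'first'),
    article_count=('article_name', 'count'),
    high_quality_count=('article_quality', lambda q: q.isin(['FA', 'GA']).sum()),
)

per_country['articles_per_population_pct'] = (
    100 * per_country['article_count'] / per_country['population'])
per_country['high_quality_pct'] = (
    100 * per_country['high_quality_count'] / per_country['article_count'])

# Example: the 10 highest-ranked countries by articles-per-population.
print(per_country.sort_values('articles_per_population_pct', ascending=False).head(10))
</syntaxhighlight>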
==== Tables ====
The tables should be pretty straightforward. Produce four tables that show:
#10 highest-ranked countries in terms of number of politician articles as a proportion of country population
#10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
#10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
#10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
Embed them in the iPython notebook.
==== Writeup ====
Write a few paragraphs, either in the README or in the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.
==== Submission instructions ====
#Complete your Notebook and datasets in Jupyter Hub.
#Create the data-512-a2 repository on GitHub w/ your code and data.
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1174178/assignments/3876068
==== Required deliverables ====
A directory in your GitHub repository called <tt>data-512-a2</tt> that contains the following files:
:# 1 final data file in CSV format that follows the formatting conventions.
:# 1 Jupyter notebook named <tt>hcds-a2-bias</tt> that contains all code as well as information necessary to understand each programming step, as well as your writeup (if you have not included it in the README) and the tables.
:# 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources, and your writeup (if you have not included it in the notebook).
:# 1 LICENSE file that contains an [https://opensource.org/licenses/MIT MIT LICENSE] for your code.
==== Helpful tips ====
* Read all instructions carefully before you begin
* Read all API documentation carefully before you begin
* Experiment with queries in the sandbox of the technical documentation for the API to familiarize yourself with the schema and the data
* Explore the data a bit before starting to be sure you understand how it is structured and what it contains
* Ask questions on Slack if you're unsure about anything
* When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"


=== A3: Final project plan ===
''For examples of datasets you may want to use for your final project, see [[HCDS_(Fall_2017)/Datasets]].''
For this assignment, you will write up a study plan for your final class project. The plan will cover a variety of details about your final project, including what data you will use, what you will do with the data (e.g. statistical analysis, train a model), what results you expect or intend, and most importantly, why your project is interesting or important (and to whom, besides yourself).


=== A4: Crowdwork self-ethnography ===
For this assignment, you will go undercover as a member of the Amazon Mechanical Turk community. You will preview or perform Mechanical Turk tasks (called "HITs"), lurk in Turk worker discussion forums, and write an ethnographic account of your experience as a crowdworker, and how this experience changes your understanding of the phenomenon of crowdwork.
 
The full assignment description is available in PDF form [[:File:HCDS_A4_Crowdwork_ethnography.pdf|here]].


=== A5: Final project presentation ===
For this assignment, you will give an in-class presentation of your final project. The goal of this assignment is to demonstrate that you are able to effectively communicate your research questions, methods, conclusions, and implications to your target audience.


=== A6: Final project report ===
For this assignment, you will publish the complete code, data, and analysis of your final research project. The goal is to demonstrate that you can incorporate all of the human-centered design considerations you learned in this course and create research artifacts that are understandable, impactful, and reproducible.
 




[[Category:HCDS (Fall 2017)]]