Editing Human Centered Data Science (Fall 2019)/Assignments

Latest revision Your text
Line 34: Line 34:


;Scheduled assignments
;Scheduled assignments
* '''A1 - 5 points''' (due 10/3): Data curation (programming/analysis)
* '''A1 - 5 points''' (due 10/18): Data curation (programming/analysis)
* '''A2 - 10 points''' (due 10/17): Bias in data (programming/analysis)
* '''A2 - 10 points''' (due 11/1): Sources of bias in data (programming/analysis)
* '''A3  - 10 points''' (due 10/31): Crowdwork Ethnography (written)
* '''A3  - 10 points''' (due 11/8): Crowdwork Ethnography (written)
* '''A4 - 5 points''' (due 11/7): Final project proposal (written)
* '''A4 - 10 points''' (due 11/22): Final project plan (written)
* '''A5 - 5 points''' (due 11/14): Final project plan (written)
* '''A5 - 10 points''' (due 12/6): Final project presentation (oral, slides)
* '''A6 - 10 points''' (due 12/5): Final project presentation (oral, slides)
* '''A6 - 15 points''' (due 12/9): Final project report (programming/analysis, written)
* '''A7 - 15 points''' (due 12/10): Final project report (programming/analysis, written)


[[Human Centered Data Science (Fall 2019)/Assignments|more information...]]
[[Human Centered Data Science (Fall 2019)/Assignments|more information...]]
Line 88: Line 87:
[[File:En-wikipedia_traffic_200801-201709_thompson.png|300px|thumb|Your assignment is to create a graph that looks a lot like this one, starting from scratch, and following best practices for reproducible research.]]
[[File:En-wikipedia_traffic_200801-201709_thompson.png|300px|thumb|Your assignment is to create a graph that looks a lot like this one, starting from scratch, and following best practices for reproducible research.]]


The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through August 30 2019. All analysis should be performed in a single Jupyter notebook and all data, documentation, and code should be published in a single GitHub repository.
The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through September 30 2018. All analysis should be performed in a single Jupyter notebook and all data, documentation, and code should be published in a single GitHub repository.


The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.
The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.
Line 95: Line 94:


==== Step 0: Read about reproducibility ====
==== Step 0: Read about reproducibility ====
Review Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research'' (University of California Press, 2018).
Read Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research'' (University of California Press, 2018).


==== Step 1: Data acquisition ====
==== Step 1: Data acquisition ====
In order to measure Wikipedia traffic from 2008-2019, you will need to collect data from two different API endpoints, the Legacy Pagecounts API and the Pageviews API.
In order to measure Wikipedia traffic from 2008-2018, you will need to collect data from two different API endpoints, the Legacy Pagecounts API and the Pageviews API.


# The '''Legacy Pagecounts API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts documentation], [https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end endpoint]) provides access to desktop and mobile traffic data from December 2007 through July 2016.
# The '''Legacy Pagecounts API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts documentation], [https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end endpoint]) provides access to desktop and mobile traffic data from December 2007 through July 2016.
#The '''Pageviews API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews documentation], [https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end endpoint]) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.
#The '''Pageviews API''' ([https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews documentation], [https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end endpoint]) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.


For each API, you will need to collect data ''for all months where data is available'' and then save the raw results into 5 separate JSON source data files (one file per API query type) before continuing to step 2.
For each API, you will need to collect data ''for all months where data is available'' and then save the raw results into 5 separate JSON source data files (one file per API query type) before continuing to step 2.


To get you started, you can refer to this example Notebook that contains sample code for API calls ([http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb view the notebook], [http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb?format=raw download the notebook]). This sample code is [https://creativecommons.org/share-your-work/public-domain/cc0/ licensed CC0] so feel free to re-use any of the code in that notebook without attribution.
To get you started, you can refer to this example Notebook that contains sample code for API calls ([http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb view the notebook], [http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb?format=raw download the notebook]). This sample code is [https://creativecommons.org/share-your-work/public-domain/cc0/ licensed CC0] so feel free to re-use any of the code in that notebook without attribution.
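As a rough illustration of the acquisition step, here is a minimal sketch that builds request URLs for the two endpoints and saves a raw response to disk. It is not the code from the example notebook. The helper names (<tt>build_url</tt>, <tt>fetch_json</tt>, <tt>save_raw</tt>), the <tt>agent="user"</tt> value, and the timestamps are assumptions made for illustration; check the API documentation linked above for the exact parameter values you need.

```python
import json
import urllib.request

# URL templates for the two Wikimedia REST API endpoints described above.
LEGACY_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
    "{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def build_url(template, **params):
    """Fill one of the templates above with query parameters (hypothetical helper)."""
    return template.format(**params)


def fetch_json(url, user_agent):
    """Request a URL and return the parsed JSON response.

    user_agent should identify you, e.g. a GitHub URL or an email address.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def save_raw(data, filename):
    """Write the raw API response to a JSON source data file, unmodified."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f)


# Example URL only; the timestamps here are placeholders, not required values.
example_url = build_url(
    PAGEVIEWS_URL,
    project="en.wikipedia.org",
    access="desktop",
    agent="user",
    granularity="monthly",
    start="2015070100",
    end="2019090100",
)
```

Calling <tt>fetch_json(example_url, "your contact info")</tt> and passing the result to <tt>save_raw</tt> would produce one of the five source files; repeat for each access type on each API.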
Line 111: Line 110:


For example, your filename for monthly page views on desktop should be:
For example, your filename for monthly page views on desktop should be:
  pagecounts_desktop-site_200712-201908.json
  pagecounts_desktop-site_200712-201809.json
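One way to keep source filenames consistent is to generate them from their parts. The sketch below infers the pattern from the example filename shown here; <tt>source_filename</tt> is a hypothetical helper, not something the assignment provides.

```python
def source_filename(api_name, access_type, first_month, last_month):
    """Build a filename such as pagecounts_desktop-site_200712-201908.json.

    first_month and last_month are YYYYMM strings (pattern inferred from the example).
    """
    return f"{api_name}_{access_type}_{first_month}-{last_month}.json"


name = source_filename("pagecounts", "desktop-site", "200712", "201908")
```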


'''Important notes:'''
'''Important notes:'''
Line 167: Line 166:
<!-- Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the API does not provide monthly traffic data going back to 2010. -->
<!-- Your visualization should look similar to the example graph above, which is based on the same data you'll be using! The only big difference should be that your mobile traffic data will only go back to October 2014, since the API does not provide monthly traffic data going back to 2010. -->


In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a legend and a title. You must also generate a .png or .jpeg formatted image of your final graph.  
In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a key and a title. You must also generate a .png or .jpeg formatted image of your final graph.  


If possible please graph the data in Python or R, in your notebook, rather than using an external application.  
You should graph the data in Python or R, in your notebook.  


<!-- If you decide to use Google Sheet or some other open, public data visualization platform to build your graph, link to it in the README, and make sure sharing settings allow anyone who clicks on the link to view the graph and download the data! -->
<!-- If you decide to use Google Sheet or some other open, public data visualization platform to build your graph, link to it in the README, and make sure sharing settings allow anyone who clicks on the link to view the graph and download the data! -->


==== Step 4: Documentation ====
==== Step 4: Documentation ====
Follow best practices for documenting your project, as outlined in the lecture slides and in Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research''.  
Follow best practices for documenting your project, as outlined in the Week 3 slides and in Chapter 2 [https://www.practicereproducibleresearch.org/core-chapters/2-assessment.html "Assessing Reproducibility"] and Chapter 3 [https://www.practicereproducibleresearch.org/core-chapters/3-basic.html "The Basic Reproducible Workflow Template"] from ''The Practice of Reproducible Research''.  


Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.
Your documentation will be done in your Jupyter Notebook, a README file, and a LICENSE file.
Line 191: Line 190:
#Create the data-512-a1 repository on GitHub w/ your code and data.
#Create the data-512-a1 repository on GitHub w/ your code and data.
#Complete and add your README and LICENSE file.
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1319253/assignments/4937082
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1244514/assignments/4376106


==== Required deliverables ====
==== Required deliverables ====
Line 210: Line 209:


=== A2: Bias in data ===
=== A2: Bias in data ===
 
<!--
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.


Line 216: Line 215:
# the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
# the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
# the countries with the highest and lowest proportion of high quality articles about politicians.
# the countries with the highest and lowest proportion of high quality articles about politicians.
# a ranking of geographic regions by articles-per-person and proportion of high quality articles.


You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.
You are also expected to write a short reflection on the project, that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.


'''A repository with a README framework and examples of querying the ORES datastore in R and Python can be found [https://github.com/Ironholds/data-512-a2 here]'''


==== Getting the article and population data ====
==== Getting the article and population data ====


The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found [https://figshare.com/articles/Untitled_Item/5513449 on Figshare]. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called <tt>page_data.csv</tt>.
The first step is getting the data, which lives in several different places. The wikipedia dataset can be found [https://figshare.com/articles/Untitled_Item/5513449 on Figshare]. Read through the documentation for this repository, then download and unzip it.  
 
The population data is available in CSV format in the [https://canvas.uw.edu/courses/1319253/files/ Files section of Canvas] under "A2: bias in data". This dataset is drawn from the [https://www.prb.org/international/indicator/population/table world population datasheet] published by the Population Reference Bureau.
 
==== Cleaning the data ====
Both <tt>page_data.csv</tt> and <tt>WPDS_2018_data.csv</tt> contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of <tt>page_data.csv</tt>, the dataset contains some page names that start with the string "Template:". These pages are ''not'' Wikipedia articles, and should not be included in your analysis.


Similarly, <tt>WPDS_2018_data</tt> contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in <tt>page_data</tt>, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.
The population data is on [https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0 Dropbox]. Download this data as a CSV file (hint: look for the 'Microsoft Excel' icon in the upper right).
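The cleaning rules described for the two files can be sketched with the standard <tt>csv</tt> module. This is only an illustration under stated assumptions: the column names <tt>page</tt> and <tt>geography</tt> are passed as parameters because the real headers may differ, and both function names are hypothetical.

```python
import csv
import io


def split_page_rows(rows, page_field="page"):
    """Separate real articles from pages whose name starts with 'Template:'."""
    articles, templates = [], []
    for row in rows:
        if row[page_field].startswith("Template:"):
            templates.append(row)
        else:
            articles.append(row)
    return articles, templates


def split_population_rows(rows, geography_field="geography"):
    """Separate country rows from ALL CAPS regional rows (e.g. AFRICA, OCEANIA)."""
    countries, regions = [], []
    for row in rows:
        if row[geography_field].isupper():
            regions.append(row)
        else:
            countries.append(row)
    return countries, regions


# Tiny made-up sample, standing in for the real CSV files.
sample_pages = io.StringIO("page,country,rev_id\nTemplate:Foo,Kenya,1\nJane Doe,Kenya,2\n")
sample_population = io.StringIO("geography,population\nAFRICA,1284\nKenya,51\n")

articles, templates = split_page_rows(csv.DictReader(sample_pages))
countries, regions = split_population_rows(csv.DictReader(sample_population))
```

Keeping the regional rows in their own list, rather than discarding them, leaves the regional totals available for a later by-region analysis.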


==== Getting article quality predictions ====
==== Getting article quality predictions ====
''A repository with a template README file and examples of querying the ORES datastore in R and Python can be found [https://github.com/Ironholds/data-512-a2 here]''


Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [https://www.mediawiki.org/wiki/ORES ORES] ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:
Now you need to get the predicted quality scores for each article in the Wikipedia dataset. For this step, we're using a Wikimedia API endpoint for a machine learning system called [https://www.mediawiki.org/wiki/ORES ORES] ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:


# FA - Featured article
# FA - Featured article
Line 244: Line 237:
# Stub - Stub-class article
# Stub - Stub-class article


For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades English Wikipedia]. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any <tt>rev_id</tt> you send it.
For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades English Wikipedia]. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any article you send it.
 
In order to get article predictions for each article in the Wikipedia dataset, you will first need to read <tt>page_data.csv</tt> into Python (or R), and then read through the dataset line by line, using the value of the <tt>rev_id</tt> column in the API query. If you're working in Python, the [https://docs.python.org/3/library/csv.html CSV module] will help with this.
 
You have two options for getting data from the ORES:
 
;Option 1: Install and run the ORES client (preferred, Python only)
 
You can ''pip install ores'' in your local notebook environment ([https://github.com/wikimedia/ores installation instructions]). This will allow you to get scores for a list of multiple <tt>rev_id</tt> values in a single batch--you can even send all ~50k articles in <tt>page_data.csv</tt> in a single batch! Here's some demo code:


from ores import api
The ORES API is configured fairly similarly to the pageviews API we used last assignment; documentation can be found [https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model here]. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10". The [https://github.com/Ironholds/data-512-a2 sample iPython notebooks for this assignment] provide examples of a correctly-structured API query that you can use to understand how to gather your data, and also to examine the query output.
#please provide this useragent string (second arg below) to help the ORES team track requests
ores_session = api.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>")
#where 1234, 5678, 91011 below are rev_ids...
results = ores_session.score("enwiki", ["articlequality"], [1234, 5678, 91011])
for score in results:
    print(score)
#where the value for 'prediction' in each response below corresponds to the predicted article quality class
{'articlequality': {'score': {'prediction': 'B', 'probability': {'GA': 0.005565225912988614, 'Stub': 0.285072978841463, 'C': 0.1237249061020009, 'B': 0.2910788689339172, 'Start': 0.2859984921969326, 'FA': 0.008559528012697881}}}}
{'articlequality': {'score': {'prediction': 'Start', 'probability': {'GA': 0.005264197821210708, 'Stub': 0.40368617053424666, 'C': 0.021887833774629408, 'B': 0.029933164235917967, 'Start': 0.5352849001253548, 'FA': 0.0039437335086407645}}}}
{'articlequality': {'score': {'prediction': 'Stub', 'probability': {'GA': 0.0033975128938096197, 'Stub': 0.8980284163392759, 'C': 0.01216786960110309, 'B': 0.01579141569356552, 'Start': 0.06809640787450176, 'FA': 0.0025183775977442226}}}}


;Option 2: Use the REST API endpoint (Python or R)
In order to get article predictions for each article in the Wikipedia dataset, you will need to read <tt>page_data.csv</tt> into Python (or R), and then read through the dataset line by line, using the value of the <tt>last_edit</tt> column in the API query. If you're working in Python, the [https://docs.python.org/3/library/csv.html CSV module] will help with this.


The ORES REST API is configured fairly similarly to the pageviews API we used for Assignment 1. Documentation can be found [https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model here]. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10". The [https://github.com/Ironholds/data-512-a2 sample iPython notebooks for this assignment] provide examples of a correctly-structured API query that you can use to understand how to gather your data, and also to examine the query output.
When you query the API, you will notice that ORES returns a <tt>prediction</tt> value that contains the name of one category, as well as <tt>probability</tt> values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for <tt>prediction</tt>. We'll talk more about what the other values mean in class next week.
 
Whether you query the API or use the client, you will notice that ORES returns a <tt>prediction</tt> value that contains the name of one category, as well as <tt>probability</tt> values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for <tt>prediction</tt>. We'll talk more about what the other values mean in upcoming weeks.
 
''Note:'' It's possible that you will be unable to get a score for a particular article (there are various possible reasons for this, which we can talk about later). If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. I leave the choice up to you.
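To keep only the <tt>prediction</tt> value and log the articles that could not be scored, something like the following sketch may help. It assumes the responses arrive in the same order as the <tt>rev_id</tt> values that were sent, and it treats any response without a <tt>score</tt> and <tt>prediction</tt> as a failure. The error entry in the sample data is a made-up shape for illustration, and <tt>extract_predictions</tt> is a hypothetical helper.

```python
def extract_predictions(rev_ids, scores, model="articlequality"):
    """Pair each rev_id with its predicted quality class.

    Returns (predictions, failed): a dict of rev_id -> class name, and a list
    of rev_ids for which no prediction could be read from the response.
    """
    predictions, failed = {}, []
    for rev_id, score in zip(rev_ids, scores):
        try:
            predictions[rev_id] = score[model]["score"]["prediction"]
        except (KeyError, TypeError):
            # No usable score in this response; remember it for the log.
            failed.append(rev_id)
    return predictions, failed


# Made-up responses: one in the shape shown above, one hypothetical error entry.
sample_scores = [
    {"articlequality": {"score": {"prediction": "B", "probability": {}}}},
    {"articlequality": {"error": {"type": "hypothetical error"}}},
]
predictions, failed = extract_predictions([1234, 5678], sample_scores)
```

The <tt>failed</tt> list can then be printed in the notebook or written to a separate file.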


==== Combining the datasets ====
==== Combining the datasets ====
   
   
Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which ''cannot'' be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa. 
Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which ''cannot'' be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa. You will need to remove the rows that do not have matching data.
 
Please remove any rows that do not have matching data, and output them to a CSV file called <tt>wp_wpds_countries-no_match.csv</tt>
 
Consolidate the remaining data into a single CSV file called <tt>wp_wpds_politicians_by_country.csv</tt>


The schema for that file should look something like this:
Consolidate the remaining data into a single CSV file which looks something like this:




Line 300: Line 267:
|}
|}


Note: <tt>revision_id</tt> here is the same thing as <tt>rev_id</tt>, which you used to get scores from ORES.
Note: <tt>revision_id</tt> here is the same thing as <tt>last_edit</tt>, which you used to get scores from the ORES API.
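The merge on country name, with unmatched rows set aside, could be sketched as follows. The field names <tt>country</tt> and <tt>population</tt> and both helper functions are assumptions for illustration; adapt them to the headers in your own files.

```python
import csv


def merge_by_country(articles, populations):
    """Attach a population to each article row; collect rows with no match.

    Returns (matched, no_match). no_match holds article rows whose country has
    no population entry, followed by population rows with no articles.
    """
    population_by_country = {row["country"]: row["population"] for row in populations}
    article_countries = {row["country"] for row in articles}
    matched, no_match = [], []
    for row in articles:
        if row["country"] in population_by_country:
            matched.append({**row, "population": population_by_country[row["country"]]})
        else:
            no_match.append(row)
    for row in populations:
        if row["country"] not in article_countries:
            no_match.append(row)
    return matched, no_match


def write_csv(rows, filename, fieldnames):
    """Write rows to a CSV file; missing fields are left blank, extras ignored."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample rows.
sample_articles = [
    {"country": "Kenya", "article_name": "Jane Doe", "revision_id": "2"},
    {"country": "Atlantis", "article_name": "John Roe", "revision_id": "3"},
]
sample_populations = [
    {"country": "Kenya", "population": "51"},
    {"country": "Narnia", "population": "1"},
]
matched, no_match = merge_by_country(sample_articles, sample_populations)
```

<tt>write_csv</tt> can then be called once for the matched rows and once for the unmatched rows.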


==== Analysis ====
==== Analysis ====
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.


Examples:
Examples:
* if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be .1%.  
* if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be .1%.  
* if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
* if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.  
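The two percentages in the examples above can be computed per country along these lines. This is a minimal sketch: the field names <tt>country</tt>, <tt>article_quality</tt> and <tt>population</tt> are assumed, <tt>country_stats</tt> is a hypothetical helper, and population is taken as a plain head count.

```python
from collections import Counter

HIGH_QUALITY = {"FA", "GA"}


def country_stats(rows):
    """Return, per country, articles-per-population and high-quality share (both in %)."""
    totals, high, population = Counter(), Counter(), {}
    for row in rows:
        country = row["country"]
        totals[country] += 1
        if row["article_quality"] in HIGH_QUALITY:
            high[country] += 1
        population[country] = float(row["population"])
    return {
        country: {
            "articles_per_population_pct": 100.0 * totals[country] / population[country],
            "high_quality_pct": 100.0 * high[country] / totals[country],
        }
        for country in totals
    }


# The worked example from the text: 10 articles, population 10,000, 2 of them FA or GA.
qualities = ["FA", "GA"] + ["Stub"] * 8
sample_rows = [
    {"country": "Examplia", "article_quality": q, "population": "10000"} for q in qualities
]
stats = country_stats(sample_rows)
```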
 
==== Results format ====
 
Your results from this analysis will be published in the form of data tables. You are being asked to produce '''six total tables''', that show:
 
#'''Top 10 countries by coverage:''' 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
#'''Bottom 10 countries by coverage:''' 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
#'''Top 10 countries by relative quality:''' 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
#'''Bottom 10 countries by relative quality:''' 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
#'''Geographic regions by coverage:''' Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
#'''Geographic regions by relative quality:''' Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
 
Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment, although you are welcome to do so in addition to generating the data tables described above!
 
''Reminder:'' you will find the list of geographic regions, which countries are in each region, and total regional population in the raw <tt>WPDS_2018_data.csv</tt> file. See "Cleaning the data" above for more information.
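Producing a "top 10" and "bottom 10" table comes down to sorting by one metric and slicing. Here is a small sketch, assuming the per-country (or per-region) figures are held in a dict of dicts; <tt>top_and_bottom</tt> and the metric name <tt>pct</tt> are hypothetical.

```python
def top_and_bottom(stats, key, n=10):
    """Rank entries by stats[name][key].

    Returns (top, bottom): the n highest entries in descending order, and the
    n lowest entries in ascending order, each as (name, values) pairs.
    """
    ranked = sorted(stats.items(), key=lambda item: item[1][key], reverse=True)
    return ranked[:n], ranked[-n:][::-1]


# Made-up figures for three countries.
sample_stats = {"A": {"pct": 3.0}, "B": {"pct": 1.0}, "C": {"pct": 2.0}}
top, bottom = top_and_bottom(sample_stats, "pct", n=2)
```

The same function works for the regional rankings if it is given per-region figures and an <tt>n</tt> at least as large as the number of regions.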


==== Writeup: reflections and implications ====
==== Tables ====
Write a few paragraphs, either in the README or at the end of the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.
The tables should be pretty straightforward. Produce four tables that show:
#10 highest-ranked countries in terms of number of politician articles as a proportion of country population
#10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
#10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
#10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


In addition to any reflections you want to share about the process of the assignment, please respond (briefly) to '''at least three''' of the questions below:
Embed them in the iPython notebook.


# What biases did you expect to find in the data (before you started working with it), and why?
==== Writeup ====
# What (potential) sources of bias did you discover in the course of your data processing and analysis?
Write a few paragraphs, either in the README or in the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning. Particular questions you might want to answer:
# What might your results suggest about (English) Wikipedia as a data source?
# What might your results suggest about the internet and global society in general?
# Can you think of a realistic data science research situation where using these data (to train a model, perform hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?
# Can you think of a realistic data science research situation where using these data (to train a model, perform hypothesis-driven research, or make business decisions) might still be appropriate and useful, despite its inherent limitations and biases?
# How might a researcher supplement or transform this dataset to potentially ''correct for'' the limitations/biases you observed?


This section doesn't need to be particularly long or thorough, but we'll expect you to write at least a couple paragraphs.
# What biases did you expect to find in the data, and why?
# What are the results?
# What theories do you have about why the results are what they are?


==== Submission instructions ====
==== Submission instructions ====
#Complete your analysis and write up
#Complete your Notebook and datasets in Jupyter Hub.
#Check all deliverables into your GitHub repo
#Create the data-512-a2 repository on GitHub w/ your code and data.
#Submit the link to your GitHub repo through the [https://canvas.uw.edu/courses/1319253/assignments/4937083 Assignment 2 submission form on Canvas]
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1244514/assignments/4376107


==== Required deliverables ====
==== Required deliverables ====
A directory in your GitHub repository called <tt>data-512-a2</tt> that contains at minimum the following files:
A directory in your GitHub repository called <tt>data-512-a2</tt> that contains the following files:
:# your two source data files and a description of each
:# 1 final data file in CSV format that follows the formatting conventions.
:# 1 final data file in CSV format that contains all articles you analyzed, the corresponding country and population, and their predicted quality score.
:# 1 Jupyter notebook named <tt>hcds-a2-bias</tt> that contains all code as well as information necessary to understand each programming step, as well as your writeup (if you have not included it in the README) and the tables.
:# 1 Jupyter notebook named <tt>hcds-a2-bias</tt> that contains all code as well as information necessary to understand each programming step, as well as your findings (six tables) and your writeup (if you have not included it in the README).
:# 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources, and your writeup (if you have not included it in the notebook). A prototype framework is included in the [https://github.com/Ironholds/data-512-a2 sample repository]
:# 1 LICENSE file that contains an [https://opensource.org/licenses/MIT MIT LICENSE] for your code.
If you created any additional process or incremental files in the course of your data processing and analysis (for example, a list of articles for which you were not able to gather ORES scores), please include these in the folder as well, and briefly describe them in the README.


==== Helpful tips ====
* Experiment with queries in the sandbox of the technical documentation for the API to familiarize yourself with the schema and the data
* Explore the data a bit before starting to be sure you understand how it is structured and what it contains
* Ask questions on Slack if you're unsure about anything. If you need more help, come to office hours or schedule a time to meet with Yihan or Jonathan.
* When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"
-->


=== A3: Crowdwork ethnography ===
For this assignment, you will go undercover as a member of the Amazon Mechanical Turk community. You will preview or perform Mechanical Turk tasks (called "HITs"), lurk in Turk worker discussion forums, and write an ethnographic account of your experience as a crowdworker, and how this experience changes your understanding of the phenomenon of crowdwork.


The full assignment description is available [https://docs.google.com/document/d/16lZdTxkw1meUPMzA-BYl8TVtk0Jxv4Wh8mbZq_BursM/edit?usp=sharing as a Google doc].
 
=== A4: Final project proposal ===
The final project proposal is a short pitch for your final class project. It should include three basic components:
* '''Motivation/problem statement:''' Why are you planning to do this analysis? Why is it potentially interesting and useful, from a scientific, practical, and/or human-centered perspective? What do you hope to learn? Note that you only need to describe your overall research goal at this stage; specific hypotheses or research questions aren’t necessary in the project proposal.
 
* '''Data used:''' What dataset do you plan to use, and why? Summarize what is represented in the dataset; link to the dataset and specify its license/terms of use; briefly justify why this dataset is relevant to your problem statement; and highlight any ethical considerations raised by using this dataset (or explain why you see none).


* '''Unknowns and dependencies:''' Are there any factors outside of your control that might impact your ability to complete this project by the end of the quarter? The purpose of this section is to get you thinking, in a practical sense, about your ability to complete this project within the time allotted.
<!--
''For examples of datasets you may want to use for your final project, see [[HCDS_(Fall_2017)/Datasets]].''
-->


=== A5: Final project plan ===
For this assignment, you will write up a study plan for your final class project. The plan will cover a variety of details about your final project, including what data you will use, what you will do with the data (e.g. statistical analysis, train a model), what results you expect or intend, and most importantly, why your project is interesting or important (and to whom, besides yourself).


The final project plan is an extension of the proposal, and should be in the same (.ipynb or .md) document in your repo. New sections to add are:
* '''Research questions and/or hypotheses:''' These describe what you hope to discover or determine in the course of your research.
:* Example research question: what is the impact of an MS degree on data scientist salaries over the course of their careers?
:* Example hypothesis: earning an MS degree is associated with an increase of x% in career data scientist salaries compared to similar data scientists who do not earn a degree
 
* '''Background and/or Related Work:''' What is already known about the phenomenon you are investigating? How does previous research or background info inform your decision to perform this study, the way you designed the study, or your specific research questions? Make sure to include references (endnotes and/or inline hyperlinks) to the sources of background information--whether they are websites, news articles, or peer-reviewed research.


* '''Methodology:''' Describe how you plan to investigate this phenomenon. Don't just name your analytical methods (e.g. "ordinary least squares", "Student's t-test", "heatmap visualization", or "recurrent neural network"); it's critical to justify why these are appropriate methods for gathering and analyzing your data, or presenting your findings. You are expected to be thorough here: please describe to the best of your ability the entire series of gathering, analysis, and presentation methods you plan to use.
 
=== A6: Final project presentation ===
For this assignment, you will give an in-class presentation of your final project. The goal of this presentation is to demonstrate that you are able to effectively communicate your research questions, methods, conclusions, and implications to a non-data-scientist audience.
 
The presentation will be no more than 5 minutes long. Slides are not necessary, but are probably a good idea.
 
The presentation should demonstrate the following:
* Your ability to give a professional research presentation.
* Your ability to communicate the importance of your research to a specified audience (Imagine that you are pitching your project to directors/execs at a company you work for).
* Your ability to communicate the nature and implications of your findings in an accurate and compelling way.
* Your ability to do all of the above in a very short time (Hint: please practice beforehand and time yourself)
 
=== A7: Final project report ===
For this assignment, you will publish the complete code, data, and analysis of your final research project. The goal is to demonstrate that you can incorporate all of the human-centered design considerations you learned in this course and create research artifacts that are understandable, impactful, and reproducible.
A successful report will take the form of a well-written, well-executed research study document (a repo with a notebook + supporting data files and documentation) that includes:
* All your code and data, thoroughly documented and reproducible
* A human-centered argument for why your analysis is important
* Background research or related work
* Your research question(s)
* The methods, data, and approach that you used to collect and analyze the data
* Findings, implications, and limitations of your study
* A thoughtful reflection that describes the specific ways that human-centered data science principles informed your decision-making in this project—from beginning to end.
Data visualizations aren’t necessary, but are encouraged (they are often an effective way of communicating your findings!)
Your deliverables for the final project proposal and plan are part of this report: you are expected to build your report around these documents.




[[Category:HCDS (Fall 2019)]]