Human Centered Data Science (Fall 2019)/Assignments
;Scheduled assignments
* '''A1 - 5 points''' (due 10/18): Data curation (programming/analysis)
* '''A2 - 10 points''' (due 11/1): Sources of bias in data (programming/analysis)
* '''A3 - 10 points''' (due 11/8): Crowdwork Ethnography (written)
* '''A4 - 10 points''' (due 11/22): Final project plan (written)
* '''A5 - 10 points''' (due 12/6): Final project presentation (oral, slides)
* '''A6 - 15 points''' (due 12/9): Final project report (programming/analysis, written)

[[Human Centered Data Science (Fall 2019)/Assignments|more information...]]
=== A2: Bias in data ===
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
# the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
# the countries with the highest and lowest proportion of high quality articles about politicians.

You are also expected to write a short reflection on the project that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.
'''A repository with a README framework and examples of querying the ORES datastore in R and Python can be found [https://github.com/Ironholds/data-512-a2 here]'''
==== Getting the article and population data ====
The first step is getting the data, which lives in several different places. The Wikipedia dataset can be found [https://figshare.com/articles/Untitled_Item/5513449 on Figshare]. Read through the documentation for this repository, then download and unzip it.

The population data is on [https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0 Dropbox]. Download this data as a CSV file (hint: look for the 'Microsoft Excel' icon in the upper right).
==== Getting article quality predictions ====
Now you need to get the predicted quality scores for each article in the Wikipedia dataset. For this step, we're using a Wikimedia API endpoint for a machine learning system called [https://www.mediawiki.org/wiki/ORES ORES] ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

# FA - Featured article
# GA - Good article
# B - B-class article
# C - C-class article
# Start - Start-class article
# Stub - Stub-class article

For context, these quality classes are a subset of quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades English Wikipedia]. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any article you send it.

The ORES API is configured fairly similarly to the pageviews API we used last assignment; documentation can be found [https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model here]. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10". The [https://github.com/Ironholds/data-512-a2 sample iPython notebooks for this assignment] provide examples of a correctly-structured API query that you can use to understand how to gather your data, and also to examine the query output.
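As a rough sketch, a single query in Python might look like the following. The endpoint path and parameters follow the v3 documentation linked above; the User-Agent string is an illustrative placeholder, not a required value.

```python
import json
from urllib.request import Request, urlopen

# Endpoint shape per the ORES v3 scoring documentation linked above.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}/{revid}/{model}"

def build_ores_url(revid, context="enwiki", model="wp10"):
    """Construct the scoring URL for a single revision ID."""
    return ORES_ENDPOINT.format(context=context, revid=revid, model=model)

def get_ores_score(revid):
    """Fetch and parse the ORES score for one revision."""
    # Wikimedia APIs ask clients to identify themselves with a User-Agent.
    req = Request(build_ores_url(revid),
                  headers={"User-Agent": "HCDS A2 student project"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

See the sample notebooks in the repository above for the query structure the course actually expects.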
In order to get article predictions for each article in the Wikipedia dataset, you will need to read <tt>page_data.csv</tt> into Python (or R), and then read through the dataset line by line, using the value of the <tt>last_edit</tt> column in the API query. If you're working in Python, the [https://docs.python.org/3/library/csv.html CSV module] will help with this.
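A minimal sketch of the reading side of that loop, using the standard-library CSV module (the column names here are assumptions based on the description above; check them against the actual file header before relying on them):

```python
import csv

def read_page_data(path="page_data.csv"):
    """Yield one dict per article row.

    Key names are assumed from the dataset documentation
    (e.g. 'page', 'country', 'last_edit'); verify against the file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)
```

Each yielded <tt>row["last_edit"]</tt> value is what you would send to the ORES API.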
When you query the API, you will notice that ORES returns a <tt>prediction</tt> value that contains the name of one category, as well as <tt>probability</tt> values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for <tt>prediction</tt>. We'll talk more about what the other values mean in class next week.
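Since only <tt>prediction</tt> matters here, a small helper can dig it out of the nested response. The response fragment below illustrates the shape of the v3 scoring output; the numbers are made up, not real scores.

```python
def extract_prediction(ores_json, revid, context="enwiki", model="wp10"):
    """Pull the predicted quality class out of an ORES v3 response.

    Returns None when ORES could not score the revision
    (e.g. the revision was deleted).
    """
    try:
        return ores_json[context]["scores"][str(revid)][model]["score"]["prediction"]
    except KeyError:
        return None

# Illustrative response fragment (shape only; values are invented):
sample_response = {"enwiki": {"scores": {"12345": {"wp10": {"score": {
    "prediction": "Stub",
    "probability": {"FA": 0.01, "GA": 0.02, "B": 0.05,
                    "C": 0.12, "Start": 0.30, "Stub": 0.50},
}}}}}}
```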
==== Combining the datasets ====
Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which ''cannot'' be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa. You will need to remove the rows that do not have matching data.

Consolidate the remaining data into a single CSV file which looks something like this:
{| class="wikitable"
! country !! article_name !! revision_id !! article_quality !! population
|}
Note: <tt>revision_id</tt> here is the same thing as <tt>last_edit</tt>, which you used to get scores from the ORES API.
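If you work in pandas rather than the raw CSV module, the merge-and-drop step can be sketched like this. The frame contents and the output filename are made-up placeholders; only the column names follow the table above.

```python
import pandas as pd

# Tiny illustrative frames (values are invented, not real data):
pages = pd.DataFrame({"country": ["Narnia", "Oz"],
                      "article_name": ["Alice", "Bob"],
                      "revision_id": [101, 102],
                      "article_quality": ["GA", "Stub"]})
populations = pd.DataFrame({"country": ["Narnia", "Wonderland"],
                            "population": [1000, 2000]})

# An inner join keeps only countries present in BOTH datasets,
# which removes the unmatched rows in one step.
combined = pages.merge(populations, on="country", how="inner")
combined.to_csv("combined_data.csv", index=False)  # filename is illustrative
```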
==== Analysis ====
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Examples:
* if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be .1%.
* if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
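The two examples above can be written directly as small helper functions (the function names are my own, not part of the assignment):

```python
def articles_per_population_pct(article_count, population):
    """Politician articles as a percentage of a country's population."""
    return 100 * article_count / population

def high_quality_pct(fa_ga_count, article_count):
    """FA/GA-class articles as a percentage of a country's politician articles."""
    return 100 * fa_ga_count / article_count

# Matches the two worked examples above:
# articles_per_population_pct(10, 10_000) -> 0.1
# high_quality_pct(2, 10)                 -> 20.0
```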
==== Tables ====
The tables should be pretty straightforward. Produce four tables that show:
#10 highest-ranked countries in terms of number of politician articles as a proportion of country population
#10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
#10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
#10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

Embed them in the iPython notebook.
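One possible way to produce the four rankings in pandas (the summary frame and its values are invented purely for illustration):

```python
import pandas as pd

def top_and_bottom(per_country, column, n=10):
    """Return the n highest- and n lowest-ranked countries by `column`."""
    ranked = per_country.sort_values(column, ascending=False)
    return ranked.head(n), ranked.tail(n)

# Illustrative per-country summary (made-up values):
summary = pd.DataFrame({
    "country": ["Narnia", "Oz", "Wonderland"],
    "articles_per_population_pct": [0.10, 0.02, 0.50],
    "high_quality_pct": [20.0, 5.0, 0.0],
})
highest, lowest = top_and_bottom(summary, "articles_per_population_pct", n=1)
```

Calling the helper once per metric (with <tt>n=10</tt>) yields all four tables.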
==== Writeup ====
Write a few paragraphs, either in the README or in the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning. Particular questions you might want to answer:
# What biases did you expect to find in the data, and why?
# What are the results?
# What theories do you have about why the results are what they are?
==== Submission instructions ====
#Complete your notebook and datasets in Jupyter Hub.
#Create the <tt>data-512-a2</tt> repository on GitHub with your code and data.
#Complete and add your README and LICENSE file.
#Submit the link to your GitHub repo to: https://canvas.uw.edu/courses/1244514/assignments/4376107
==== Required deliverables ====
A directory in your GitHub repository called <tt>data-512-a2</tt> that contains the following files:
:# 1 final data file in CSV format that follows the formatting conventions.
:# 1 Jupyter notebook named <tt>hcds-a2-bias</tt> that contains all code, the information necessary to understand each programming step, the tables, and your writeup (if you have not included it in the README).
:# 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources, and your writeup (if you have not included it in the notebook). A prototype framework is included in the [https://github.com/Ironholds/data-512-a2 sample repository].
:# 1 LICENSE file that contains an [https://opensource.org/licenses/MIT MIT license] for your code.
==== Helpful tips ====
* Experiment with queries in the sandbox of the technical documentation for the API to familiarize yourself with the schema and the data
* Explore the data a bit before starting to be sure you understand how it is structured and what it contains
* Ask questions on Slack if you're unsure about anything. Please email Os to set up a meeting, or come to office hours, if you want to! This time is set aside specifically for you - it is not an imposition.
* When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"
=== A3: Crowdwork ethnography ===
For this assignment, you will go undercover as a member of the Amazon Mechanical Turk community. You will preview or perform Mechanical Turk tasks (called "HITs"), lurk in Turk worker discussion forums, and write an ethnographic account of your experience as a crowdworker, and how this experience changes your understanding of the phenomenon of crowdwork.

The full assignment description is available [https://docs.google.com/document/d/16lZdTxkw1meUPMzA-BYl8TVtk0Jxv4Wh8mbZq_BursM/edit?usp=sharing as a Google doc] and [[:File:HCDS_Crowdwork_ethnography_instructions.pdf|as a PDF]].
=== A4: Final project plan ===
For this assignment, you will write up a study plan for your final class project. The plan will cover a variety of details about your final project, including what data you will use, what you will do with the data (e.g. statistical analysis, train a model), what results you expect or intend, and most importantly, why your project is interesting or important (and to whom, besides yourself).
=== A5: Final project presentation ===
For this assignment, you will give an in-class presentation of your final project. The goal of this assignment is to demonstrate that you are able to effectively communicate your research questions, methods, conclusions, and implications to your target audience.
=== A6: Final project report ===
For this assignment, you will publish the complete code, data, and analysis of your final research project. The goal is to demonstrate that you can incorporate all of the human-centered design considerations you learned in this course and create research artifacts that are understandable, impactful, and reproducible.

[[Category:HCDS (Fall 2019)]]