Editing Human Centered Data Science (Fall 2019)/Assignments (section)

==== Getting article quality predictions ====
''A repository with a template README file and examples of querying the ORES datastore in R and Python can be found [https://github.com/Ironholds/data-512-a2 here]''

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [https://www.mediawiki.org/wiki/ORES ORES] ("Objective Revision Evaluation Service"). ORES estimates the quality of an article at a particular point in time, and assigns a probability to each of 6 quality categories. The categories are, from best to worst:

# FA - Featured article
# GA - Good article
# B - B-class article
# C - C-class article
# Start - Start-class article
# Stub - Stub-class article

For context, these quality classes are a subset of the quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades English Wikipedia]. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any <tt>rev_id</tt> you send it.

To get quality predictions for each article in the Wikipedia dataset, you will first need to read <tt>page_data.csv</tt> into Python (or R), and then read through the dataset line by line, using the value of the <tt>rev_id</tt> column in each API query. If you're working in Python, the [https://docs.python.org/3/library/csv.html CSV module] will help with this.
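
As a rough sketch of the reading step described above, the snippet below collects the <tt>rev_id</tt> values with the CSV module. The helper name <tt>read_rev_ids</tt>, the <tt>page</tt> column, and the sample rows are made up for illustration; only the <tt>rev_id</tt> column name comes from the assignment.

```python
import csv
import io


def read_rev_ids(lines):
    """Return the rev_id of every row in a CSV that has a 'rev_id' header."""
    reader = csv.DictReader(lines)
    return [int(row["rev_id"]) for row in reader]


# Tiny in-memory stand-in for page_data.csv; these rows are invented.
sample = io.StringIO("page,rev_id\nExample A,1234\nExample B,5678\n")
rev_ids = read_rev_ids(sample)
print(rev_ids)

# With the real file (assuming it sits in the working directory):
# with open("page_data.csv", newline="", encoding="utf-8") as f:
#     rev_ids = read_rev_ids(f)
```

Using <tt>csv.DictReader</tt> looks columns up by header name, so the code does not depend on the position of <tt>rev_id</tt> in the file.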

You have two options for getting data from ORES:

;Option 1: Install and run the ORES client (preferred, Python only)

You can <tt>pip install ores</tt> in your local notebook environment ([https://github.com/wikimedia/ores installation instructions]). This will allow you to get scores for a list of multiple <tt>rev_id</tt> values in a single batch. You can even send all ~50k articles in <tt>page_data.csv</tt> in a single batch! Here's some demo code:

 from ores import api
 # Please provide this user-agent string (second argument below) to help the ORES team track requests
 ores_session = api.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>")
 # 1234, 5678, and 91011 below are example rev_ids
 results = ores_session.score("enwiki", ["articlequality"], [1234, 5678, 91011])
 for score in results:
     print(score)
 # Example output: the value for 'prediction' in each response is the predicted article quality class
 {'articlequality': {'score': {'prediction': 'B', 'probability': {'GA': 0.005565225912988614, 'Stub': 0.285072978841463, 'C': 0.1237249061020009, 'B': 0.2910788689339172, 'Start': 0.2859984921969326, 'FA': 0.008559528012697881}}}}
 {'articlequality': {'score': {'prediction': 'Start', 'probability': {'GA': 0.005264197821210708, 'Stub': 0.40368617053424666, 'C': 0.021887833774629408, 'B': 0.029933164235917967, 'Start': 0.5352849001253548, 'FA': 0.0039437335086407645}}}}
 {'articlequality': {'score': {'prediction': 'Stub', 'probability': {'GA': 0.0033975128938096197, 'Stub': 0.8980284163392759, 'C': 0.01216786960110309, 'B': 0.01579141569356552, 'Start': 0.06809640787450176, 'FA': 0.0025183775977442226}}}}

;Option 2: Use the REST API endpoint (Python or R)

The ORES REST API is configured much like the pageviews API we used for Assignment 1. Documentation can be found [https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model here]. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10". The [https://github.com/Ironholds/data-512-a2 sample IPython notebooks for this assignment] provide examples of a correctly structured API query that you can use to understand how to gather your data, and also to examine the query output.
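
For orientation, here is a minimal sketch of what a single-revision query might look like in Python. The URL pattern follows the <tt>context</tt>/<tt>revid</tt>/<tt>model</tt> path named in the documentation link above; the nested response layout used in <tt>parse_prediction</tt> is an assumption that you should check against the sample notebooks, and all function names are hypothetical.

```python
import json
import urllib.request

# Path pattern: /v3/scores/{context}/{revid}/{model}
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}/{rev_id}/{model}"


def build_score_url(rev_id, model="wp10", context="enwiki"):
    """Build the query URL for one revision ID."""
    return ORES_ENDPOINT.format(context=context, rev_id=rev_id, model=model)


def parse_prediction(response, rev_id, model="wp10", context="enwiki"):
    """Return the predicted class from a decoded response, or None if absent.

    Assumes the layout response[context]["scores"][rev_id][model]["score"].
    """
    try:
        return response[context]["scores"][str(rev_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        return None


def fetch_prediction(rev_id, user_agent, model="wp10", context="enwiki"):
    """Query the endpoint for one revision (needs network access)."""
    request = urllib.request.Request(
        build_score_url(rev_id, model, context),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request) as reply:
        response = json.load(reply)
    return parse_prediction(response, rev_id, model, context)


# Offline check with a hand-written response in the assumed layout.
sample_response = {"enwiki": {"scores": {"1234": {"wp10": {"score": {"prediction": "Stub"}}}}}}
print(build_score_url(1234))
print(parse_prediction(sample_response, 1234))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without sending any requests.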

Whether you query the API or use the client, you will notice that ORES returns a <tt>prediction</tt> value that contains the name of one category, as well as <tt>probability</tt> values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for <tt>prediction</tt>. We'll talk more about what the other values mean in upcoming weeks.
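
One possible way to pull out just the <tt>prediction</tt> value from a client response is sketched below. It assumes each score is a dictionary shaped like the demo output under Option 1; the helper name <tt>extract_prediction</tt> is hypothetical, and the sample probabilities are rounded from the third demo response.

```python
def extract_prediction(score, model="articlequality"):
    """Return the predicted class from one score dict, or None if it has none."""
    return score.get(model, {}).get("score", {}).get("prediction")


# Same shape as the demo output, with probabilities rounded for readability.
sample_score = {
    "articlequality": {
        "score": {
            "prediction": "Stub",
            "probability": {
                "FA": 0.0025, "GA": 0.0034, "B": 0.0158,
                "C": 0.0122, "Start": 0.0681, "Stub": 0.8980,
            },
        }
    }
}

print(extract_prediction(sample_score))
```

Returning <tt>None</tt> when no prediction is present makes it easy to spot the revisions that could not be scored.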

''Note:'' It's possible that you will be unable to get a score for a particular article (there are various possible reasons for this, which we can talk about later). If that happens, make sure to maintain a log of the articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file or, if it's only a few articles, simply printed within the notebook. I leave the choice up to you.
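
A simple way to keep such a log, sketched here under the assumption that you have already paired each <tt>rev_id</tt> with its prediction and used <tt>None</tt> to mark a missing score. The function names and the sample pairs are hypothetical.

```python
import csv
import io


def split_results(pairs):
    """Separate (rev_id, prediction) pairs into scored rows and unscored rev_ids."""
    scored, unscored = [], []
    for rev_id, prediction in pairs:
        if prediction is None:
            unscored.append(rev_id)
        else:
            scored.append((rev_id, prediction))
    return scored, unscored


def write_unscored_log(rev_ids, out):
    """Write the rev_ids that could not be scored as a one-column CSV."""
    writer = csv.writer(out)
    writer.writerow(["rev_id"])
    for rev_id in rev_ids:
        writer.writerow([rev_id])


# Invented example: the second revision has no score.
pairs = [(1234, "B"), (5678, None), (91011, "Stub")]
scored, unscored = split_results(pairs)

log = io.StringIO()  # swap in open("unscored.csv", "w", newline="") for a real file
write_unscored_log(unscored, log)
print(scored)
print(log.getvalue())
```

If only a handful of revisions end up in <tt>unscored</tt>, printing the list in the notebook is enough; otherwise write it to a file as shown.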