Human Centered Data Science (Fall 2019)/Assignments
==== Getting article quality predictions ====
''A repository with a template README file and examples of querying the ORES datastore in R and Python can be found [https://github.com/Ironholds/data-512-a2 here].''

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [https://www.mediawiki.org/wiki/ORES ORES] ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

# FA - Featured article
# GA - Good article
# B - B-class article
# C - C-class article
# Start - Start-class article
# Stub - Stub-class article

For context, these quality classes are a subset of the quality assessment categories developed by Wikipedia editors. If you're curious, you can read more about what these assessment classes mean on [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades English Wikipedia]. We will talk about what these categories mean, and how the ORES model predicts which category an article goes into, next week in class. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any <tt>rev_id</tt> you send it.

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read <tt>page_data.csv</tt> into Python (or R), and then read through the dataset line by line, using the value of the <tt>rev_id</tt> column in the API query. If you're working in Python, the [https://docs.python.org/3/library/csv.html CSV module] will help with this.

You have two options for getting data from ORES:

;Option 1: Install and run the ORES client (preferred, Python only)
You can ''pip install ores'' in your local notebook environment ([https://github.com/wikimedia/ores installation instructions]).
This will allow you to get scores for a list of multiple <tt>rev_id</tt> values in a single batch; you can even send all ~50k articles in <tt>page_data.csv</tt> in a single batch! Here's some demo code:

 from ores import api
 
 # please provide this useragent string (second arg below) to help the ORES team track requests
 ores_session = api.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>")
 
 # where 1234, 5678, 91011 below are rev_ids...
 results = ores_session.score("enwiki", ["articlequality"], [1234, 5678, 91011])
 
 for score in results:
     print(score)

This prints one response per <tt>rev_id</tt>, where the value for <tt>'prediction'</tt> in each response corresponds to the predicted article quality class:

 {'articlequality': {'score': {'prediction': 'B', 'probability': {'GA': 0.005565225912988614, 'Stub': 0.285072978841463, 'C': 0.1237249061020009, 'B': 0.2910788689339172, 'Start': 0.2859984921969326, 'FA': 0.008559528012697881}}}}
 {'articlequality': {'score': {'prediction': 'Start', 'probability': {'GA': 0.005264197821210708, 'Stub': 0.40368617053424666, 'C': 0.021887833774629408, 'B': 0.029933164235917967, 'Start': 0.5352849001253548, 'FA': 0.0039437335086407645}}}}
 {'articlequality': {'score': {'prediction': 'Stub', 'probability': {'GA': 0.0033975128938096197, 'Stub': 0.8980284163392759, 'C': 0.01216786960110309, 'B': 0.01579141569356552, 'Start': 0.06809640787450176, 'FA': 0.0025183775977442226}}}}

;Option 2: Use the REST API endpoint (Python or R)
The ORES REST API is configured fairly similarly to the pageviews API we used for Assignment 1. Documentation can be found [https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model here]. It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "wp10".
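As a rough illustration of Option 2, the sketch below builds a request URL for a single <tt>rev_id</tt> and pulls the <tt>prediction</tt> out of the JSON that comes back. The URL pattern follows the endpoint name in the documentation linked above (context, revision ID, model). The nested layout of the response (<tt>enwiki</tt>, then <tt>scores</tt>, then the revision ID, then the model name) is an assumption, so check it against a real response before relying on it. The helper names <tt>build_score_url</tt>, <tt>fetch_score</tt> and <tt>extract_prediction</tt> are hypothetical and are not part of any ORES library.

```python
import json
import urllib.request

# URL pattern taken from the endpoint name in the ORES v3 documentation.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}/{revid}/{model}"


def build_score_url(rev_id, context="enwiki", model="wp10"):
    """Fill in the endpoint template for a single revision ID."""
    return ORES_ENDPOINT.format(context=context, revid=rev_id, model=model)


def fetch_score(rev_id, user_agent, context="enwiki", model="wp10"):
    """Request one score and return the decoded JSON (makes a network call)."""
    request = urllib.request.Request(
        build_score_url(rev_id, context, model),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_prediction(response, rev_id, context="enwiki", model="wp10"):
    """Return the predicted class, or None when no score is present.

    Assumes the response is nested as
    {context: {"scores": {"<rev_id>": {model: {"score": {...}}}}}}.
    """
    try:
        return response[context]["scores"][str(rev_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        return None
```

Returning <tt>None</tt> rather than raising an exception makes it easy to collect the revisions that could not be scored.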
The [https://github.com/Ironholds/data-512-a2 sample IPython notebooks for this assignment] provide examples of a correctly-structured API query that you can use to understand how to gather your data, and also to examine the query output.

Whether you query the API or use the client, you will notice that ORES returns a <tt>prediction</tt> value that contains the name of one category, as well as <tt>probability</tt> values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for <tt>prediction</tt>. We'll talk more about what the other values mean in upcoming weeks.

''Note:'' It's possible that you will be unable to get a score for a particular article (there are various possible reasons for this, which we can talk about later). If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. I leave the choice up to you.
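One possible way to tie these steps together is sketched below: read the <tt>rev_id</tt> column with the CSV module, keep only the <tt>prediction</tt> from each response, and collect the revisions that came back without a score so they can be logged. It assumes the responses have the shape shown in the client demo output, that they arrive in the same order as the <tt>rev_id</tt> values that were sent, and that a failed lookup carries an <tt>error</tt> entry instead of a <tt>score</tt>. The function names, the <tt>page</tt> and <tt>country</tt> column names, and the sample rows and responses are hypothetical and only there so the sketch runs without a network connection.

```python
import csv
import io


def read_rev_ids(csv_file):
    """Yield the rev_id column of an open CSV file as integers."""
    for row in csv.DictReader(csv_file):
        yield int(row["rev_id"])


def split_predictions(rev_ids, responses, model="articlequality"):
    """Pair each rev_id with its prediction and list the ones without a score.

    Assumes responses are in the same order as rev_ids.
    """
    predictions = {}
    failed = []
    for rev_id, response in zip(rev_ids, responses):
        score = response.get(model, {}).get("score")
        if score and "prediction" in score:
            predictions[rev_id] = score["prediction"]
        else:
            failed.append(rev_id)
    return predictions, failed


# Hypothetical stand-in for page_data.csv and for the ORES responses.
sample_csv = io.StringIO("page,country,rev_id\nExample A,X,1234\nExample B,Y,5678\n")
sample_responses = [
    {"articlequality": {"score": {"prediction": "Stub", "probability": {}}}},
    {"articlequality": {"error": {"message": "revision not found"}}},
]

rev_ids = list(read_rev_ids(sample_csv))
predictions, failed = split_predictions(rev_ids, sample_responses)
print(predictions)
print("No ORES score for:", failed)
```

With the real data, <tt>sample_csv</tt> would be replaced by <tt>open("page_data.csv", newline="")</tt> and <tt>sample_responses</tt> by the results from whichever option you chose. The <tt>failed</tt> list can then be written to a separate file or printed in the notebook.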