CommunityData:Dataset And Tools Release 2018

In summer 2018, [[People#Nathan_TeBlunthuis_.28University_of_Washington.29 | Nate]] is leading efforts to improve the code our research group uses to generate datasets from raw MediaWiki dumps. The end goal is to release both the code and the datasets generated from Wikia and Wikipedia wikis, and to publish a data descriptor. This page documents these efforts.


== Overview ==
# Wiki level edits: for each wiki, a table where each row corresponds to an edit.
# Wiki level edit weeks: edit data aggregated by each week.  
# [[User level edits | user level mediawiki datasets]]: for each user, a table where each row corresponds to an edit.  
# User level edit weeks: user level edits aggregated by week.
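
The relationship between the edit tables and the edit-week tables above can be sketched in a few lines of Python. This is only an illustration, not the actual build_edit_weeks code: the function names <code>week_start</code> and <code>aggregate_edit_weeks</code>, the column names (<code>wiki</code>, <code>user</code>, <code>timestamp</code>, <code>n_edits</code>), and the choice of Monday as the start of a week are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def week_start(timestamp):
    """Return the ISO date of the Monday that begins the week of `timestamp`.

    Assumes MediaWiki-style UTC timestamps such as "2018-06-06T12:30:00Z".
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    monday = ts.date() - timedelta(days=ts.weekday())
    return monday.isoformat()


def aggregate_edit_weeks(edits, group_field="wiki"):
    """Count edits per (group, week).

    `edits` is an iterable of dicts, one per edit (hypothetical schema).
    `group_field` is "wiki" for wiki-level weeks or "user" for user-level weeks.
    """
    counts = Counter(
        (edit[group_field], week_start(edit["timestamp"])) for edit in edits
    )
    return [
        {group_field: group, "week": week, "n_edits": n}
        for (group, week), n in sorted(counts.items())
    ]


# Hypothetical edit rows; real edit tables would carry many more columns.
example_edits = [
    {"wiki": "examplewiki", "user": "A", "timestamp": "2018-06-04T10:00:00Z"},
    {"wiki": "examplewiki", "user": "B", "timestamp": "2018-06-10T23:59:59Z"},
    {"wiki": "examplewiki", "user": "A", "timestamp": "2018-06-11T00:00:00Z"},
]
wiki_weeks = aggregate_edit_weeks(example_edits, group_field="wiki")
user_weeks = aggregate_edit_weeks(example_edits, group_field="user")
```

The same function covers both the wiki-level and the user-level case because the two edit-week datasets differ only in the grouping key.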


# Document the data sets with codebooks and example code.  
# Document the use and development of wikiq and build_edit_weeks to support future maintainers.
# Publicize the work through a dataset descriptor and presentations.


=== Tasks ===  
# Design user-level datasets (beginning with Jeremy and Kaylea's work).
# Add variables to wikiq according to requirements.
# Create unit test suites for wikiq and build_edit_weeks.
# Refactor wikiq to produce user-level datasets.
# Refactor wikiq to migrate from python-mediawiki-utilities to the new mediawiki-utilities projects.  
# Document release version of build_edit_weeks.  
# Generate datasets from Wikia dumps and Wikipedia language editions.
# Write a data descriptor to accompany the release of the code and data, publicize the release, and explain the contribution.
# Give presentations and tutorials to consumers of the work (UCM, WMF, CDSC).
 
=== Schedule ===
[[File:Gantt_schedule.svg]]
 
 
=== Ideas for New Wikiq Variables ===
# Regex language for building new variables.
# Variable indicating which sections were changed in the edit (built around the regex language)?
# Variable indicating which WikiProjects, if any, are associated with the page (Wikipedia only; hard, probably requires two passes).
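
To make the first two ideas above more concrete, here is a rough Python sketch of how regex-defined variables and a "sections changed" variable might be computed from revision text. It is not part of wikiq: the names <code>regex_variables</code>, <code>split_sections</code>, and <code>changed_sections</code>, the heading pattern, and the example <code>has_infobox</code> variable are all hypothetical, and the sketch compares two revision texts directly rather than reading from a dump.

```python
import re

# Matches wikitext headings such as "== History ==" or "=== Early life ===".
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def regex_variables(wikitext, patterns):
    """Return one boolean variable per named pattern: does it match the text?"""
    return {name: bool(p.search(wikitext)) for name, p in patterns.items()}


def split_sections(wikitext):
    """Map section title -> section body.

    Text before the first heading is stored under the empty title "".
    If a title repeats, the later section overwrites the earlier one.
    """
    sections = {}
    title, start = "", 0
    for match in HEADING.finditer(wikitext):
        sections[title] = wikitext[start:match.start()]
        title, start = match.group(2), match.end()
    sections[title] = wikitext[start:]
    return sections


def changed_sections(old_text, new_text):
    """List titles of sections that were added, removed, or modified."""
    before, after = split_sections(old_text), split_sections(new_text)
    titles = before.keys() | after.keys()
    return sorted(t for t in titles if before.get(t) != after.get(t))


# Hypothetical revision texts for illustration.
old_rev = "Intro\n== History ==\nOld text\n== Usage ==\nSame\n"
new_rev = "Intro\n== History ==\nNew text\n== Usage ==\nSame\n"
patterns = {"has_infobox": re.compile(r"\{\{\s*Infobox", re.IGNORECASE)}

sections_touched = changed_sections(old_rev, new_rev)
flags = regex_variables("{{Infobox person}}\nSome text", patterns)
```

A section-level variable like this needs the previous revision's text as well as the current one, so any real implementation would have to keep that state while streaming through a page's revisions.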
 
 
=== Requests ===
 
# Handle default edits that initialize wikis.