CommunityData:Dataset And Tools Release 2018

Revision as of 22:43, 29 June 2018

In summer 2018, Nate is leading an effort to improve the code our research group uses to generate datasets from raw MediaWiki dumps. The end goal is to release both the code and the datasets generated from Wikia and Wikipedia wikis, and to publish a data descriptor. This page documents these efforts.

Overview

There are four types of datasets we will support:

  1. Wiki-level edits: for each wiki, a table where each row corresponds to one edit.
  2. Wiki-level edit weeks: the edit data aggregated by week.
  3. User-level edits: for each user, a table where each row corresponds to one edit.
  4. User-level edit weeks: user-level edits aggregated by week.
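To make the relationship between the edit-level and edit-weeks datasets concrete, here is a minimal sketch of the aggregation using only the Python standard library. The rows and column layout are illustrative assumptions, not the release schema:

```python
from collections import Counter
from datetime import datetime

# Illustrative edit-level rows: (wiki, user, timestamp of edit).
edits = [
    ("examplewiki", "alice", "2018-06-04T12:00:00Z"),
    ("examplewiki", "alice", "2018-06-05T09:30:00Z"),
    ("examplewiki", "bob",   "2018-06-11T17:45:00Z"),
]

def edit_weeks(rows):
    """Aggregate per-edit rows into (wiki, ISO year, ISO week) -> edit count."""
    counts = Counter()
    for wiki, user, ts in rows:
        when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        iso_year, iso_week, _ = when.isocalendar()
        counts[(wiki, iso_year, iso_week)] += 1
    return counts

print(edit_weeks(edits))
# The first two edits fall in ISO week 23 of 2018, the third in week 24.
```

The user-level variants follow the same pattern with the user (or the wiki-user pair) in the grouping key.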

The program wikiq is used to generate the wiki-level and (in the near future) user-level edit datasets. Wikiq is a Python script meant to be used through a command-line interface. It depends on functionality from Mediawiki Utilities.
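The transformation wikiq performs — reading a MediaWiki XML dump and emitting one row per edit — can be sketched as follows. This is an illustrative stand-in, not wikiq's actual implementation (which builds on Mediawiki Utilities): it uses only the standard library, omits the XML namespace real dumps carry, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a MediaWiki XML dump. Element names (page, revision,
# timestamp, contributor/username) follow the MediaWiki export format, but
# real dumps wrap them in an export-schema XML namespace omitted here.
DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <timestamp>2018-06-04T12:00:00Z</timestamp>
      <contributor><username>alice</username></contributor>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2018-06-05T09:30:00Z</timestamp>
      <contributor><username>bob</username></contributor>
    </revision>
  </page>
</mediawiki>"""

def edit_rows(xml_text):
    """Yield one (title, revision id, timestamp, editor) tuple per edit."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            yield (
                title,
                rev.findtext("id"),
                rev.findtext("timestamp"),
                rev.findtext("contributor/username"),
            )

for row in edit_rows(DUMP):
    print(row)
```

Each yielded tuple corresponds to one row of the wiki-level edits table described above.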

The build_edit_weeks.R script is used to generate edit-weeks datasets from wikiq output. Currently, copies of this script are scattered across projects.

Goals

  1. Improve the reliability, usability, and maintainability of our software utilities for generating datasets.
  2. Cover important and useful analytic variables in wikiq and build_edit_weeks.
  3. Document the data sets with codebooks and example code.
  4. Document the use and development of wikiq and build_edit_weeks to support future maintainers.
  5. Publicize the work through a dataset descriptor and presentations.

Tasks

  1. Talk to potential users of the code and datasets including the research teams at CDSC, UCM and Wikimedia to gather requirements.
  2. Design user-level datasets (beginning with Jeremy and Kaylea's work).
  3. Add variables to wikiq according to requirements.
  4. Refactor wikiq to produce user-level datasets.
  5. Refactor wikiq to migrate from python-mediawiki-utilities to the new mediawiki-utilities projects.
  6. Refactor build_edit_weeks into the RCommunity data repository and support usability via a command-line interface.
  7. Document the release version of wikiq.
  8. Document the release version of build_edit_weeks.
  9. Generate datasets from Wikia dumps and Wikipedia language editions.
  10. Write a data descriptor to accompany the release of the code and data, publicizing the release and explaining the contribution.
  11. Give presentations and tutorials to consumers of the work (UCM, WMF, CDSC).
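Task 6 calls for making build_edit_weeks usable from the command line. The released interface is not yet decided; as one possible shape, here is a hypothetical argparse-based CLI in Python (the flags and defaults are assumptions, not the tool's actual options):

```python
import argparse

def build_parser():
    """Hypothetical command-line interface for an edit-weeks builder.

    Every flag here is an assumption for illustration, not the released
    interface of build_edit_weeks.
    """
    parser = argparse.ArgumentParser(
        prog="build_edit_weeks",
        description="Aggregate wikiq edit tables into weekly summaries.",
    )
    parser.add_argument("input", help="edit-level table produced by wikiq")
    parser.add_argument("-o", "--output", default="edit_weeks.tsv",
                        help="where to write the weekly table")
    parser.add_argument("--user-level", action="store_true",
                        help="aggregate per user instead of per wiki")
    return parser

args = build_parser().parse_args(["edits.tsv", "-o", "weeks.tsv"])
print(args.input, args.output, args.user_level)
```

A stable CLI of this kind would let the scattered per-project copies of the script be replaced by one documented entry point.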