Wikiq

Wikiq is our tool for building tabular datasets from raw mediawiki edit data. Mediawiki outputs xml dump files, but these files are not easy to work with, particularly because they contain the full text of every revision to every page. This makes processing large wikis computationally expensive and leads to other technical problems. Wikiq efficiently processes mediawiki xml dumps to produce much smaller datasets that contain only the variables useful in our research. Nate is working this summer on improvements to wikiq. Let him know if you have any requests!

Pattern matching
We are currently working on adding a general-purpose pattern matching feature to wikiq. Design Doc

New Wikiq in 2018
There will be some breaking changes to wikiq in summer 2018. If you want the new wikiq, with improved persistence measures, use the new version. Currently, the stable version of the new wikiq is called  on hyak. At some point we will switch to calling the old version  and the new version will be .

See Also: CommunityData:Dataset_And_Tools_Release_2018

Setting up Wikiq
Wikiq is a python3 program with dependencies. To run it on Hyak, for now, you will need to install the dependencies using

Command Line Arguments
Some important command line flags control the behavior of wikiq and change which variables are output.

This is used to safely handle text which might contain unicode characters that conflict with other parsing systems. You will probably want to url-decode these columns when you read them.
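For example, here is a minimal sketch of url-decoding these columns when reading wikiq's tab-separated output (the sample values and the choice of columns are assumptions for illustration, not wikiq's exact output):

```python
import csv
import io
from urllib.parse import unquote

# A small sample in the tab-separated shape wikiq emits. The column
# names follow the codebook below; the encoded values are made up.
sample = "title\teditor\nTalk%3AMain_Page\tExample%20User\n"

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
# Url-decode every field as it is read.
rows = [{k: unquote(v) for k, v in row.items()} for row in reader]

print(rows[0]["title"])   # Talk:Main_Page
print(rows[0]["editor"])  # Example User
```

The same `unquote` call works whether you read the file with the csv module or apply it column-wise in pandas.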

This is somewhat costly and slow to compute. You can specify the ,  , or   methods of calculating persistence. Segment is the default and recommended method, but it is somewhat slower than sequence. Segment persistence is a faster, but marginally less accurate, version of the algorithm presented in this [https://arxiv.org/abs/1703.08244 paper].

This can be useful for addressing issues with text persistence measures.



The ID of a namespace to include. Can be specified more than once. For some wikis (e.g., large Wikipedias), computing persistence for the project namespace can be extremely slow.

Codebook
The current version of wikiq provides one row for each edit (unless --collapse-user is passed, in which case each row corresponds to consecutive edits by the same editor). The columns are: anon, articleid, collapsed_revs, date_time, deleted, editor, editor_id, minor, namespace, revert, reverteds, revid, sha1, text_chars, title, token_revs, tokens_added, tokens_removed, tokens_window.

The meaning of the variables is:

(for the namespace variable, see https://www.mediawiki.org/wiki/Manual:Namespace#Built-in_namespaces for the built-in namespace IDs)

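As an illustration of working with these variables, here is a minimal sketch that tallies edits and reverts per editor from wikiq's tab-separated output (the sample rows and the TRUE/FALSE encoding of the revert column are assumptions for illustration):

```python
import csv
import io
from collections import Counter

# Toy rows using a subset of the columns listed above; values are made up.
sample = (
    "editor\trevert\tdate_time\n"
    "alice\tFALSE\t2018-01-01 00:00:00\n"
    "bob\tTRUE\t2018-01-01 00:05:00\n"
    "alice\tFALSE\t2018-01-01 00:10:00\n"
)

edits = Counter()
reverts = Counter()
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    edits[row["editor"]] += 1
    if row["revert"] == "TRUE":
        reverts[row["editor"]] += 1

print(edits["alice"], reverts["bob"])  # 2 1
```

For real datasets you would stream the file directly rather than building the string in memory, but the per-row logic is the same.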
The following variables refer to persistent word revisions (PWR) and are only provided when wikiq is called with the  argument:

This is the key PWR variable.
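As a rough sketch of the idea only (this is not wikiq's segment or sequence algorithm), token persistence can be thought of as counting how many later revisions, within a window, still contain each token an edit added:

```python
# Toy illustration of persistent word revisions (PWR). A token first
# added in revision r earns one "token rev" for each subsequent
# revision (within a window) in which it still appears. Real
# implementations track token identity and position; this set-based
# check is a deliberate simplification.
revisions = [
    "the cat sat",
    "the cat sat down",   # adds "down"
    "the dog sat down",   # removes "cat", adds "dog"
]

def token_revs(added_in, revisions, window=2):
    """Count survivals of tokens first added in revision `added_in`."""
    before = set(revisions[added_in - 1].split()) if added_in else set()
    added = set(revisions[added_in].split()) - before
    later = revisions[added_in + 1 : added_in + 1 + window]
    return sum(tok in rev.split() for rev in later for tok in added)

print(token_revs(1, revisions))  # "down" survives 1 later revision -> 1
```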

The following variables are output when wikiq is called with the  argument:

Bugs

 * Not all anonymous edits get flagged as anon. Checking whether the editor name is an IP address seems to work (not confirmed).
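A hedged sketch of the IP-address check mentioned above, using Python's ipaddress module (treating any editor name that parses as an IPv4/IPv6 address as anonymous is an assumption; wikiq's actual logic may differ):

```python
import ipaddress

def looks_anonymous(editor_name):
    """Return True if the editor name parses as an IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(editor_name)
        return True
    except ValueError:
        return False

print(looks_anonymous("192.168.0.1"))  # True
print(looks_anonymous("2001:db8::1"))  # True
print(looks_anonymous("ExampleUser"))  # False
```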