[https://code.communitydata.cc/mediawiki_dump_tools.git Wikiq] is our tool for building tabular datasets from raw MediaWiki edit data. MediaWiki outputs XML dump files, but these files are not easy to work with, particularly because they contain the full text of every revision to every page. This makes it computationally expensive to process large wikis and leads to other technical problems. Wikiq efficiently processes MediaWiki XML dumps to produce much smaller datasets that contain only the variables that will be useful in our research. [[User:Groceryheist|Nate]] is working this summer on improvements to wikiq. Let him know if you have any requests!


== New feature for Wikiq 2019: pattern matching ==


We have recently added a general-purpose pattern matching (regular expression) feature to wikiq. The design doc can be seen [[Wikiq Pattern Matching | here]], and more information is given in the Command Line Arguments and Codebook sections below.


See Also: [[CommunityData:Dataset_And_Tools_Release_2018]]
Wikiq is a python3 program with dependencies. To run on Hyak, for now, you will need to install the dependencies using  
<code> pip install --user mwxml pandas git+https://github.com/mediawiki-utilities/python-mwpersistence.git mediawiki-utilities pymysql </code>  


== Command Line Arguments ==


<code>-n, --namespace-include</code> ID of a namespace to include. Can be specified more than once. For some wikis (e.g. large Wikipedias) computing persistence for the project namespace can be extremely slow.
=== Pattern matching arguments ===
Users can now search for patterns in revision text. For each edit, wikiq outputs a list of matches in a separate column for each pattern given with the arguments below. Users may provide multiple revision patterns and accompanying labels. The patterns and the labels must be provided in the same order so that wikiq can correctly label the output columns.
<code>-RP</code> <code>--revision-pattern</code>: a regular expression
<code>-RPl</code> <code>--revision-pattern-label</code>: a label for the columns output based on matching revisions against the pattern.
In addition to revision text, wikiq supports pattern matching against revision ''summaries'' (comments), using the corresponding command line arguments:
<code>-CP</code> <code>--comment-pattern</code>: a regular expression
<code>-CPl</code> <code>--comment-pattern-label</code>: a label for the columns output based on matching revision comments against the pattern.
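
The requirement that patterns and labels be given in the same order can be illustrated with a small sketch. This is not wikiq's actual argument handling: the function <code>parse_pattern_args</code> and the <code>dest</code> names are hypothetical, and only the flag names come from the list above. It assumes each flag may be repeated and that the ''n''-th label belongs to the ''n''-th pattern. The comment pattern flags could be handled the same way.

```python
import argparse


def parse_pattern_args(argv):
    """Hypothetical helper: collect repeated -RP/-RPl flags and pair them by position."""
    parser = argparse.ArgumentParser()
    # action="append" keeps every occurrence of a flag, in command-line order.
    parser.add_argument("-RP", "--revision-pattern",
                        dest="revision_patterns", action="append", default=[])
    parser.add_argument("-RPl", "--revision-pattern-label",
                        dest="revision_labels", action="append", default=[])
    args = parser.parse_args(argv)

    # Assumption: every pattern needs exactly one label.
    if len(args.revision_patterns) != len(args.revision_labels):
        raise ValueError("each pattern needs exactly one label")

    # The n-th label is attached to the n-th pattern, so order matters.
    return list(zip(args.revision_labels, args.revision_patterns))


pairs = parse_pattern_args(
    ["-RP", r"\bcat\b", "-RPl", "cat", "-RP", r"\d+", "-RPl", "num"]
)
```

Swapping the two <code>-RPl</code> values in this call would silently attach the label <code>num</code> to the pattern <code>\bcat\b</code>, which is why the order has to match.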
==== A note on named capture groups in pattern matching ====
The regular expressions in <code>-RP</code> and <code>-CP</code> may include one or more [https://docs.python.org/3.7/howto/regex.html#non-capturing-and-named-groups named capture groups]. If the <code>pattern</code> matches, it will also capture values for each named capture group. If a <code>pattern</code> has one or more ''named capture groups'', wikiq will output a new column for each named capture group to store these values, with the column named <code>&lt;pattern-label&gt;_&lt;capture-group-name&gt;</code>. Since a <code>pattern</code> can match a revision more than once, more than one value may go in this column (whether or not named capture groups are used).
For cases in which the <code>-RP</code> or <code>-CP</code> pattern has more than one named capture group and part of the string being searched could match more than one capture group, only the first capture group will register the match, because matching consumes characters in Python. For example, if the regular expression is <code>r"(?P<three_letters>\b\w{3}\b)|(?P<number>\b\d+\b)"</code> (Python group names must be valid identifiers, so they cannot begin with a digit) and the string being searched is <code>dog and 500 bits of kibble</code>, then <code>500</code> fits both <code>three_letters</code> and <code>number</code>. However, the capture group listed first (<code>three_letters</code>) consumes <code>500</code> when it matches, so the <code>three_letters</code> column will contain the list <code>[dog, and, 500]</code> while the <code>number</code> column will simply have <code>None</code>. As a result, one should consider the order of capture groups or create separate regular expression and label pairs.
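
The ordering effect described above can be checked directly with Python's <code>re</code> module. The sketch below is independent of wikiq; the helper <code>matches_by_group</code> is a hypothetical name used only for illustration. The first group is called <code>three_letters</code> because Python requires group names to be valid identifiers.

```python
import re

pattern = re.compile(r"(?P<three_letters>\b\w{3}\b)|(?P<number>\b\d+\b)")
text = "dog and 500 bits of kibble"


def matches_by_group(pattern, text):
    """Collect the values captured by each named group; None if a group never captured."""
    found = {name: [] for name in pattern.groupindex}
    for match in pattern.finditer(text):
        for name, value in match.groupdict().items():
            if value is not None:
                found[name].append(value)
    return {name: (values or None) for name, values in found.items()}


# One combined pattern: the first alternative claims "500".
combined = matches_by_group(pattern, text)
# combined == {"three_letters": ["dog", "and", "500"], "number": None}

# Two separate patterns: each one sees the whole string.
letters_only = re.findall(r"\b\w{3}\b", text)   # ["dog", "and", "500"]
numbers_only = re.findall(r"\b\d+\b", text)     # ["500"]
```

Splitting the expression into two pattern and label pairs, as in the last two lines, lets <code>500</code> appear under both labels.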


== Codebook ==
The current version of wikiq provides one row for each edit (unless --collapse-user is passed, in which case each row corresponds to consecutive edits by the same editor), with columns for the following variables:  
anon    articleid    collapsed_revs    date_time    deleted    editor    editor_id    minor    namespace    revert    reverteds    revid    sha1    text_chars    title    token_revs    tokens_added    tokens_removed    tokens_window




<code>collapsed_revs</code> : The number of consecutive revisions the editor made that have been collapsed into the row.
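
As a rough illustration of what collapsing means, the sketch below merges runs of consecutive edits by the same editor and counts them. It is not wikiq's implementation: the function <code>collapse_user</code>, the sample data, and the choice to keep the fields of the last revision in each run are all assumptions made for the example.

```python
from itertools import groupby


def collapse_user(edits):
    """Merge each run of consecutive edits by the same editor into one row."""
    rows = []
    for _editor, run in groupby(edits, key=lambda edit: edit["editor"]):
        run = list(run)
        row = dict(run[-1])               # assumption: keep the last revision's fields
        row["collapsed_revs"] = len(run)  # how many revisions the row stands for
        rows.append(row)
    return rows


# Made-up edits to a single page, in chronological order.
edits = [
    {"revid": 1, "editor": "A"},
    {"revid": 2, "editor": "A"},
    {"revid": 3, "editor": "B"},
    {"revid": 4, "editor": "A"},
]
collapsed = collapse_user(edits)
```

Only ''consecutive'' edits are merged, so editor A's later edit (revid 4) stays in its own row.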
The following variables are output when wikiq is called with the pattern matching arguments:
<code>[label]</code> from <code>-CPl/--comment-pattern-label or -RPl/--revision-pattern-label</code> : A list of the matches of the pattern given for this label found in that edit's revision text or comment (whichever was specified). If none are found, None.
<code>[label]_[named_capture_group]</code> : A list of the matches for the named capture group in the pattern given for this label in the edit's revision text or comment (whichever was specified). If none are found, None.
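
The two kinds of columns described above can be sketched as follows. This is an illustration of the naming scheme only, not wikiq's code: the function <code>pattern_columns</code>, the label <code>bug</code>, and the sample pattern and text are hypothetical, and the sketch assumes the <code>[label]</code> column holds the full text of each match.

```python
import re


def pattern_columns(label, pattern, text):
    """Hypothetical sketch: build the [label] and [label]_[group] values for one text."""
    regex = re.compile(pattern)
    matches = list(regex.finditer(text))

    # [label]: every full match of the pattern, or None when there are none.
    row = {label: [m.group(0) for m in matches] or None}

    # [label]_[named_capture_group]: values captured by each named group.
    for name in regex.groupindex:
        values = [m.group(name) for m in matches if m.group(name) is not None]
        row[f"{label}_{name}"] = values or None
    return row


row = pattern_columns("bug", r"[Bb]ug (?P<id>\d+)", "Fix bug 12 and Bug 7")
empty = pattern_columns("bug", r"[Bb]ug (?P<id>\d+)", "Tidy whitespace")
```

Here <code>row</code> has a <code>bug</code> column with both full matches and a <code>bug_id</code> column with the two captured numbers, while every column in <code>empty</code> is <code>None</code>.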


== Bugs ==  