Wikiq: Difference between revisions

From CommunityData
(Create page, explain important arguments and a codebook.)
 
(Improve formatting)
Revision as of 06:36, 10 July 2018

Wikiq (https://code.communitydata.cc/mediawiki_dump_tools.git) is our tool for building tabular datasets from raw MediaWiki edit data. MediaWiki outputs XML dump files, but these files are hard to work with, particularly because they contain the full text of every revision of every page. This makes large wikis computationally expensive to process and leads to other technical problems. Wikiq efficiently processes MediaWiki XML dumps to produce much smaller datasets that contain only the variables useful in our research. Nate is working this summer on improvements to wikiq. Let him know if you have any requests!

Setting up Wikiq

Wikiq is a Python 3 program with several dependencies. To run it on Hyak, for now, you will need to install them with:

pip install --user mwxml pandas git+https://github.com/mediawiki-utilities/python-mwpersistence.git mediawiki-utilities pymysql

Some important command line flags control the behavior of wikiq and change which variables are output.

--url-encode : Recommended. Pass this flag to URL-encode text fields (page titles, editor names). This safely handles text containing Unicode characters that might conflict with other parsing systems. You will probably want to URL-decode these columns when you read them.

--persistence : Compute persistent word revisions, a useful measure of contribution quality, for each edit. This is costly and slow to compute.

--collapse-user : Operate only on the final revision in each sequence of consecutive edits made by the same user. This can be useful for addressing issues with text persistence measures.

--help : Get help using Wikiq.
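
The URL-decoding step recommended for --url-encode can be sketched as below. This is a minimal illustration, assuming the encoded fields use standard percent-encoding that Python's urllib.parse.unquote can reverse; the helper name decode_field and the sample title are hypothetical and do not come from wikiq itself.

```python
from urllib.parse import quote, unquote


def decode_field(value: str) -> str:
    """Reverse percent-encoding on a single text field (title or editor name)."""
    return unquote(value)


# Hypothetical title, encoded here with quote() only to have something to decode.
# Assumption: wikiq's encoded fields are compatible with this scheme.
encoded_title = quote("Zürich/Café culture", safe="")
decoded_title = decode_field(encoded_title)
```

Applying the same helper to both the title and editor columns when loading the data keeps the decoded text out of the file itself, where tabs or quotes in names could otherwise confuse a parser.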

The current version of wikiq provides one row for each edit (unless --collapse-user is passed, in which case each row corresponds to a run of consecutive edits by the same editor). The columns are:

anon articleid collapsed_revs date_time deleted editor editor_id minor namespace revert reverteds revid sha1 text_chars title token_revs tokens_added tokens_removed tokens_window
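
As a rough sketch of working with this one-row-per-edit layout, the snippet below reads a table with Python's csv module and counts revisions per page. It assumes the output is tab-separated with a header row and that booleans appear as TRUE/FALSE; the two sample rows are made up for illustration and are not real wikiq output.

```python
import csv
import io

# Made-up sample using a subset of the columns listed above.
# Assumptions: tab-separated values, header row, TRUE/FALSE booleans.
SAMPLE = (
    "anon\tarticleid\tdate_time\teditor\tnamespace\trevid\ttext_chars\n"
    "FALSE\t12\t2018-07-01 10:00:00\tAlice\t0\t100\t512\n"
    "TRUE\t12\t2018-07-02 11:30:00\t\t0\t101\t498\n"
)


def read_rows(handle):
    """Return every row of the table as a dict keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_rows(io.StringIO(SAMPLE))

# Count how many rows (edits) each page has.
revisions_per_article = {}
for row in rows:
    key = row["articleid"]
    revisions_per_article[key] = revisions_per_article.get(key, 0) + 1
```

For a real file, the same read_rows function can be given an open file handle instead of the in-memory sample.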

The meaning of the variables is:

anon : Whether the editor is anonymous

articleid : Unique identifier for the page

date_time : Timestamp of the edit

deleted : Whether the edit was deleted

editor : The user name of the editor

editor_id : Unique identifier for the editor

minor : Whether the edit is minor

namespace : Id of the namespace. (see https://www.mediawiki.org/wiki/Manual:Namespace#Built-in_namespaces)

revert : Whether the edit is an identity revert of one or more other edits.

reverteds : The ids of the edits that were reverted.

revid : Unique identifier of the revision.

sha1 : Hash of the article text of the revision

text_chars : Length of the article in characters following the revision
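
To illustrate how namespace, revert, and reverteds can be combined, the sketch below keeps only main-namespace edits (namespace 0 in MediaWiki) that were not later reverted. It assumes reverteds holds a comma-separated list of revision ids; that format, the helper name reverted_ids, and the sample rows are assumptions for illustration only.

```python
def reverted_ids(rows):
    """Collect the ids of all revisions that some later edit reverted."""
    ids = set()
    for row in rows:
        if row["reverteds"]:
            # Assumption: multiple ids are separated by commas.
            ids.update(row["reverteds"].split(","))
    return ids


# Made-up rows, already parsed into dicts keyed by column name.
rows = [
    {"revid": "100", "namespace": "0", "revert": "FALSE", "reverteds": ""},
    {"revid": "101", "namespace": "0", "revert": "FALSE", "reverteds": ""},
    {"revid": "102", "namespace": "0", "revert": "TRUE", "reverteds": "101"},
    {"revid": "103", "namespace": "1", "revert": "FALSE", "reverteds": ""},
]

was_reverted = reverted_ids(rows)

# Main-namespace edits whose revid never appears in any reverteds field.
surviving_main = [
    r["revid"]
    for r in rows
    if r["namespace"] == "0" and r["revid"] not in was_reverted
]
```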


The following variables refer to persistent word revisions (PWR) and are only provided when wikiq is called with the --persistence argument:

token_revs : The number of "token revisions" contributed by the edit. This is the key PWR variable.

tokens_added : The number of tokens added by the edit.

tokens_removed : The number of tokens removed by the edit.

tokens_window : The maximum number of revisions examined in computing token revisions.
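
One common use of the PWR variables is to total token revisions per editor. The sketch below does this over rows already parsed into dicts, using the column name token_revs as spelled in the header row; the helper name and the sample values are made up for illustration.

```python
from collections import defaultdict


def token_revs_by_editor(rows):
    """Sum the token_revs column for each editor."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["editor"]] += int(row["token_revs"])
    return dict(totals)


# Made-up rows; real values come from running wikiq with --persistence.
rows = [
    {"editor": "Alice", "token_revs": "40"},
    {"editor": "Bob", "token_revs": "5"},
    {"editor": "Alice", "token_revs": "12"},
]

pwr_totals = token_revs_by_editor(rows)
```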


The following variables are output when wikiq is called with the --collapse-user argument:

collapsed_revs : The number of consecutive revisions the editor made that have been collapsed into the row.
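
Because each row under --collapse-user stands for several underlying edits, counting rows undercounts revisions. A minimal sketch of recovering the underlying revision count per editor by summing collapsed_revs follows; the helper name and the sample rows are hypothetical.

```python
def underlying_revisions(rows):
    """Total the revisions folded into each editor's collapsed rows."""
    totals = {}
    for row in rows:
        editor = row["editor"]
        totals[editor] = totals.get(editor, 0) + int(row["collapsed_revs"])
    return totals


# Made-up collapsed rows: Alice has two runs of edits, Bob has one.
rows = [
    {"editor": "Alice", "collapsed_revs": "3"},
    {"editor": "Bob", "collapsed_revs": "1"},
    {"editor": "Alice", "collapsed_revs": "2"},
]

revision_counts = underlying_revisions(rows)
```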