Wikiq
[https://code.communitydata.cc/mediawiki_dump_tools.git Wikiq] is our tool for building tabular datasets from raw MediaWiki edit data. MediaWiki outputs XML dump files, but these files are difficult to work with, particularly because they contain the full text of every revision to every page. This makes it computationally expensive to process large wikis and leads to other technical problems. Wikiq efficiently processes MediaWiki XML dumps to produce much smaller datasets that contain only the variables useful in our research. [[User:Groceryheist|Nate]] is working this summer on improvements to wikiq. Let him know if you have any requests!
== New feature for Wikiq 2019 ==
We have recently added a general-purpose pattern-matching (regular expressions) feature to wikiq. The design doc can be seen [[Wikiq Pattern Matching | here]] and more information is given in the Command Line Arguments and Codebook sections below.
See also: [[CommunityData:Dataset_And_Tools_Release_2018]]
== Setting up Wikiq ==
Wikiq is a python3 program with dependencies. To run on Hyak, for now, you will need to install the dependencies using

<code>pip install --user mwxml pandas git+https://github.com/mediawiki-utilities/python-mwpersistence.git mediawiki-utilities pymysql</code>
== Command Line Arguments ==
Some important command line flags control the behavior of wikiq and change which variables are output.
<code>--url-encode</code> : ''Recommended.'' Pass this in to url-encode text fields (page titles, editor names). This safely handles text that might contain unicode characters that conflict with other parsing systems. You will probably want to url-decode these columns when you read them.
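Decoding those columns when you read the output back in takes only the standard library. A minimal sketch (the rows below are toy data standing in for wikiq's tab-separated output; the column names come from the codebook below):

```python
import csv
import io
import urllib.parse

# Toy tab-separated rows standing in for wikiq output produced with --url-encode.
raw = "title\teditor\nCaf%C3%A9\tGrocery%20heist\nMain_Page\t127.0.0.1\n"
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# url-decode the text fields that --url-encode encoded
for row in rows:
    row["title"] = urllib.parse.unquote(row["title"])
    row["editor"] = urllib.parse.unquote(row["editor"])

print(rows[0])  # {'title': 'Café', 'editor': 'Grocery heist'}
```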
<code>--persistence</code> : Compute persistent word revisions, a useful measure of contribution quality, for each edit. This is somewhat costly and slow to compute. You can specify <code>segment</code>, <code>sequence</code>, or <code>legacy</code> methods of calculating persistence. Segment is the default and recommended method, but it is somewhat slower than sequence. Segment persistence is a faster, but marginally less accurate, version of the algorithm presented in this [https://arxiv.org/abs/1703.08244 paper].
<code>--collapse-user</code> : Operate only on the final revision in each sequence of consecutive edits made by the same user. This can be useful for addressing issues with text persistence measures.
<code>--help</code> : Get help using Wikiq.
<code>-n, --namespace-include</code> : Id of a namespace to include. Can be specified more than once. For some wikis (e.g. large Wikipedias) computing persistence for the project namespace can be extremely slow.
=== Pattern matching arguments ===
Users can now search for patterns in edit revision text, with a list of matches for each edit output in columns (one column per pattern, indicated by the pattern arguments below). Users may provide multiple revision patterns and accompanying labels. The patterns and labels must be provided in the same order for wikiq to correctly label the output columns.
<code>-RP</code>, <code>--revision-pattern</code> : a regular expression
<code>-RPl</code>, <code>--revision-pattern-label</code> : a label for the columns output based on matching revisions against the pattern.
In addition to revision text, we also support pattern matching against revision ''summaries'' (comments). Therefore we also have corresponding command line arguments.
<code>-CP</code>, <code>--comment-pattern</code> : a regular expression
<code>-CPl</code>, <code>--comment-pattern-label</code> : a label for the columns output based on matching comments against the pattern.
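The positional pairing of patterns with labels described above can be sketched in Python as follows. This is an illustration of the pairing logic only, not wikiq's actual code, and the patterns and labels are hypothetical:

```python
import re

# Hypothetical -RP values and their matching -RPl labels, in the same order.
revision_patterns = [r"\[\[Category:", r"(?i)\bsockpuppet\b"]
revision_labels = ["category_link", "sockpuppet_mention"]

text = "Added [[Category:Dogs]] after the sockpuppet report."

# The i-th pattern's matches go in the column named by the i-th label.
columns = {}
for pattern, label in zip(revision_patterns, revision_labels):
    matches = re.findall(pattern, text)
    columns[label] = matches if matches else None

print(columns["sockpuppet_mention"])  # ['sockpuppet']
```

If the labels are supplied in a different order than the patterns, the matches end up under the wrong column names, which is why wikiq requires the two argument lists to line up.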
==== A note on named capture groups in pattern matching ====
The regular expressions in <code>-RP</code> and <code>-CP</code> may include one or more [https://docs.python.org/3.7/howto/regex.html#non-capturing-and-named-groups named capture groups]. If the <code>pattern</code> matches, it will then also capture values for each named capture group. If a <code>pattern</code> has one or more ''named capture groups'', wikiq will output a new column for each named capture group to store these values, named <code><pattern-label>_<capture-group-name></code>. Since a <code>pattern</code> can match a revision more than once, a column may contain more than one value (with or without named capture groups).
When a <code>-RP</code> or <code>-CP</code> pattern has more than one named capture group and part of the searched string matches more than one capture group, only the first capture group will indicate a match, because matching consumes characters in Python. (Note that Python group names must be valid identifiers, so they cannot begin with a digit.) For example, if the regular expression is <code>r"(?P<three_letters>\b\w{3}\b)|(?P<number>\b\d+\b)"</code> and the string being searched is <code>dog and 500 bits of kibble</code>, then <code>500</code> matches both <code>three_letters</code> and <code>number</code>. However, the capture group listed first (<code>three_letters</code>) consumes '500' when it matches, so the <code>three_letters</code> column will contain the list <code>[dog, and, 500]</code> while the <code>number</code> column will simply contain <code>None</code>. As a result, consider the order of capture groups, or create separate regular expression and label pairs.
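This consumption behavior can be checked directly with Python's <code>re</code> module. A standalone sketch, using a valid identifier for the three-letter group name since Python group names cannot begin with a digit:

```python
import re

# Two alternated named groups: once the first alternative matches a
# substring, the engine consumes those characters, so the second
# alternative never gets a chance to claim them.
pattern = re.compile(r"(?P<three_letters>\b\w{3}\b)|(?P<number>\b\d+\b)")
text = "dog and 500 bits of kibble"

three_letters = [m.group("three_letters") for m in pattern.finditer(text)]
numbers = [m.group("number") for m in pattern.finditer(text)]

print(three_letters)  # ['dog', 'and', '500']
print(numbers)        # [None, None, None]
```

Swapping the order of the two alternatives would instead assign '500' to the number group, so the ordering of alternated capture groups is a real modeling decision.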
== Codebook ==
The current version of wikiq provides one row for each edit (unless <code>--collapse-user</code> is passed, in which case each row corresponds to consecutive edits by the same editor), with columns for the following variables:
anon articleid collapsed_revs date_time deleted editor editor_id minor namespace revert reverteds revid sha1 text_chars title token_revs tokens_added tokens_removed tokens_window
The meaning of the variables is:

<code>anon</code> : Whether the editor is anonymous

<code>articleid</code> : Unique identifier for the page

<code>date_time</code> : Timestamp of the edit

<code>deleted</code> : Whether the edit was deleted

<code>editor</code> : The user name of the editor

<code>editor_id</code> : Unique identifier for the editor
<code>minor</code> : Whether the edit is minor
<code>namespace</code> : The namespace of the page. A namespace groups pages whose titles begin with a particular reserved word followed by a colon; for example, User:mkross is in the User namespace (see https://www.mediawiki.org/wiki/Manual:Namespace#Built-in_namespaces)
<code>revert</code> : Whether the edit is an identity revert of one or more other edits.
<code>reverteds</code> : The ids of the edits that were reverted.

<code>revid</code> : Unique identifier of the revision.

<code>sha1</code> : Hash of the article text of the revision
<code>text_chars</code> : Length of the article in characters following the revision
<code>title</code> : Text title of the page.
The following variables refer to persistent word revisions (PWR) and are only provided when wikiq is called with the <code>--persistence</code> argument:
<code>token_revs</code> : The number of "token revisions" contributed by the edit. This is the key PWR variable.

<code>tokens_added</code> : The number of tokens added by the edit.

<code>tokens_removed</code> : The number of tokens removed by the edit.

<code>tokens_window</code> : The maximum number of revisions examined in computing token revisions.
The following variables are output when wikiq is called with the <code>--collapse-user</code> argument:
<code>collapsed_revs</code> : The number of consecutive revisions the editor made that have been collapsed into the row.
The following variables are output when wikiq is called with the pattern matching arguments:
<code>[label]</code> (from <code>-CPl/--comment-pattern-label</code> or <code>-RPl/--revision-pattern-label</code>) : A list of the matches of the pattern given for this label found in that edit's revision text or comment (whichever was specified). If none are found, None.
<code>[label]_[named_capture_group]</code> : A list of the matches for the named capture group in the pattern given for this label, found in the edit's revision text or comment (whichever was specified). If none are found, None.
== Bugs ==
* Not all anonymous edits get flagged as anon. Editor name being an IP address seems to work (not confirmed). (Note: I've never seen a bug with this and I've done a lot of work with anon edits. -kc)
== Samples ==
Kaylea likes to use a script-generating script for wikiq.

Step 1: Create a script-generating script like this:
<nowiki>
#!/usr/bin/env python3
import os
import glob

## this script makes wikiq scripts for a given dump path
dumpHome = '/gscratch/comdata/raw_data/'
outPath = '/gscratch/comdata/output/'
langDump = dumpHome + 'enwiki_20230401'  # customize if needed

## customize output path
outPath = outPath + "wikiq_enwiki_name_this_something_useful/"

archives = glob.glob(langDump + "/*pages-meta-hist*.7z")  # makes a list of all the files, about 800 of them

if not os.path.exists(outPath):  # makes the dir for storing the output
    os.makedirs(outPath)

with open('run_wikiq.sh', 'w') as fh:  # creates a script
    for item in archives:
        # select options to customize the line below as needed;
        # as you see above, wikiq has a ton of options.
        # note that -o must be followed by outPath; if more command line
        # arguments are added, place them before the -o.
        # if you wanted to regex match misinf or disinf in the edit comment field:
        # fh.write(f"wikiq -u -CP '.*(misinf|disinf).*' -CPl comment -n 0 -n 1 -o {outPath} {item}\n")
        # a more typical wikiq invocation is this:
        fh.write(f"wikiq --collapse-user -u -o {outPath} {item}\n")
</nowiki>
Step 2: Use the <code>split</code> command to turn your giant run_wikiq.sh script into a bunch of smaller files, automatically named things like xaa, xab, xac. For example, to put 40 lines in each smaller script, do:
<nowiki>
split -l 40 run_wikiq.sh</nowiki>
After running split, if you type <code>ls</code>, you'll see the autonamed files, each containing part of your run_wikiq.sh script.
Step 3: You can now run the subchunks of your script, e.g. use tmux to log in to the same node 10-15 times, running <code>sh xaa</code> in the first one, <code>sh xab</code> in the second, and so on. This is more hands-on and not really a proper batch approach, but it lets you sail through certain kinds of disruptions while still getting your output quickly.
''Latest revision as of 22:00, 20 June 2023''