Wikiq Pattern Matching
Processing MediaWiki XML dumps is computationally expensive and technically nuanced. Wikiq is a Python program that provides a high-level command-line interface around robust MediaWiki utilities and reduces the huge XML dumps to TSV files with revision-level variables for downstream analysis. Because many people want to analyze edits to MediaWiki wikis, we can reduce collective effort by putting some extra engineering thought into wikiq to make it useful for a wide array of applications.
Right now, wikiq doesn't provide any information about the actual text of revisions. It only outputs metadata. This page documents a design for adding pattern matching functionality to wikiq that will make it possible to build variables based on whether the text of revisions (or comments) match regular expressions and the values captured by named capture groups.
We want to support a broad range of research applications that extend beyond the needs of any particular project. Since we intend to support this project in the long term, we want to minimize the complexity of our code implementation while maximizing the breadth of projects we can support.
Consider the following examples of analyses that will be enabled by this feature:
- Imagine a study of heterogeneous notions of quality between editors and wikiprojects that looks at how different wikiprojects have rated the same articles. Using pattern matching in wikiq, a researcher could compose regular expressions that match wikiproject templates left on talk pages and extract the name of each wikiproject and the associated quality rating.
- A researcher might be interested in the effects of warnings left by algorithmic tools. By writing regular expressions that match the warnings left by known tools on user talk pages, she can find each revision where an editor was warned.
- You might be interested in studying policy invocations. If you collect a list of policies of interest, you can build a regular expression to match any of them and capture which ones were added in a given revision or were referenced in a revision summary (comment).
These use cases are meant to illustrate the broad array of applications this feature can support. They are not meant to define a scope for the feature. The scope is defined by the API.
To enable users to search for patterns in the text of revisions we will provide two additional command line arguments.
--revision-pattern: a regular expression
--revision-pattern-label: a label for the columns output based on matching revisions against the pattern.
Users may provide multiple revision patterns and accompanying labels. The patterns and the labels must be provided in the same order for wikiq to be able to correctly label the output columns.
In addition to revisions, we also wish to support pattern matching against revision summaries. Therefore we will also add corresponding command line arguments.
--comment-pattern: a regular expression
--comment-pattern-label: a label for the columns output based on matching revision summaries (comments) against the pattern.
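The four arguments above could be declared roughly as follows. This is a minimal sketch, not wikiq's actual parser: using `action="append"` to collect repeated flags is an assumption, and the `dest` names and the `pair_patterns` helper are hypothetical names introduced only for illustration.

```python
import argparse


def build_parser():
    """Sketch of a parser exposing the proposed pattern options."""
    parser = argparse.ArgumentParser(description="sketch of the proposed wikiq options")
    for kind in ("revision", "comment"):
        # Assumption: action="append" gathers every occurrence of a flag
        # into a list, preserving the order given on the command line.
        parser.add_argument(f"--{kind}-pattern", dest=f"{kind}_patterns",
                            action="append", default=[],
                            help="a regular expression")
        parser.add_argument(f"--{kind}-pattern-label", dest=f"{kind}_labels",
                            action="append", default=[],
                            help="a label for the columns output for the pattern")
    return parser


def pair_patterns(patterns, labels):
    """Pair each pattern with its label by position (hypothetical helper)."""
    if len(patterns) != len(labels):
        raise ValueError("each pattern needs exactly one label, in the same order")
    return list(zip(labels, patterns))


args = build_parser().parse_args([
    "--revision-pattern", r"\{\{(?P<template>[^|}]+)",
    "--revision-pattern-label", "template",
    "--comment-pattern", r"WP:(?P<policy>\w+)",
    "--comment-pattern-label", "policy",
])
revision_pairs = pair_patterns(args.revision_patterns, args.revision_labels)
comment_pairs = pair_patterns(args.comment_patterns, args.comment_labels)
```

Checking that the two lists have equal length is one way to catch a missing label early, since the pairing relies entirely on argument order.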
Wikiq should output a column indicating whether the `pattern` matched a given revision, even if nothing matched any capture groups. This column should be named with the corresponding pattern label.
Patterns may include one or more named capture groups. If the `pattern` matches, it might also capture values for each named capture group. If a pattern has one or more named capture groups, wikiq needs to output a new column for each named capture group to store these values. The column should be named <pattern-label>_<capture-group-name>. Since a `pattern` can match a revision more than once, it is possible that more than one value should go in this column. Therefore the data type of this column will be a list. Downstream programs loading the output will have to be able to convert the text representation of a list into a list type in whatever language they are using.
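The column scheme described above can be sketched with Python's `re` module. This is an illustration under stated assumptions rather than wikiq's implementation: `match_columns` is a hypothetical helper, the wikiproject template syntax in the example is simplified, and encoding the list columns as JSON arrays is just one possible text representation.

```python
import json
import re


def match_columns(label, pattern, text):
    """Build the output columns for one pattern applied to one text.

    Returns a dict with a boolean column named `label` and, for every
    named capture group, a list column named `<label>_<group>`.
    """
    regex = re.compile(pattern)
    matches = list(regex.finditer(text))
    columns = {label: bool(matches)}
    # groupindex maps each named group to its position in the pattern.
    for group in regex.groupindex:
        # A group that did not take part in a match yields None; skip those.
        columns[f"{label}_{group}"] = [
            m.group(group) for m in matches if m.group(group) is not None
        ]
    return columns


# Simplified, hypothetical talk-page text with two wikiproject templates.
text = "{{WikiProject Biology|class=B}} {{WikiProject Chemistry|class=Start}}"
pattern = r"\{\{WikiProject (?P<project>[^|}]+)\|class=(?P<rating>\w+)\}\}"
columns = match_columns("wikiproject", pattern, text)

# Assumption: list columns are written out as JSON arrays.
encoded = {name: json.dumps(value) if isinstance(value, list) else value
           for name, value in columns.items()}
# A downstream program can recover the list with any JSON parser.
decoded_projects = json.loads(encoded["wikiproject_project"])
```

Because the match column is built from `bool(matches)`, it is still produced for patterns with no named groups, and the list columns stay empty when the pattern does not match.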
This section provides some tips for implementing this feature.
- Wikiq currently uses argparse to parse command line arguments. Since the arguments we will add should accept multiple values, they should be added to our parser using
- See named capture groups in the Python `re` library for details on how to extract the names of the groups and their values given a match.
- There is more than one reasonable way to output a list type in a TSV. I (nate) suggest outputting JSON arrays inside double quotes (""). See this stack overflow example of how this format can be read in pandas. Since JSON is so widespread, it's likely that users will be able to handle it easily.
- Add unit tests in the `test` folder. Unit tests should test new functions as well as the