Editing Wikiq
Latest revision | Your text | ||
Line 1: | Line 1: | ||
[https://code.communitydata.cc/mediawiki_dump_tools.git Wikiq] is our tool for building tabular datasets from raw MediaWiki edit data. MediaWiki outputs XML dump files, but these files are not so easy to work with, particularly because they contain the full text of every revision to every page. This makes it quite computationally expensive to process large wikis and leads to other technical problems. Wikiq efficiently processes MediaWiki XML dumps to produce much smaller datasets that only contain variables that will be useful in our research. [[User:Groceryheist|Nate]] is working this summer on improvements to wikiq. Let him know if you have any requests! | [https://code.communitydata.cc/mediawiki_dump_tools.git Wikiq] is our tool for building tabular datasets from raw MediaWiki edit data. MediaWiki outputs XML dump files, but these files are not so easy to work with, particularly because they contain the full text of every revision to every page. This makes it quite computationally expensive to process large wikis and leads to other technical problems. Wikiq efficiently processes MediaWiki XML dumps to produce much smaller datasets that only contain variables that will be useful in our research. [[User:Groceryheist|Nate]] is working this summer on improvements to wikiq. Let him know if you have any requests! | ||
== | == Pattern matching == | ||
We | We are currently working on adding a general-purpose pattern matching feature to wikiq. [[Wikiq Pattern Matching | Design Doc]] | ||
See Also: [[CommunityData:Dataset_And_Tools_Release_2018]] | See Also: [[CommunityData:Dataset_And_Tools_Release_2018]] | ||
Line 10: | Line 10: | ||
Wikiq is a python3 program with dependencies. To run on Hyak, for now, you will need to install the dependencies using | Wikiq is a python3 program with dependencies. To run on Hyak, for now, you will need to install the dependencies using | ||
<code> pip install --user mwxml pandas git+https://github.com/mediawiki-utilities/python-mwpersistence.git mediawiki-utilities pymysql </code> | <code> pip install --user mwxml pandas git+https://github.com/mediawiki-utilities/python-mwpersistence.git mediawiki-utilities pymysql </code> | ||
== Command Line Arguments == | == Command Line Arguments == | ||
Line 23: | Line 24: | ||
<code>-n, --namespace-include</code> ID of a namespace to include. Can be specified more than once. For some wikis (e.g., large Wikipedias), computing persistence for the project namespace can be extremely slow. | <code>-n, --namespace-include</code> ID of a namespace to include. Can be specified more than once. For some wikis (e.g., large Wikipedias), computing persistence for the project namespace can be extremely slow. | ||
== Codebook == | == Codebook == | ||
The current version of wikiq provides one row for each edit (unless --collapse-user is passed, in which case each row corresponds to consecutive edits by the same editor) | The current version of wikiq provides one row for each edit (unless --collapse-user is passed, in which case each row corresponds to consecutive edits by the same editor): | ||
anon articleid collapsed_revs date_time deleted editor editor_id minor namespace revert reverteds revid sha1 text_chars title token_revs tokens_added tokens_removed tokens_window | anon articleid collapsed_revs date_time deleted editor editor_id minor namespace revert reverteds revid sha1 text_chars title token_revs tokens_added tokens_removed tokens_window | ||
Line 92: | Line 73: | ||
<code>collapsed_revs</code> : The number of consecutive revisions the editor made that have been collapsed into the row. | <code>collapsed_revs</code> : The number of consecutive revisions the editor made that have been collapsed into the row. | ||
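As an illustration of how the codebook columns fit together, here is a minimal sketch that reads wikiq-style output and summarizes it. It rests on several assumptions not confirmed on this page: that the output is tab-separated with a header row, that boolean columns such as <code>revert</code> and <code>anon</code> are written as <code>True</code>/<code>False</code>, and that <code>collapsed_revs</code> is present (i.e. <code>--collapse-user</code> was passed). The sample rows and the helper names <code>load_rows</code> and <code>summarize</code> are hypothetical.

```python
import csv
import io

# Hypothetical sample using a subset of the codebook columns.
# Real wikiq output may differ in column order and value encoding.
SAMPLE_TSV = (
    "revid\teditor\tanon\trevert\ttext_chars\tcollapsed_revs\n"
    "101\tAlice\tFalse\tFalse\t1200\t1\n"
    "102\tBob\tFalse\tTrue\t1100\t3\n"
    "103\t\tTrue\tFalse\t1250\t1\n"
)


def load_rows(source):
    """Read tab-separated rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(source, delimiter="\t"))


def summarize(rows):
    """Count rows, underlying revisions, reverts, and anonymous rows."""
    return {
        "rows": len(rows),
        # With --collapse-user, one row can stand for several revisions.
        "revisions": sum(int(r["collapsed_revs"]) for r in rows),
        # Assumes booleans are serialized as the strings "True"/"False".
        "reverts": sum(r["revert"] == "True" for r in rows),
        "anon_rows": sum(r["anon"] == "True" for r in rows),
    }


rows = load_rows(io.StringIO(SAMPLE_TSV))
summary = summarize(rows)
print(summary)
```

To use this on a real file, pass an open file handle to <code>load_rows</code> instead of the in-memory sample, and check the header row first to confirm which columns your run actually produced.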
== Bugs == | == Bugs == |