==Dataset status==

We have a dataset of 3704 wikis! The tsv files are all in /com/projects/messagewalls/wikiq_output. This dataset is pretty large, and compiling all the wikiq tsvs is a computationally intense process, so I recommend doing it on hyak. The wiki-week level dataset is smaller (about 41,000 lines), so you could work with that locally if you prefer; for fitting models you'll still want to use hyak. I recommend checking out the git repository on hyak, if you have not already, and stepping through 01_build_edit_weeks.R with RStudio to build the dataset.

I updated 01_build_edit_weeks.R to use wikiList.3.csv (if this file doesn't exist, it falls back to wikiList.2.csv). The wikiList.3.csv file is wikiList.2.csv with two new columns. The first, FoundInWikiqLog, is just a sanity check indicating that we expect to have a wikiq tsv for that wiki. The other column, error, indicates whether wikiq threw an error. There were 16 wikiq errors, and I didn't see any new kinds of errors from the scrapes. The errors are all missing </mediawiki> tags, which indicate a truncated dump. This is what I expected: Wikiteam's problems are usually on bigger wikis.

That said, I have not looked closely at the data, and there might be strange things happening. It would be great for us all to spend some time exploring the data (not just the wiki-weeks, but also the edit-level data) to find the weird, surprising, and broken things that we know are in there. Mako suggested putting together a wiki page to document the gotchas that arise in Wikia data. Sneha and I should chat about updating the data flow diagram with the updated scripts.

Salt, when you get a chance, you might want to take a look at my changes in git to see how I solved the problem with the scrapes. The scrape 7z archives had more files in them besides the xml dump; modifying wikiq to read only the xml file did the trick. When it comes to processing the error logs, see filterWikiqErr.py. I like to keep track of errors in the wikilist instead of dropping data.
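For orientation, here is a minimal sketch of the wikiList fallback and the kind of wiki-week rollup that 01_build_edit_weeks.R performs. This is not the actual script: the tsv column names (date_time, editor), the example tsv filename, and the exact contents of the error column are assumptions based on the description above.

<syntaxhighlight lang="R">
library(data.table)
library(lubridate)

## prefer wikiList.3.csv; fall back to wikiList.2.csv if it is missing
wiki.list.file <- if (file.exists("wikiList.3.csv")) "wikiList.3.csv" else "wikiList.2.csv"
wiki.list <- fread(wiki.list.file)

## errors are tracked in the wiki list rather than dropped, so just tally them
if ("error" %in% names(wiki.list)) {
    print(wiki.list[, .N, by = error])
}

## rough wiki-week rollup for a single (hypothetical) wikiq tsv
edits <- fread("/com/projects/messagewalls/wikiq_output/example.wiki.tsv")
edits[, week := floor_date(as.POSIXct(date_time, tz = "UTC"), unit = "week")]
edit.weeks <- edits[, .(n.edits = .N, n.editors = uniqueN(editor)), by = week]
</syntaxhighlight>

The real script loops over every tsv in the output directory and stacks the per-wiki rollups into the ~41,000-line wiki-week dataset; step through it in RStudio on hyak to see the details.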