CommunityData:Message Walls Code



We are adapting the code that Mako and Aaron worked on for the AnonEdits paper. Here is the status of the code that should work with the Message Walls dataset:

Ready
lib-01-build_wiki_list-stage1.R

01-build_edit_weeks.R

On Deck
Filter out bot/admin edits in wikiweeks.

Get questions from wikiq.

Get blocked users from AP

Things to find out
How much variance is there in the edit distributions among Wikia wikis? Estimate the critical points of the edit distributions.

Principles for defining outcomes

 * The definition of experience or newcomer status should not depend on anything that Message Walls (MW) changes

Defining newcomers
Nate and Sneha are arguing about this! There are two options we've considered for defining newcomer edits:


 * Any edit made by a user who made their first edit less than m months ago is a 'newcomer edit'
 * Any edit made by any editor who has made fewer than n edits by the cutoff date is a 'newcomer edit'

We've reached a cautious consensus on taking the intersection of these two measures to determine newcomer status: an edit counts as a newcomer edit only if it meets both criteria.
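
The intersection rule can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function name, the choice of a reference date for measuring tenure, expressing m in days, and the threshold values in the example are all assumptions.

```python
from datetime import date, timedelta


def is_newcomer_edit(first_edit_date, reference_date, edits_by_cutoff,
                     m_days, n_edits):
    """Return True only when BOTH newcomer criteria hold (the intersection).

    m_days and n_edits are hypothetical stand-ins for the m and n thresholds.
    """
    # Criterion 1: the editor's first edit is recent (tenure below m).
    short_tenure = (reference_date - first_edit_date) < timedelta(days=m_days)
    # Criterion 2: the editor made fewer than n edits by the cutoff date.
    few_edits = edits_by_cutoff < n_edits
    return short_tenure and few_edits


# Made-up thresholds: m = 30 days, n = 10 edits.
# Recent first edit and few edits: meets both criteria.
print(is_newcomer_edit(date(2012, 1, 1), date(2012, 1, 15), 3, 30, 10))
# Few edits but a year of tenure: fails the first criterion.
print(is_newcomer_edit(date(2011, 1, 1), date(2012, 1, 15), 3, 30, 10))
```

Taking the intersection is the stricter choice: an editor with a long tenure or a high edit count is treated as a veteran even if they satisfy the other criterion.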

Variables

 * For each wiki week:
 * total number of edits made to talk pages/message walls -- DONE
 * number of edits made to talk pages/message walls by 'new users' -- DONE
 * exclude blocked folks from newcomers
 * number of reverts / reverted newcomers
 * number of edits made to talk pages/message walls by 'veteran users' -- DONE
 * number of edits made by a newcomer on a veteran's talk page/message wall, or vice versa
 * total number of questions asked on talk pages/message walls
 * number of questions asked by newcomers
 * total number of edits made to article pages -- DONE
 * number of edits made to article pages by newcomers -- DONE
 * number of edits made to article pages by veterans -- DONE
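
As a rough illustration of how the edit counts above roll up to wiki weeks, here is a minimal Python sketch. It is not the logic in the R build script: the record fields (`wiki`, `week`, `page_type`, `is_newcomer`) are hypothetical, and it assumes each edit has already been labelled as a talk/message wall edit or an article edit and as newcomer or veteran.

```python
from collections import Counter


def count_wiki_weeks(edits):
    """Aggregate edit-level records into per (wiki, week) counts.

    Each edit is a dict with hypothetical keys:
      wiki, week, page_type ("talk" or "article"), is_newcomer (bool).
    """
    counts = {}
    for edit in edits:
        key = (edit["wiki"], edit["week"])
        week_counts = counts.setdefault(key, Counter())
        group = "newcomer" if edit["is_newcomer"] else "veteran"
        # One total per page type, plus a newcomer/veteran breakdown.
        week_counts[edit["page_type"] + "_total"] += 1
        week_counts[edit["page_type"] + "_" + group] += 1
    return counts


# Made-up example data for a single wiki week.
example_edits = [
    {"wiki": "examplewiki", "week": 1, "page_type": "talk", "is_newcomer": True},
    {"wiki": "examplewiki", "week": 1, "page_type": "talk", "is_newcomer": False},
    {"wiki": "examplewiki", "week": 1, "page_type": "article", "is_newcomer": True},
]
wiki_weeks = count_wiki_weeks(example_edits)
print(dict(wiki_weeks[("examplewiki", 1)]))
```

The remaining variables (questions, reverts, blocked users) would need extra inputs that this sketch does not model.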

Wiki level criteria
Check what the data look like before implementing these.


 * Wikis must be at least 4 weeks old at the cutoff date
 * Wikis must have no edits to ns3 (the user talk namespace) after the cutoff (indicating that the transition happened)
 * Wikis must have at least 1 contribution in at least 70% of the weeks in the entire study period
 * Wikis must have contributions from at least 2 different users during the entire study period
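
The four criteria above could be combined into a single filter along these lines. This is a sketch under assumptions: the function and its argument names are hypothetical, and it assumes the per-wiki summaries (age at cutoff, ns3 edits after the cutoff, weekly edit counts, number of distinct users) have already been computed.

```python
def wiki_is_eligible(age_weeks_at_cutoff, ns3_edits_after_cutoff,
                     weekly_edit_counts, n_users):
    """Apply the four wiki-level inclusion criteria to one wiki.

    weekly_edit_counts holds one edit count per week of the study period.
    """
    total_weeks = len(weekly_edit_counts)
    if total_weeks == 0:
        return False
    active_weeks = sum(1 for count in weekly_edit_counts if count > 0)
    return (
        age_weeks_at_cutoff >= 4                      # at least 4 weeks old
        and ns3_edits_after_cutoff == 0               # transition happened
        and active_weeks * 100 >= 70 * total_weeks    # active in >= 70% of weeks
        and n_users >= 2                              # at least 2 distinct users
    )


# Made-up wiki: 10 study weeks, 7 of them with at least one edit.
print(wiki_is_eligible(12, 0, [3, 0, 1, 2, 0, 5, 1, 0, 4, 2], 5))
```

The 70% check uses integer arithmetic so that a wiki sitting exactly on the threshold is not dropped by floating-point rounding.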

Dealing with cutoffs

 * For each wiki, use the first cutoff date. Exclude the wiki if message walls were turned off for more than 10 minutes during the 8 weeks after that date.
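
A minimal sketch of the cutoff rule follows. The function name and the representation of outages as (start, end) pairs are hypothetical, and the sketch assumes that "more than 10 minutes" applies to each off period separately rather than to the cumulative time off, which the rule above does not specify.

```python
from datetime import datetime, timedelta


def keep_wiki(first_cutoff, off_periods,
              window=timedelta(weeks=8), max_off=timedelta(minutes=10)):
    """Return False if message walls were off too long after the first cutoff.

    off_periods is a list of (start, end) datetimes when message walls
    were turned off. Assumption: each off period is judged on its own.
    """
    window_end = first_cutoff + window
    for start, end in off_periods:
        # Only count the part of the off period inside the 8-week window.
        clipped_start = max(start, first_cutoff)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start and clipped_end - clipped_start > max_off:
            return False
    return True


# Made-up dates: a 5-minute blip is tolerated, a 2-hour outage is not.
cutoff = datetime(2012, 3, 1)
blip = [(datetime(2012, 3, 10, 12, 0), datetime(2012, 3, 10, 12, 5))]
outage = [(datetime(2012, 3, 10, 12, 0), datetime(2012, 3, 10, 14, 0))]
print(keep_wiki(cutoff, blip), keep_wiki(cutoff, outage))
```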

Dataset status
We have a dataset of 3704 wikis!

The tsv files are all in /com/projects/messagewalls/wikiq_output. This dataset is pretty large, and compiling all the wikiq tsvs is a computationally intense process, so I recommend doing this on hyak. The wiki-week level dataset is smaller, about 41,000 lines, so you could work with that locally if you prefer. For fitting models you'll still want to use hyak.

I recommend checking out the git repository on hyak if you have not already and stepping through 01_build_edit_weeks.R with RStudio to build the dataset.

I updated 01_build_edit_weeks.R to use wikiList.3.csv (if this file doesn't exist it falls back to wikiList.2.csv). The wikiList.3.csv file is wikiList.2.csv with 2 new columns. The first of these is FoundInWikiqLog, which is just a sanity check that indicates that we expect to have a wikiq tsv for that wiki. The other column is error, which indicates whether wikiq threw an error. There were 16 wikiq errors. I didn't see any new kinds of errors from the scrapes. The errors are all missing tags that indicate a truncated dump. This is what I expected. Wikiteam's problems are usually on bigger wikis.
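
For readers who want to replicate the fallback outside R, here is a small Python illustration. The real logic lives in 01_build_edit_weeks.R; the function name and the directory argument here are hypothetical.

```python
from pathlib import Path


def choose_wiki_list(directory):
    """Prefer wikiList.3.csv; fall back to wikiList.2.csv if it is missing."""
    directory = Path(directory)
    preferred = directory / "wikiList.3.csv"
    if preferred.exists():
        return preferred
    return directory / "wikiList.2.csv"


# Example: look in the current directory.
print(choose_wiki_list(".").name)
```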

That said, I have not looked closely at the data and there might be strange things happening. It would be great for us all to spend some time exploring the data (not just the wikiweeks, but also the edit level data) to find the weird, surprising, and broken things that we know are in there. Mako suggested putting together a wiki page to document the gotchas that arise in Wikia data.

Sneha and I should chat about updating the data flow diagram with the updated scripts.

Salt, when you get a chance you might want to take a look at my changes in Git to see how I solved the problem with the scrapes. The scrape 7z archives had more files in them besides the xml dump. Modifying wikiq to read only the xml file did the trick. When it comes to processing the error logs, see filterWikiqErr.py. I like to keep track of errors in the wikilist instead of dropping data.
