CommunityData:Message Walls Code

[[File:Messagewalls_code_diagram.png|200px|thumb|right|Dataflow diagram for this project.]]
We are adapting the code that Mako and Aaron worked on for the AnonEdits paper.
Here is the status of code that should work with the [[Message Walls]] dataset:


== Ready ==


== On Deck ==
 
* New code to use the Wikia API to map wikis in the TSV to Wikiteam dumps.
* Filter out bot/admin edits in wikiweeks.
* Get questions from wikiq.
* Get blocked users from the API.
 
== Things to find out ==
* How much variance is there in the edit distributions among Wikia wikis?
* Estimate critical points of the edit distributions. How much variation is there?
 
 


== List of Variables to build ==


 
===Principles for defining outcomes===
* Experience or newcomer definition should not depend on something MW changes
 
===Defining newcomers===
Nate and Sneha are arguing about this! There are two options we've considered for defining newcomer edits:
 
* Any edit made by a user who made their first edit ''m'' months ago is a 'newcomer edit'
* Any edit made by any editor who has made less than ''n'' edits by the cutoff date is a 'newcomer edit'
 
We've reached a cautious consensus in taking the intersection of these two measures as a determinant of newcomer status.
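The intersection rule can be sketched as follows. This is only an illustration: the function name, the thresholds ''m'' and ''n'', and the 30-days-per-month approximation are all assumptions, since the text above does not fix specific values.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds -- the values of m and n are not settled above.
M_MONTHS = 2   # first edit within the last m months
N_EDITS = 10   # fewer than n edits by the cutoff date

def is_newcomer(first_edit_date, edit_count_at_cutoff, cutoff_date,
                m_months=M_MONTHS, n_edits=N_EDITS):
    """Intersection of the two candidate definitions: an editor counts
    as a newcomer only if their first edit is recent AND their edit
    count at the cutoff date is still small."""
    recent = first_edit_date >= cutoff_date - timedelta(days=30 * m_months)
    low_count = edit_count_at_cutoff < n_edits
    return recent and low_count
```

Taking the intersection is conservative: an editor who satisfies only one definition (e.g., a recent account that already has hundreds of edits) is not counted as a newcomer.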
 
=== Variables ===
 
 
*For each wiki week:
**total number of edits made to talk pages/message walls -- DONE
**number of edits made to talk pages/message walls by 'new users' -- DONE
**exclude blocked folks from newcomers
**number of edits made to talk pages/message walls by 'veteran users' -- DONE
**number of reverts / reverted newcomers
**number of edits made by a newcomer on a veteran talk page/message wall, or vice versa
**total number of questions asked on talk pages/message walls
**number of questions asked by newcomers
**total number of edits made to article pages -- DONE
**number of edits made to article pages by newcomers -- DONE
**number of edits made to article pages by veterans -- DONE
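The per-wiki-week counts above could be computed along these lines. This is a sketch, not the project's actual code: the function names, the tuple layout, the week origin date, and the choice of talk namespaces (1 and 3) are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

def wiki_week(timestamp, origin=datetime(2010, 1, 1)):
    """Index a timestamp into a week number relative to a fixed origin.
    The origin date here is an assumption, not from the project code."""
    return (timestamp - origin).days // 7

def count_edits_per_week(edits):
    """edits: iterable of (wiki, timestamp, namespace) tuples.
    Returns a Counter keyed by (wiki, week) counting talk-page edits;
    restricting to namespaces 1 and 3 is an illustrative choice."""
    talk_namespaces = {1, 3}
    counts = Counter()
    for wiki, ts, ns in edits:
        if ns in talk_namespaces:
            counts[(wiki, wiki_week(ts))] += 1
    return counts
```

The same loop extends to the other variables by adding predicates (newcomer vs. veteran, question vs. non-question, revert vs. non-revert) and keeping one counter per variable.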
 
==Preliminary inclusion criteria==
 
=== Wiki level criteria ===
 
'''Check what data looks like before implementing these'''
 
* Wikis must be at least 4 weeks old at the cutoff date
* Wikis must have no edits to ns3 after the cutoff (indicating that the transition happened)
* Wikis must have at least 1 contribution in at least 70% of the weeks in the entire study period
* Wikis must have contributions from at least 2 different users during the entire study period
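A minimal sketch of how the four wiki-level criteria might combine, assuming the per-wiki summaries (first edit date, ns3 edit count after cutoff, active-week counts, contributor count) have already been computed; none of these argument names come from the project's actual scripts.

```python
from datetime import datetime

def include_wiki(first_edit, cutoff, ns3_edits_after_cutoff,
                 weeks_with_edits, total_weeks, n_contributors):
    """Apply the wiki-level inclusion criteria listed above.
    All arguments are hypothetical precomputed summaries."""
    old_enough = (cutoff - first_edit).days >= 28           # at least 4 weeks old at cutoff
    transitioned = ns3_edits_after_cutoff == 0              # no ns3 edits after the cutoff
    active_enough = weeks_with_edits / total_weeks >= 0.70  # edits in >= 70% of weeks
    multi_user = n_contributors >= 2                        # at least 2 distinct users
    return old_enough and transitioned and active_enough and multi_user
```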
 
===Dealing with cutoffs===
* For each wiki, use the first cutoff date. Exclude the wiki if message walls were turned off for more than 10 minutes during the 8 weeks after that date.
 
==Dataset status==
 
We have a dataset of 3704 wikis!
 
The TSV files are all in /com/projects/messagewalls/wikiq_output. This dataset is pretty large, and compiling all the wikiq TSVs is a computationally intensive process, so I recommend doing this on hyak. The wiki-week level dataset is smaller -- about 41,000 lines -- so you could work with that locally if you prefer. For fitting models you'll still want to use hyak.
 
I recommend checking out the git repository on hyak if you have not already and stepping through 01_build_edit_weeks.R with RStudio to build the dataset.
 
I updated 01_build_edit_weeks.R to use wikiList.3.csv (if that file doesn't exist, it falls back to wikiList.2.csv). The wikiList.3.csv file is wikiList.2.csv with 2 new columns. The first, FoundInWikiqLog, is a sanity check indicating that we expect to have a wikiq TSV for that wiki. The other column, error, indicates whether wikiq threw an error. There were 16 wikiq errors. I didn't see any new kinds of errors from the scrapes; the errors are all missing </mediawiki> tags, which indicate a truncated dump. This is what I expected, since Wikiteam's problems are usually on bigger wikis.
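The fallback logic described above (prefer wikiList.3.csv, otherwise wikiList.2.csv) looks roughly like this in Python; the actual implementation is in R inside 01_build_edit_weeks.R, so the function name and directory argument here are made up for illustration.

```python
import csv
import os

def load_wiki_list(data_dir="."):
    """Prefer wikiList.3.csv and fall back to wikiList.2.csv,
    mirroring the fallback described above. Paths are assumptions."""
    for name in ("wikiList.3.csv", "wikiList.2.csv"):
        path = os.path.join(data_dir, name)
        if os.path.exists(path):
            with open(path, newline="") as f:
                return list(csv.DictReader(f))
    raise FileNotFoundError("no wikiList csv found in " + data_dir)
```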
 
That said, I have not looked closely at the data and there might be strange things happening. It would be great for us all to spend some time exploring the data (not just the wikiweeks, but also the edit-level data) to find the weird, surprising, and broken things that we know are in there. Mako suggested putting together a wiki page to document the gotchas that arise in Wikia data.
 
Sneha and I should chat about updating the data flow diagram with the updated scripts.
 
Salt, when you get a chance you might want to take a look at my changes in git to see how I solved the problem with the scrapes. The scrape 7z archives contained other files besides the XML dump; modifying wikiq to read only the XML file did the trick. For processing the error logs, see filterWikiqErr.py. I like to keep track of errors in the wikilist instead of dropping data.
 