Editing Wiki language research

Project management wiki for Darren, Jim (and Aaron's) cultural peer production research.

== Action Items ==

* write rough draft of preliminary findings for 6/21

=== Undergrads ===
'''Bennett'''
* continue hand-coding samples
* take notes on interesting patters (in notes document on spreadsheets)
* note if names of bots are included in "list of bots"

== Summer Schedule ==
* Week 1: Write initial analysis, get google doc - latex pipeline setup
* Week 2-3: Flag bot edits, pull new samples for coding based on updated percentiles, write new draft of analysis
* week 3-6: Develop hypotheses and run analysis
** Cross cultural deliberative practices
** Discussion structure
** Something else?
** more qualitative coding + bot spot checks
* Week 7-CHI Deadline: writing

== meeting logs & notes ==
=== 05-22-16 ===
DG: I spent some time looking at the data distributions and ran a bunch of models on the simple EN models overnight. The data for len_1 are reallllly long-tailed with very low frequencies -- this is causing the convergence issues. Below is a table of the simple model (len_1 ~ num_editors_1), run through a series of truncated data sets. The models will converge all the way up to removing the final data point out of the 4,077,819 data points we have. In other words, I was able to get convergence by dropping a single data point. Here's a quick table of the results from running the models: 

{|
! style="text-align:left;"| RESOLVES
! CUTOFF_LEN_1  
! PERCENT_DATA_COVERED  
! NUM_EDS1_COEFF  
! NUMEDS1_SE  
! ALPHA  
! FREQ_COUNTS_FOR_CUTOFF_BIN  
! NOTES
|-
|Yes
|50
|97.81%
|0.1392615
|0.0000551
|0.05119
|1,993	
|-
|Yes	
|500
|99.85%
|0.1028963
|0.0000536
|0.2065
|19	
|-
|Yes
|5000
|99.99%
|0.0913293
|0.0000536
|0.2743573
|1	
|-
|Yes
|10000
|99.99846%
|0.0903963
|0.0000536
|0.2806345
|1
|missing only 63 data points (4077756/4077819)
|-
|Yes
|50000
|99.999975%
|0.0896678
|0.0000536
|0.2858428
|1
|missing only 1 data point
|-
|No
|no cutoff
|100%
|.
|.
|.
|1
|cannot compute an improvement, discontinuous region encountered
|}

So, there are a few things we should take note of when looking at the data here: 
* len_1 ranges in this data set from: 2 - 133,529
* as the cutoff increases (i.e., more data included in the model) the coeff for editors decreases (to be expected since we have longer tail) and the SE stabilizes around .0000536. 
* alpha also increases (as expected)
* the last data point is len_1 = 133,529; but if you look at the tabs you get to 99.99% of the data at len_1 ~= 2500, and the last 5 len_1's are 28364, 30313, 34390, 44586, and then 133529. that's likely causing all kinds of havoc with numerous things (e.g., dispersion calculations, covariance and corr issues, etc.)

   



=== 05-16-16 ===
Jim, Darren, and Aaron discussed the introduction, related work, and research contribution for the CSCW paper.  Related work: get away from lit review style of writing (for instance "we know that temporal issues are important (cite)", don't use authors as subjects, end each section for warent for our work;  Discussion/Contribution: make a bulleted list of potential contributions for the paper.

=== 05-11-16 ===
Darren and Jim discussed the background section of the paper.
=== 05-10-16 ===
Aaron and Jim discussed the background section of the paper.  Key takeaways: write the Durkhiem section more concisely and explicitly show the connection to our study, change the en:wp to be less lit-review-y and just show how it connects to ideas and expectations in the paper (also mention that it informs methods), cut a lot of "Applications to Organizations, Communities, and Social Movements", think of using Omnipedia earlier on because it shows that culture is acting as structuring (for content).  Generally, think about building up conceptual relationships needed to understand the paper.  Try writing a one sentence description of what each section is supposed to do in the paper.
=== 05-04-16 ===
Jim and Darren discussed a qualitative coding plan for the undergrads, and writing goals for the coming week.  Jim showed Darren the location of data files on Jakku for future data analysis.
=== 04-27-16 ===
Jim and Darren discussed the overall outline of the paper and the framing of the background section.  Specifically: 1) write the intro using the "the world is increasingly globalized" framing, 2) first include previous work on english wikipedia (especially on collaboration and bias, e.g. keegan, kittur, shaw, wagner, bryant), 3) then include small studies hyperlingual studies and studies on hyperlingual content (e.g. hecht, herring, hara), and 3) write about gap in previous lit (i.e. difference in collaboration practices)
=== 04-22-16 ===
created coding scheme in order to describe structure of talk pages
=== 04-20-16 ===
discussed project balancing and timeline, goals for theoretical framing, and qualitative coding

== project resources & links ==
'''05-16-16'''

[https://docs.google.com/document/d/1zwajJca5RCYEtB5WXpfeGlg4zsN6YyC9tSFDhBmEJnw/edit?usp=sharing CSCW 2017 Rough Draft 2]

'''05-04-16'''

[https://docs.google.com/document/d/1V6IGcRqkKRtKD2qWa0Nhh9nLcwmUslaiRIgIL_Afei4/edit?usp=sharing CSCW 2017 Rough Daft]

[https://docs.google.com/document/d/1RGFn2rHcW-9H_WZlpDT4XCKfWnQHWZOzAxVsevpP_Ko/edit?usp=sharing Data Documentation]

[https://docs.google.com/document/d/1T-pA_J1Qcn3dJTSFVBjBVjlHQvZjGllKF4yh6KmgNSc/edit?usp=sharing Coding Notes]

'''04-27-16'''

[https://docs.google.com/document/d/1Zkdl0Mrpfb-0yaGOarMnXmOwjFjh2XL1diDgtPU_vHs/edit?usp=sharing Wikipeida Talk Page Codebook]

[https://docs.google.com/document/d/1bOCwwbiEYRsqjK_6LaYKwYndqjb8twQ7UrqNjD4_to8/edit?usp=sharing CSCW Hyperling Collaboration Outline]

== notes ==
@@ Line 3: / Line 3: @@
 == Action Items ==
-* eliminate bots from sample
+* write rough draft of preliminary findings for 6/21
-* find percent change for talk edits at median values of coefs for each language.
 === Undergrads ===
@@ Line 15: / Line 14: @@
 * Week 1: Write initial analysis, get google doc - latex pipeline setup
 * Week 2-3: Flag bot edits, pull new samples for coding based on updated percentiles, write new draft of analysis
-* Week 3-6: Develop hypotheses and run analysis
+* week 3-6: Develop hypotheses and run analysis
 ** Cross cultural deliberative practices
 ** Discussion structure
@@ Line 23: / Line 22: @@
 == meeting logs & notes ==
-=== 8-16-16 ===
-JM: Processing the ML corpus with the talk page aggregation fixes.
-=== 8-9-16 ===
-JM: Working on re-processing the ML corpus to account for archived talk pages across the language editions.
-All: Read up on Edgar Schein's organizational culture work and think about the ways in which it might apply to the ML communities and what the org cultures are that exist within the different language editions.
-DG: Work on getting models of marginals across the quantiles
-=== 7-8-16 ===
-Considering small groups and small teams / workgroup literature.
-Next steps: Read through the previous papers on wikis that are English, and see what the general results and findings are. Then use our corpus to see how much of this holds across language editions, and how much of this seems to be more uniquely English oriented.
-=== 06-28-16 ===
-Thinking about ways to do small group analyses within the various language editions. Consider different approaches to network construction here. It would also be good to think about this with respect to things like sequential data analysis techniques like chain revision graphs, etc.
-=== 06-21-16 ===
-JM, DG, and AS discussed approach for communicating findings and for next steps.  We might look into orgs, coordination, and collective intelligence lit.
 === 05-22-16 ===
 DG: I spent some time looking at the data distributions and ran a bunch of models on the simple EN models overnight. The data for len_1 are reallllly long-tailed with very low frequencies -- this is causing the convergence issues. Below is a table of the simple model (len_1 ~ num_editors_1), run through a series of truncated data sets. The models will converge all the way up to removing the final data point out of the 4,077,819 data points we have. In other words, I was able to get convergence by dropping a single data point. Here's a quick table of the results from running the models:
@@ Line 140: / Line 114: @@
 == project resources & links ==
-'''07-28-16'''
-[https://docs.google.com/document/d/1DKaq6uZdMFiqmJoxgLIKCNTFOvUjmawNrcjec9P1ZHs/edit?usp=sharing CHI 2016 Rough Draft]
 '''05-16-16'''