CommunityData:Exposure and Participation Processes

Summary
There are a number of social computing theories that focus on why people do (or don't) join online groups. For the most part, these theories focus on individuals deciding whether to join an individual group, and are typically validated at the individual level.

On the other hand, there is much interest in understanding the dynamics of online communities and populations of communities. These communities have extremely skewed levels of participation, with a few communities garnering the vast majority of the attention and contributions. New, ecological approaches to understanding these dynamics focus on competition and mutualism between groups. While they have been partially successful in explaining group outcomes, these approaches rarely consider how dynamics between groups emerge.

We suggest that competition, mutualism, and eventual online community outcomes are the result of how individuals are exposed to communities and how they decide whether to participate. We build a series of agent-based simulations that act as a bridge between these levels of analysis and let us explore the implications of different theories of how people are exposed to and join groups.

Next steps (9 Jan)

 * Remove references to participation rate (mostly in results and discussion)
 * Create new plots without participation rate but with reddit (and Wikia? Github?) data
 * Figure out how to scale the reddit data so that it fits on the histogram
 * How to display power law and nearly normal distributions on the same plot? Maybe only put the reddit data in the final plots?
 * Review and edit front end
 * Change from "ecology" to "systems"
 * Do more to lead into our analysis
 * Create hypotheses / propositions?
 * Explain what we are looking for (asymmetry plus superstar communities?)
 * Do more to define terms from the network literature
 * Claim engagement w/network sci + sociology of cumulative advantage as contributions.
 * Realized that this data has startup costs = 0; run new simulations with positive startup costs and replace current visualizations?

Visualization and testing
I have struggled with figuring out how to visualize the simulations effectively and to make a convincing argument about the data.

There are a few fundamental problems that have made this more difficult.


 * 1) The scales are very different. We are looking at ~200 communities and 9K people, compared to ~78K communities and ~3.5M people on reddit in a given month. I've dealt with this by sampling and then working with ratios rather than raw values: for example, sample 200 subreddits and then visualize the ratio posts/sum(posts). Another approach I've considered is sampling 200 subreddits and then selecting only the users who have posted in those subreddits. I haven't done this yet, but it would obviously lead to a much sparser distribution of subreddits per person and might not represent the true distribution well.
 * 2) This leads to a related problem. Data from reddit only captures people who actually decided to participate in at least one community. The full population of people who could participate is obviously much larger. A simulation that represents this reality should probably have many, many more people than our current simulation, and most of them should be non-participants. This also makes visualization difficult. So far, I have been removing the non-participants so that the comparison starts at the same place for both subreddits and simulated communities, but should we?
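
The sampling-and-ratio idea in point 1 can be sketched as follows. This is a minimal illustration, not the code used for the current plots: the function name `sample_shares`, the dictionary layout, and the toy post counts are all assumptions made for the example.

```python
import random

def sample_shares(posts_by_community, k, seed=0):
    """Sample k communities and return each one's share of the sampled posts."""
    rng = random.Random(seed)
    # Sort the names first so the sample is reproducible for a given seed.
    names = rng.sample(sorted(posts_by_community), k)
    total = sum(posts_by_community[name] for name in names)
    if total == 0:
        raise ValueError("sampled communities have no posts")
    # posts / sum(posts), so simulated and reddit data share a 0-1 scale.
    return {name: posts_by_community[name] / total for name in names}

# Toy, made-up post counts (hypothetical, not reddit data).
toy_counts = {"a": 900, "b": 50, "c": 30, "d": 15, "e": 5}
shares = sample_shares(toy_counts, k=3)
print(shares)
```

Because the shares are recomputed within each sample, they sum to 1 regardless of how many communities the underlying population has. Note that the typical share still depends on the sample size k, so the simulated and empirical samples would need the same k to be comparable.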

Currently, we do no statistical tests and simply show faceted histograms, with one panel for each of the simulation conditions.



Possible improvements:


 * Summary statistic
   * Many of these problems go away if we can come up with a decent summary statistic for a distribution. We could, e.g., choose something like the Gini coefficient. It then becomes much easier to summarize multiple simulations across parameter levels and to compare them to the Gini coefficient observed on reddit.
   * Like any summary statistic, this can be misleading: very different distributions can have the same Gini coefficient.
 * Distribution comparison
   * Something like a K-S test would be even better, but again, the difference in the size of the distributions seems to make this tricky?
 * Quantile measures
   * Nate and I did some work thinking about visualizing not just something like Gini, but the values across a few quantiles as a summary of the shape of a highly skewed distribution.
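
As a rough sketch of what these three options could look like in code, the block below computes a Gini coefficient, the share of posts held by the top fraction of communities, and a two-sample K-S statistic, using only the standard library. The function names, the chosen quantiles, and the toy post counts are assumptions for illustration; none of this is the project's existing analysis code.

```python
import math
from bisect import bisect_right

def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum over the ascending-sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def top_share(values, fraction):
    """Share of the total held by the top `fraction` of communities."""
    xs = sorted(values, reverse=True)
    k = max(1, math.ceil(fraction * len(xs)))
    return sum(xs[:k]) / sum(xs)

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Toy, made-up post counts (hypothetical, not simulation or reddit output).
simulated = [40, 35, 30, 25, 20, 15, 10, 5]
skewed = [900, 50, 20, 10, 8, 6, 4, 2]
for name, counts in [("simulated", simulated), ("skewed", skewed)]:
    shares = [top_share(counts, q) for q in (0.125, 0.25, 0.5)]
    print(name, round(gini(counts), 3), [round(s, 3) for s in shares])
```

The Gini coefficient is unchanged when every value is multiplied by the same constant, which is part of its appeal for comparing a 200-community simulation with a much larger empirical population. The K-S statistic is defined for samples of unequal size, but it compares values directly, so the two samples would first need to be put on a common scale; how best to do that is an open choice here.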

Post hoc additions
Another weakness of the project is that it has an unsatisfying conclusion. The distributions kinda sorta look like what we see empirically but kinda don't.

One possible way forward is to make the argument that these theories partially explain the higher-level dynamics but that we need to add additional complications in order to provide results that are satisfying. Some additions that seem reasonable:


 * Modeling communities as having topics and modeling heterogeneity of interest in the topic
 * Heterogeneity in costs to participate (e.g., representing differences in free time, skills, etc.)
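
To make these two additions concrete, here is a minimal sketch of how they might enter an agent's decision to join. Everything in it is hypothetical: the uniform distributions, the cost range, and the rule "join when interest in the topic exceeds personal cost" are placeholder assumptions, not the decision rules in the current simulations.

```python
import random

def make_agents(n, n_topics, rng):
    """Give each agent a random interest per topic and a personal participation cost."""
    return [
        {
            # Heterogeneous interest: one value in [0, 1) per topic.
            "interest": [rng.random() for _ in range(n_topics)],
            # Heterogeneous cost: stands in for free time, skills, etc.
            "cost": rng.uniform(0.2, 0.9),
        }
        for _ in range(n)
    ]

def would_join(agent, community_topic):
    """Hypothetical rule: join when interest in the topic exceeds personal cost."""
    return agent["interest"][community_topic] > agent["cost"]

rng = random.Random(1)
agents = make_agents(1000, n_topics=5, rng=rng)
# One community per topic; count how many agents would join each.
members_per_topic = [sum(would_join(a, t) for a in agents) for t in range(5)]
print(members_per_topic)
```

With interests drawn identically for every topic, this rule alone would not be expected to produce superstar communities; it only shows where heterogeneity would plug in alongside the existing exposure mechanisms.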