CommunityData:Exposure and Participation Processes
# The scales are very different. We are looking at ~200 communities and 9K people, compared to ~78K communities and ~3.5M people on reddit in a given month. I've dealt with this by sampling and then working with ratios rather than raw values. For example, sample 200 subreddits, and then visualize each subreddit's share of posts, posts/sum(posts). Another approach I've considered is sampling 200 subreddits and then selecting only the users who have posted in those subreddits. I haven't done this yet, but it would lead to a much sparser distribution of subreddits per person, and I'm not sure it would represent the true distribution well.
# This leads to a related problem. Data from reddit only captures people who actually decided to participate in at least one community. The full population of people who could participate is much larger. A simulation that represents this reality should probably have many more people than our current simulation, and most of them should be non-participants. This also makes visualization difficult. So far, I have been removing the non-participants so that the comparison starts at the same place for both subreddits and simulated communities, but should we?
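The two workarounds described above (sampling communities and comparing shares rather than raw counts, and dropping non-participants before comparing) can be sketched roughly as follows. This is only an illustration under assumed data shapes: the function names, the <code>post_counts</code> dictionary (community to number of posts) and the <code>memberships</code> dictionary (person to set of communities) are hypothetical and are not taken from the project's actual code.

```python
import random
from collections import Counter


def post_shares(post_counts, k, seed=0):
    """Sample k communities and return each one's share of the sampled posts.

    post_counts is assumed to map community name -> number of posts.
    """
    rng = random.Random(seed)
    # Sort the names first so the sample is reproducible for a given seed.
    sampled = rng.sample(sorted(post_counts), k)
    total = sum(post_counts[name] for name in sampled)
    if total == 0:
        raise ValueError("sampled communities contain no posts")
    return {name: post_counts[name] / total for name in sampled}


def communities_per_person(memberships, drop_non_participants=True):
    """Proportion of people who participate in 1, 2, 3, ... communities.

    memberships is assumed to map person -> set of communities.
    With drop_non_participants=True, people in zero communities are
    removed before the proportions are computed.
    """
    counts = Counter(len(joined) for joined in memberships.values())
    if drop_non_participants:
        counts.pop(0, None)
    n = sum(counts.values())
    return {size: count / n for size, count in sorted(counts.items())}
```

Because both functions return proportions, the output for ~200 simulated communities and for a 200-subreddit sample is on the same scale. Whether to keep the zero-community group is left as a flag because that question is still open.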
** Something like a K-S test would be even better, but again, the difference in the size of the distributions seems to make this tricky?
* Quantile measures
** Nate and I did some work thinking about visualizing not just something like Gini, but the values across a few quantiles as a summary of the shape of a highly skewed distribution.
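As a rough sketch of the two ideas in this list: the two-sample K-S statistic is the largest gap between two empirical CDFs, so it can be computed for samples of different sizes, and a handful of quantile cut points can summarize the shape of a skewed distribution. The helper names below are hypothetical and this is not the code Nate and I used. It also does not settle whether a K-S test is appropriate here, since with samples as large as the reddit data even small differences may come out significant.

```python
import bisect
import statistics


def ks_statistic(xs, ys):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs.

    The samples may have different lengths, because each empirical CDF is
    normalized by its own sample size.
    """
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    # The gap can only change at observed values, so check each one.
    for v in set(xs) | set(ys):
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def quantile_summary(values, n=10):
    """Return the n - 1 cut points that split values into n equal-probability groups."""
    return statistics.quantiles(values, n=n, method="inclusive")
```

For example, <code>ks_statistic(simulated_counts, reddit_counts)</code> would compare per-person community counts from the two sources directly, and <code>quantile_summary(reddit_counts, n=10)</code> would give the deciles to plot next to a single inequality measure. Both variable names are placeholders.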
=== Post hoc additions ===