CommunityData:Participation Pathways

Overview
This project is interested in exploring how people move between communities in online spaces. In particular, we are interested in identifying patterns in the order in which people participate in given communities.

This order of participation can be represented as a network, where an edge between two communities means that people are likely to move from community A to community B.

These pathways can be used to answer questions about the influence of membership in a community on future behavior, to identify potentially dangerous pathways of radicalization, etc.

Submission Data
I used data on submissions to reddit from the beginning of reddit through 2019 to build a dataset. The procedure looks sequentially at the submissions from each user, and the very first time that a user posts in community j, the edge from the community they posted in most recently (i) to j is incremented.
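The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the function name `count_first_post_edges` and the input format (time-ordered `(user, subreddit)` pairs) are assumptions made for the example.

```python
from collections import Counter

def count_first_post_edges(submissions):
    """Count i -> j edges from time-ordered (user, subreddit) pairs.

    Hypothetical input format: the submissions are assumed to be sorted
    by submission time already.
    """
    last_sub = {}      # user -> community of their most recent submission
    seen = {}          # user -> set of communities they have posted in
    edges = Counter()  # (i, j) -> count of first posts in j right after i

    for user, sub in submissions:
        visited = seen.setdefault(user, set())
        if sub not in visited:
            prev = last_sub.get(user)
            if prev is not None:
                # First-ever post in `sub`: credit the edge prev -> sub.
                edges[(prev, sub)] += 1
            visited.add(sub)
        # Every submission updates the "most recent" community.
        last_sub[user] = sub
    return edges

toy = [("u1", "a"), ("u1", "b"), ("u1", "a"), ("u1", "c"),
       ("u2", "b"), ("u2", "a")]
toy_edges = count_first_post_edges(toy)
```

In the toy data, u1's second post in "a" adds no edge (it is not a first post) but does make "a" the most recent community, so the later first post in "c" is counted as a -> c.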

I then took a very simple Bayesian approach to building a posterior distribution for the proportion of the time that people submit to j after submitting to i (rather than the reverse). If $$C_{i,j}$$ is the count of times that j was posted in after i, then the posterior with a uniform prior is $$B(C_{i,j} + 1, C_{j,i} + 1)$$, where $$B$$ is the Beta distribution. I then calculate the proportion of the posterior distribution that is < 0.5. This (I think!) can be thought of as the posterior probability that the true proportion is less than 0.5.
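As a small worked illustration (the counts here are made up, not taken from the data): suppose $$C_{i,j} = 2$$ and $$C_{j,i} = 0$$. With a uniform prior the posterior for the proportion $$p$$ is a Beta distribution with parameters $$C_{i,j} + 1 = 3$$ and $$C_{j,i} + 1 = 1$$, whose density is $$3p^2$$ on $$[0, 1]$$. The proportion of the posterior below 0.5 is then

$$P(p < 0.5) = \int_0^{0.5} 3p^2 \, dp = 0.5^3 = 0.125$$

so even two observations in one direction and none in the other still leave 12.5% of the posterior below 0.5.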

By only taking edges where very little of the posterior probability is less than 0.5, we can identify pairs of communities where people are much more likely to post in j after i than the reverse. A dataset with only the edges where the proportion of the posterior below 0.5 is less than .05 is at dataset.
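The filtering step could look something like the sketch below. It is an illustration under assumptions rather than the actual pipeline: the names `posterior_mass_below_half` and `strong_edges` are hypothetical, and the toy counts are invented. Because the Beta parameters are integers here, the posterior mass below 0.5 can be computed exactly as a binomial tail sum using only the standard library; `scipy.stats.beta.cdf(0.5, a, b)` would be the more usual choice.

```python
from math import comb

def posterior_mass_below_half(c_ij, c_ji):
    """P(p < 0.5) under a Beta(c_ij + 1, c_ji + 1) posterior.

    For integer parameters a, b the Beta CDF at x equals
    P(Binomial(a + b - 1, x) >= a); at x = 0.5 every outcome of the
    binomial has weight 1 / 2**n, so the tail is a sum of binomial
    coefficients.
    """
    n = c_ij + c_ji + 1
    tail = sum(comb(n, k) for k in range(c_ij + 1, n + 1))
    return tail / 2**n

def strong_edges(counts, cutoff=0.05):
    """Keep edges (i, j) whose posterior mass below 0.5 is under cutoff."""
    kept = {}
    for (i, j), c_ij in counts.items():
        c_ji = counts.get((j, i), 0)  # reverse direction, 0 if never seen
        mass = posterior_mass_below_half(c_ij, c_ji)
        if mass < cutoff:
            kept[(i, j)] = mass
    return kept

# Invented counts for illustration only.
toy_counts = {("a", "b"): 9, ("b", "a"): 1, ("a", "c"): 3, ("c", "a"): 3}
toy_strong = strong_edges(toy_counts)
```

With these toy counts only the a -> b edge survives the .05 cutoff; the balanced a/c pair has exactly half of its posterior below 0.5 in either direction.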

Natural Experiment Data
I have also been scraping the reddit homepage to grab the subreddits that appear on the leaderboard. My tentative plan is to identify communities which are part of strong pathways, and then to compare users who were exposed to these communities via the leaderboard during the first few days after joining with users for whom these subreddits did not appear on the leaderboard (or did not appear as high up).

Problems and Questions
I've been thinking about possible problems with this approach and would love others to help me identify them (and identify solutions!).


 * I think the age of a community may be a big problem: if a user posts in i before j exists, then they couldn't have posted in j first.
 * At first I thought this was the same as testing whether $$P(i|j) > P(j|i)$$, but it is not quite. The downside of that approach is that pathways can get diluted in the case where a subreddit acts as a gateway to multiple other subreddits. I think that my new approach is closer to the intuition I wanted.
 * I also only look at the immediately subsequent subreddit, and only the first time someone posts in a given subreddit. This is not obviously the best way, but it controls somewhat for heterogeneity in activity levels.
 * This produces a lot of edges! I would like to figure out a good way to prune them - the simplest is to keep limiting by the proportion of the posterior < .5, but maybe there are some other ways?
 * What do I do now? What pathways are interesting? Pathways to banned subs?

Some very preliminary results
Here are the neighbors of r/conspiracy, when using a cutoff of .0001.
 * People move from these communities to r/conspiracy:
 * adviceanimals
 * askreddit
 * codzombies
 * dota2
 * fantasybaseball
 * funny
 * leagueoflegends
 * pewdiepiesubmissions
 * pics
 * politics
 * prettylittleliars
 * reddit.com
 * showerthoughts
 * squaredcircle
 * teenagers
 * the_donald
 * trees


 * People move from r/conspiracy to these communities:
 * astronomy
 * bad_cop_no_donut
 * cannabis
 * cbts_stream
 * christianity
 * circleoftrust
 * collapse
 * creepy
 * documentaries
 * economics
 * environment
 * fullmoviesonyoutube
 * health
 * hiphoptruth
 * history
 * latestagecapitalism
 * legaladvice
 * lifeprotips
 * military
 * mycology
 * nottheonion
 * occult
 * philosophy
 * quotes
 * redditrequest
 * science
 * shadowban
 * space
 * thecalmbeforethestorm
 * topmindsofreddit
 * ufos
 * upliftingnews
 * futurology
 * ideasfortheadmins
 * activism
 * anticonsumption
 * conspiracytheories
 * conspiratard
 * hailcorporate
 * preppers
 * wikileaks
 * 911truth
 * actualconspiracies
 * alternativehistory
 * altnewz
 * c_s_t
 * conspiracies
 * conspiracy_commons
 * conspiracydocumentary
 * conspiracyfact
 * conspiracyfacts
 * conspiracyhub
 * conspiracyii
 * conspiracymemes
 * conspiracyright
 * conspiracyundone
 * conspiro
 * culturallayer
 * descentintotyranny
 * endlesswar
 * falseflagwatch
 * fringetheory
 * governmentoppression
 * highstrangeness
 * holofractal
 * intelligence
 * jfkresearcher
 * limitedhangouts
 * occultconspiracy
 * pedogate
 * propaganda
 * romerules
 * truthleaks
 * unagenda21
 * undelete
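
Neighbor lists like the ones above could be read off a filtered edge set along these lines. This is only a sketch: the function name `neighbors` and the representation of edges as `(i, j)` pairs are assumptions, and the toy edges are placeholders rather than real results.

```python
def neighbors(strong_edge_pairs, community):
    """Split a community's neighbors into incoming and outgoing lists.

    `strong_edge_pairs` is assumed to be an iterable of (i, j) pairs that
    already passed the posterior cutoff, meaning people tend to move
    from i to j.
    """
    pairs = list(strong_edge_pairs)
    incoming = sorted(i for (i, j) in pairs if j == community)
    outgoing = sorted(j for (i, j) in pairs if i == community)
    return incoming, outgoing

# Placeholder edges, not actual findings.
toy_pairs = [("x", "hub"), ("y", "hub"), ("hub", "z"), ("x", "z")]
toy_in, toy_out = neighbors(toy_pairs, "hub")
```

Here `toy_in` holds the communities people move from into "hub" and `toy_out` the communities they move to afterwards.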