The reddit_cdsc [https://code.communitydata.science/cdsc_reddit.git/ git repository] contains tools for working with Reddit data. The project is designed for the Hyak supercomputing system at the University of Washington. It consists of a set of Python and Bash scripts and uses [https://spark.apache.org/docs/latest/api/python/index.html pyspark] and [https://arrow.apache.org/docs/python/ pyarrow] to process large datasets. As of March 1st, 2021, the project is under active development by [https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Washington.29 Nate TeBlunthuis] and provides scripts for:


* Pulling and updating dumps from [https://pushshift.io Pushshift] in <code>dumps/pull_pushshift_comments.sh</code> and <code>dumps/pull_pushshift_submissions.sh</code>.
* Uncompressing and parsing the dumps into [https://parquet.apache.org/ Parquet] [https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets datasets] using scripts in <code>datasets</code> (see the reading sketch after this list).
* Running text analysis based on [https://en.wikipedia.org/wiki/Tf%E2%80%93idf TF-IDF], including:
** Extracting terms from Reddit comments in <code>ngrams/tf_comments.py</code>.
** Detecting common phrases based on [https://en.wikipedia.org/wiki/Pointwise_mutual_information pointwise mutual information] in <code>ngrams/top_comment_phrases</code>.
** Building TF-IDF vectors for each subreddit in <code>similarities/tfidf.py</code>, and also at the subreddit-week level.
** Computing cosine similarities between subreddits based on TF-IDF in <code>similarities/cosine_similarities.py</code>.
* Measuring similarity and clustering subreddits based on user overlaps, using TF-IDF (and also plain frequency) cosine similarities of commenters.
** Clustering subreddits based on user and term similarities in <code>clustering/clustering.py</code>.
* [https://github.com/google/python-fire Fire-based] command line interfaces to make it easier for others to extend and reuse this work in their own projects!
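
For orientation, here is a minimal sketch of reading a slice of one of the Parquet datasets with pyarrow. The path and column names below are hypothetical placeholders, not the project's actual layout; see the [https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets Hyak datasets page] for the real datasets.

<syntaxhighlight lang="python">
# Minimal sketch: read part of a Reddit comments Parquet dataset with pyarrow.
# The path and column names are hypothetical placeholders; consult the
# Hyak datasets wiki page for the actual dataset layout.
import pyarrow.dataset as ds

dataset = ds.dataset("/gscratch/comdata/output/reddit_comments.parquet",
                     format="parquet")

# Read only the columns we need; the filter is pushed down to the files.
table = dataset.to_table(
    columns=["subreddit", "author", "body"],
    filter=ds.field("subreddit") == "askscience",
)
df = table.to_pandas()
print(df.head())
</syntaxhighlight>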


[[File:Reddit Dataflow.jpg|left|thumb|Dataflow diagram illustrating what pieces of code and data go into producing subreddit similarity measures and clusters [https://miro.com/app/board/o9J_lSiN4TM=/ (link to miro board)]]]


The TF-IDF for comments still has some kinks to iron out, such as removing hyperlinks and bot comments. Right now, subreddits that have similar automoderation messages appear very similar.


== Pulling data from [https://pushshift.io Pushshift] ==


== Subreddit Similarity ==
By default, the scripts in <code>similarities</code> take a <code>TopN</code> parameter, which selects the subreddits to include in the similarity dataset according to how many total comments they have. You can alternatively pass the <code>included_subreddits</code> parameter a path to a file containing the names of the subreddits you would like to include, one per line.
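
To make the two selection modes concrete, here is a rough sketch of the logic in plain pandas. The function and column names are illustrative only, not the actual implementation in <code>similarities</code>:

<syntaxhighlight lang="python">
# Illustrative sketch of the two subreddit-selection modes; the actual
# implementation in the similarities scripts may differ.
from typing import Optional, Set
import pandas as pd

def select_subreddits(comment_counts: pd.DataFrame,
                      topN: int = 10000,
                      included_subreddits: Optional[str] = None) -> Set[str]:
    """comment_counts has columns ['subreddit', 'n_comments'] (hypothetical)."""
    if included_subreddits is not None:
        # A file listing one subreddit name per line overrides TopN.
        with open(included_subreddits) as f:
            return {line.strip() for line in f if line.strip()}
    # Otherwise keep the TopN subreddits by total comment count.
    top = comment_counts.sort_values("n_comments", ascending=False).head(topN)
    return set(top["subreddit"])
</syntaxhighlight>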


=== Datasets ===
=== Datasets ===


Subreddit similarity datasets based on comment terms and comment authors are available on Hyak in <code>/gscratch/comdata/output/reddit_similarity</code>. The overall approach to subreddit similarity seems to work reasonably well and the code is stabilizing. If you want help using these similarities in a project, just reach out to [[User:groceryheist | Nate]].
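
If you just want to load one of these similarity datasets, something like the following should work. The file name below is a made-up placeholder, so list the directory first to find the term- or author-based dataset you want:

<syntaxhighlight lang="python">
# Sketch: load a subreddit similarity dataset from Hyak with pandas.
# The file name is a placeholder; list /gscratch/comdata/output/reddit_similarity
# to see which term- and author-based datasets are actually available.
import pandas as pd

sims = pd.read_parquet(
    "/gscratch/comdata/output/reddit_similarity/EXAMPLE_FILENAME.parquet"
)

# Inspect the shape and columns to confirm the layout before using it.
print(sims.shape)
print(sims.columns[:10])
</syntaxhighlight>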






Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors). In linear algebra, the dot product (<math display="inline">\cdot</math>) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).<br />
The vectors might have different lengths, for example if one subreddit has more words in its comments than the other, so in cosine similarity the dot product is normalized by the magnitudes (lengths) of the vectors. It turns out that this is equivalent to taking the cosine of the angle between the two vectors, so cosine similarity in essence quantifies the angle between the two lines in high-dimensional space. The greater the cosine similarity between two subreddits, the more correlated their tf-idf vectors.
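
As a worked example, here is the computation in a few lines of numpy on two toy tf-idf vectors (the numbers are made up purely for illustration):

<syntaxhighlight lang="python">
# Worked example: cosine similarity between two toy tf-idf vectors.
# The vectors are made-up numbers, purely for illustration.
import numpy as np

a = np.array([0.0, 2.0, 1.0, 0.5])  # tf-idf vector for subreddit A
b = np.array([1.0, 1.5, 0.0, 0.5])  # tf-idf vector for subreddit B

# Dot product normalized by the vector magnitudes...
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals the cosine of the angle between the vectors.
print(cos_sim)  # ~0.76: fairly similar term distributions
</syntaxhighlight>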


Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities.