=== Cosine Similarity ===

Once the tf-idf vectors are built, computing a similarity score between two subreddits is straightforward using cosine similarity.

<math display="inline">\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }</math>

Intuitively, we represent two subreddits as lines in a high-dimensional space (their tf-idf vectors). In linear algebra, the dot product (<math display="inline">\cdot</math>) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).<br />

The vectors may have different magnitudes, for example if one subreddit has more words in its comments than the other, so cosine similarity normalizes the dot product by the magnitudes (lengths) of the vectors. This turns out to be equivalent to taking the cosine of the angle between the two vectors. So cosine similarity, in essence, quantifies the angle between the two lines in high-dimensional space. The greater the cosine similarity between two subreddits, the more correlated their tf-idf vectors are.

Cosine similarity with tf-idf is popular (indeed, it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities. Compared to other approaches to similarity, like those using word embeddings or topic models, it may struggle to handle polysemy, synonymy, or correlations between different terms. Using phrase detection helps with this a little. The advantages of this approach are simplicity and scalability. I’m thinking about using [https://en.wikipedia.org/wiki/Latent_semantic_analysis Latent Semantic Analysis] as an intermediate step to improve upon similarities based on raw tf-idfs.
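The formula above can be sketched in a few lines of NumPy. This is only an illustration, not part of the pipeline; the tf-idf values here are made up, and the vectors are assumed to share the same vocabulary (one entry per term):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    dot product divided by the product of the magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical tf-idf vectors for two subreddits over a shared 4-term vocabulary.
subreddit_a = np.array([0.2, 0.0, 1.5, 0.7])
subreddit_b = np.array([0.1, 0.9, 1.2, 0.0])

print(round(cosine_similarity(subreddit_a, subreddit_b), 3))  # ≈ 0.726
```

Note that because tf-idf entries are non-negative, the similarity falls between 0 (no shared characteristic terms) and 1 (identical term profiles).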
Even so, computing similarities between a large number of subreddits is computationally expensive: it requires <math display="inline">n(n-1)/2</math> dot-product evaluations. This can be sped up by passing <code>similarity-threshold=X</code>, where <math display="inline">X>0</math>, to <code>term_comment_similarity.py</code>. I used a cosine similarity function built into the Spark matrix library, which supports the <code>DIMSUM</code> algorithm for approximating matrix-matrix products. This algorithm is commonly used in industry (e.g. at Twitter and Google) for large-scale similarity scoring.
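The actual pipeline relies on Spark's DIMSUM implementation, but the brute-force computation it avoids can be sketched in plain Python. This toy example (hypothetical tf-idf rows, made-up threshold) shows the <math display="inline">n(n-1)/2</math> pair enumeration and the effect of a similarity threshold that discards weakly similar pairs:

```python
import numpy as np
from itertools import combinations

# Hypothetical tf-idf matrix: one row per subreddit, one column per term.
tfidf = np.array([
    [0.2, 0.0, 1.5, 0.7],
    [0.1, 0.9, 1.2, 0.0],
    [0.0, 1.1, 0.0, 0.3],
])

# Normalize each row to unit length so every dot product is a cosine similarity.
unit = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

threshold = 0.5  # analogous in spirit to similarity-threshold=X
pairs = list(combinations(range(len(unit)), 2))  # all n(n-1)/2 pairs; 3 for n=3
similar = [(i, j, float(unit[i] @ unit[j]))
           for i, j in pairs
           if unit[i] @ unit[j] >= threshold]
print(similar)  # only the pairs that clear the threshold
```

DIMSUM avoids materializing all <math display="inline">n(n-1)/2</math> dot products by sampling columns in proportion to their magnitudes, so higher thresholds permit more aggressive sampling and faster (approximate) results.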