=== Cosine Similarity ===

Once the tf-idf vectors are built, computing a similarity score between two subreddits is straightforward using cosine similarity.

<math display="inline">\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }</math>

Intuitively, we represent two subreddits as lines in a high-dimensional space (their tf-idf vectors). In linear algebra, the dot product (<math display="inline">\cdot</math>) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).<br />

The vectors may have different magnitudes, for example if one subreddit has more words in its comments than the other, so cosine similarity normalizes the dot product by the magnitudes (lengths) of the vectors. This turns out to be equivalent to taking the cosine of the angle between the two vectors. So cosine similarity, in essence, quantifies the angle between the two lines in high-dimensional space. The greater the cosine similarity between two subreddits, the more correlated their tf-idf vectors are.

Cosine similarity with tf-idf is popular (indeed, it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities. Compared to other approaches to similarity, like those using word embeddings or topic models, it may struggle to handle polysemy, synonymy, or correlations between different terms. Using phrase detection helps with this a little. The advantages of this approach are simplicity and scalability. I’m thinking about using [https://en.wikipedia.org/wiki/Latent_semantic_analysis Latent Semantic Analysis] as an intermediate step to improve upon similarities based on raw tf-idfs.
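The formula above can be sketched in a few lines of NumPy. This is only an illustration, not part of the pipeline; the tf-idf values here are made up, and the vectors are assumed to share the same vocabulary (one entry per term):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    dot product divided by the product of the magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical tf-idf vectors for two subreddits over a shared 4-term vocabulary.
subreddit_a = np.array([0.2, 0.0, 1.5, 0.7])
subreddit_b = np.array([0.1, 0.9, 1.2, 0.0])

print(round(cosine_similarity(subreddit_a, subreddit_b), 3))  # ≈ 0.726
```

Note that because tf-idf entries are non-negative, the similarity falls between 0 (no shared characteristic terms) and 1 (identical term profiles).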
Even so, computing similarities between a large number of subreddits is computationally expensive: it requires <math display="inline">n(n-1)/2</math> dot-product evaluations. This can be sped up by passing <code>similarity-threshold=X</code>, where <math display="inline">X>0</math>, to <code>term_comment_similarity.py</code>. I used a cosine similarity function built into the Spark matrix library, which supports the <code>DIMSUM</code> algorithm for approximating matrix-matrix products. This algorithm is commonly used in industry (e.g. at Twitter and Google) for large-scale similarity scoring.
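The actual pipeline relies on Spark's DIMSUM implementation, but the brute-force computation it avoids can be sketched in plain Python. This toy example (hypothetical tf-idf rows, made-up threshold) shows the <math display="inline">n(n-1)/2</math> pair enumeration and the effect of a similarity threshold that discards weakly similar pairs:

```python
import numpy as np
from itertools import combinations

# Hypothetical tf-idf matrix: one row per subreddit, one column per term.
tfidf = np.array([
    [0.2, 0.0, 1.5, 0.7],
    [0.1, 0.9, 1.2, 0.0],
    [0.0, 1.1, 0.0, 0.3],
])

# Normalize each row to unit length so every dot product is a cosine similarity.
unit = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

threshold = 0.5  # analogous in spirit to similarity-threshold=X
pairs = list(combinations(range(len(unit)), 2))  # all n(n-1)/2 pairs; 3 for n=3
similar = [(i, j, float(unit[i] @ unit[j]))
           for i, j in pairs
           if unit[i] @ unit[j] >= threshold]
print(similar)  # only the pairs that clear the threshold
```

DIMSUM avoids materializing all <math display="inline">n(n-1)/2</math> dot products by sampling columns in proportion to their magnitudes, so higher thresholds permit more aggressive sampling and faster (approximate) results.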