Editing CommunityData:CDSC Reddit
Intuitively, we represent two subreddits as vectors (tf-idf vectors), which can be pictured as lines from the origin, in a high-dimensional space. In linear algebra, the dot product (<math display="inline">\cdot</math>) of two vectors is the sum of the products of their corresponding entries (e.g. a linear regression prediction is the dot product of a vector of covariates and a vector of weights).<br />
The vectors might have different lengths, for example if one subreddit has more words in its comments than the other, so cosine similarity normalizes the dot product by the magnitudes (lengths) of the vectors. It turns out that this is equivalent to taking the cosine of the angle between the two vectors. So cosine similarity in essence quantifies the angle between the two lines in high-dimensional space. The greater the cosine similarity between two subreddits, the more correlated their tf-idf vectors are.
Cosine similarity with tf-idf is popular (indeed, it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities.
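The normalization described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used by the project: the function names and the three short vectors are hypothetical, and real tf-idf vectors would have one entry per term in a much larger vocabulary.

```python
import math


def dot(u, v):
    """Dot product: sum of the products of corresponding entries."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        # Assumption for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Hypothetical tf-idf weights over the same three-term vocabulary.
subreddit_a = [1.0, 2.0, 0.0]
subreddit_b = [2.0, 4.0, 0.0]  # same direction as subreddit_a, but longer
subreddit_c = [0.0, 0.0, 3.0]  # shares no terms with subreddit_a

sim_ab = cosine_similarity(subreddit_a, subreddit_b)
sim_ac = cosine_similarity(subreddit_a, subreddit_c)
```

Here <code>subreddit_b</code> is twice as long as <code>subreddit_a</code> but points in the same direction, so <code>sim_ab</code> is (up to floating-point rounding) 1, while <code>subreddit_c</code> has no terms in common with <code>subreddit_a</code>, so <code>sim_ac</code> is 0.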