Editing CommunityData:CDSC Reddit
From CommunityData
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.
The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then publish the changes below to finish undoing the edit.
Latest revision | Your text | ||
Line 117: | Line 117: | ||
Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors). In linear algebra, the dot product (<math display="inline">\cdot</math>) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).<br /> | Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors). In linear algebra, the dot product (<math display="inline">\cdot</math>) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).<br /> | ||
The vectors might have different lengths like if one subreddit has | The vectors might have different lengths like if one subreddit has words in comments than the other, so in cosine similarity the dot product is normalized by the magnitude (lengths) of the vectors. It turns out that this is equivalent to taking the cosine of the two vectors. So cosine similarity in essence quantifies the angle between the two lines in high-dimensional space. If the cosine similarity between two subreddits is greater then their tf-idf vectors are more correlated. | ||
Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities. | Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities. |