=== Building TF-IDF vectors ===

The process for building TF-IDF vectors has four steps:

# Extracting terms using <code>tf_comments.py</code>
# Detecting common phrases using <code>top_comment_phrases.py</code>
# Extracting terms and common phrases using <code>tf_comments.py --mwe-pass='second'</code>
# Building idf and tf-idf scores in <code>idf_comments.py</code>

==== Running <code>tf_comments.py</code> on the backfill queue ====

The main reason for splitting the process into four steps instead of one is to take advantage of the backfill queue when running <code>tf_comments.py</code>. This step requires reading all of the text of every comment and converting it to a bag of words at the subreddit level. That is a lot of computation, but it is easily parallelizable. The script <code>run_tf_jobs.sh</code> partially automates running step 1 (or 3) on the backfill queue.

==== Phrase detection using Pointwise Mutual Information ====

TF-IDF is simple, but it only uses single words (unigrams). Sequences of multiple words can be important for capturing how words have different meanings in different contexts, or how sequences of words refer to distinct things like names. Dealing with context or longer sequences of words is a common challenge in natural language processing because the number of possible n-grams grows rapidly as n increases. Phrase detection addresses this problem by limiting the set of n-grams to the most informative ones.

But how do we detect phrases? I implemented [https://en.wikipedia.org/wiki/Pointwise_mutual_information pointwise mutual information] (PMI), which is a fairly simple approach but seems to work well. PMI is a quantity derived from information theory. The intuition is that if two words occur together much more frequently than they appear separately, then the co-occurrence is likely to be informative.

<math display="inline">\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.</math>

In <code>tf_comments.py</code>, if <code>--mwe-pass=first</code>, a 10% sample of 1-4-grams (sequences of terms up to length 4) will be written to a file to be consumed by <code>top_comment_phrases.py</code>. <code>top_comment_phrases.py</code> computes the PMI for these candidate phrases and writes out those that occur at least 3500 times in the sample of n-grams and have a PMI of at least 3 (about 65,000 expressions). <code>tf_comments.py --mwe-pass=second</code> then uses the detected phrases and adds them to the term frequency data.
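To make the PMI computation concrete, here is a minimal Python sketch of phrase detection for bigrams only. This is not the actual implementation in <code>top_comment_phrases.py</code> (which scores 1-4-grams drawn from a 10% sample); the function names are illustrative, and the default thresholds simply mirror the ones described above.

<syntaxhighlight lang="python">
import math
from collections import Counter

def pmi(bigram_count, count_x, count_y, n_unigrams, n_bigrams):
    """pmi(x;y) = log( p(x,y) / (p(x) p(y)) ), estimated from counts."""
    p_xy = bigram_count / n_bigrams
    p_x = count_x / n_unigrams
    p_y = count_y / n_unigrams
    return math.log(p_xy / (p_x * p_y))

def detect_phrases(tokens, min_count=3500, min_pmi=3.0):
    """Return bigrams that pass the count and PMI thresholds.

    Defaults mirror the thresholds described above; this sketch
    handles bigrams only, not 1-4-grams like the real pipeline.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue
        score = pmi(c, unigrams[x], unigrams[y], n_uni, n_bi)
        if score >= min_pmi:
            phrases[(x, y)] = score
    return phrases

if __name__ == "__main__":
    # Toy demo with thresholds lowered for a tiny corpus:
    # "new york" co-occurs more often than its parts predict.
    tokens = "new york is a big city and new york never sleeps".split()
    print(detect_phrases(tokens, min_count=2, min_pmi=1.0))
</syntaxhighlight>

On real data the thresholds matter: raising <code>min_count</code> filters out rare n-grams whose PMI estimates are noisy, while raising <code>min_pmi</code> keeps only sequences that co-occur far more often than chance.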