Text as Data (Spring 2026)
- COM 597 B / CSSS 594 A: Text as Data - Department of Communication, University of Washington
- Instructor:
- Teaching Assistant:
- Eddie Hock / ehock@uw.edu
- Course Meetings: 10:30–12:20pm, Mondays & Wednesdays, in PAR 120
- Course Websites:
- Canvas: for announcements and turning in assignments
- A group chat system of "our" choice: Signal? Discord? Slack? Element/Matrix? We will discuss this in the first class.
- Discussion opt-out form: to let me know if you are not coming to class or don't want to be called upon that day for any reason (remember: you must fill this out at least one hour before class begins)
- Everything else will be linked on this page.
Overview and Learning Objectives
Text is everywhere. Increased digitization has led to a proliferation of social media posts, news articles, government documents, interview transcripts, court records, and more. New advances in AI mean that images, audio, and video can also be converted into and analyzed as data.
This advanced graduate methods course is targeted at PhD-student researchers across the social sciences. It seeks to teach students to use text as a source of data in statistical social scientific analyses, with a focus on integrating textual data into quantitative analyses. While many students will be most familiar with the use of large language models, the course also covers a range of unsupervised methods (e.g., topic modeling) and supervised machine learning (e.g., automated classification), as well as dictionary-based approaches and textual embeddings. Issues related to data collection, preparation, ethics, and validation are addressed throughout.
Students from communication, political science, sociology, and public policy will find examples, papers, and datasets from their fields incorporated into the syllabus. Students from a wide range of social-scientific fields including information science, public health, marketing, and many other disciplines will also find the course useful.
I will consider the course a complete success if every student is able to do all of these things at the end of the quarter:
- Describe and compare the major approaches to computational text analysis, including dictionary methods, topic modeling, supervised classification, word embeddings, and LLM-based approaches.
- Articulate the particular assumptions, strengths, and threats to validity associated with each method.
- Read, interpret, and critically evaluate empirical social scientific research that uses text as data methods.
- For each method, carry out a scaffolded analysis, including producing code in Python or R and interpreting and describing what they have done, the code they have produced, and what they have found.
- For at least one method, design and carry out an original analysis using a corpus relevant to their own research.
Prerequisites
Although there are no formal prerequisites for this course, there are two things you will need to be prepared:
- An introductory statistics sequence up through linear regression. This will usually take the form of at least two introductory applied statistics courses at the level of SOC 504 and SOC 505 (or equivalent). This is a standard prerequisite for CSSS advanced statistics courses.
- Basic fluency in reading, writing, and modifying code in either the Python or R programming languages.
Note About This Syllabus
This is my first time teaching this course. It's also a rapidly changing area. As a result, there is still quite a bit up in the air and we're going to spend a bunch of time figuring this out together as we go along.
You should expect this syllabus to be a dynamic document, not a contract. Although the core expectations for this class are fixed, the details of readings and assignments will shift based on how the class goes, any guest speakers I might arrange, my readings in this area, etc. As a result, there are three important things to keep in mind:
- Although details on this syllabus will change, I will try to ensure that I never change readings less than six days before they are due. This means that if I don't fill in a reading marked "[To Be Decided]" at least six days before it's due, it is dropped. If I don't change something marked "[Tentative]" before that deadline, then it is assigned. This also means that if you plan to read more than six days ahead, contact me first.
- Because this syllabus is a wiki, you can track every change by clicking the history button on this page when I make changes. I will summarize these changes in the weekly announcement on Canvas that will be emailed to everybody in the class. Closely monitor your email or the announcements section on the course website on Canvas to ensure you don't miss these announcements.
- I will frequently ask the class for voluntary, anonymous feedback, especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments to the courses that I teach while the quarter progressed based on this feedback.
Readings

This course will rely heavily on the book Text as Data by Justin Grimmer, Maggie Roberts, and Brandon Stewart. I expect you all to have access to this book:
- Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press.
The book is an updated version of what is now "the" classic text and was published in 2022. It's excellent but there have been some very big changes in the last several years. The most obvious one is massive advances in transformers and large language models (LLMs) which are only touched on very briefly in the book. I will be supplementing and/or replacing some of the text.
You will be asked to conduct analyses in R or Python throughout the course and to modify code that either I or the TA share with you. There are two books that may help, one for folks using Python and one for folks using R. Although I won't be assigning chapters of these books because I know people's backgrounds will vary, I will attempt to list relevant sections of these books in the optional readings:
- [Python] Hovy, Dirk. 2021. Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press. https://www.cambridge.org/core/elements/text-analysis-in-python-for-social-scientists/BFAB0A3604C7E29F6198EA2F7941DFF3. [Available from UW libraries]
- [R] Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media. [Available through UW libraries]
Some other useful books are:
- Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O’Reilly Media. [Available through UW libraries]
- Brown, Taylor R. 2023. An Introduction to R and Python for Data Analysis: A Side-By-Side Approach. CRC Press. https://doi.org/10.1201/9781003263241. [Available from UW libraries]
- Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. Chapman and Hall/CRC. [Available from instructor]
- Jockers, Matthew L., and Rosamond Thalken. 2020. Text Analysis with R. Springer. https://link.springer.com/book/10.1007/978-3-319-03164-4. [Available from UW libraries]
Access to Readings
Many readings are marked "[Available from UW libraries]" or "[Available through UW libraries]". Most of these will be accessible to anybody who connects from the UW network. This means that if you're on campus, it will likely work. Although you can go through the UW libraries website to get most of these, the easiest way is using the UW library proxy bookmarklet. This is a little button you can drag and drop onto the bookmarks toolbar on your browser. When you press the button, it will ask you to log in using your UW NetID and then will automatically send your traffic through UW libraries. You can also use the other tools on this UW libraries webpage.
Workload
This is a 5-credit class. According to UW policy, students should expect to devote about 3 hours per week per credit, on average across weeks and students. With this in mind, I plan to assign about a book's worth of reading each week. Because we will spend 3-4 hours in class and an hour or two on assignments, on average, I expect everybody to read for about 8-10 hours each week (i.e., about one book's worth of reading time). For some people, reading a book's worth of articles will take longer. For many, it will take less.
I understand this class involves a lot of reading compared to some other courses, especially outside of the social sciences. Historically, students suggest my courses take more time than most classes at UW but less time per week (on average) than 3 hours per credit. Please let me know if you are spending more than 15 hours a week on the class.
Assignments
Your assignments consist of two major components: (1) problem sets and in-class discussion, and (2) a final research project. Your grade in the course will be assessed as described in the #Grading and Assessment section of this page.
There will be no exams or quizzes. Unless otherwise noted, all assignments are due at the end of the day (i.e., 11:59pm on the day they are due).
Daily Problem Sets
In advance of each class, I will post a list of questions and/or tasks related to the course topic for the day. On Understanding days, these focus entirely on the methods texts and papers that we're reading for the day. On Applying days, they will include both these kinds of questions and others that require you to write code in Python or R to conduct analyses.
For each Applying day, you will need to push any code you write to the class GitHub repository by 11:59pm on the night before class so that we have time to review your code before class. Applying days are always on Wednesdays, so code is due by 11:59pm on Tuesdays — except in the final week, when the schedule shifts. With the exception of code, we will not be asking you to turn in answers to the problem sets.
Instead, we will cold call students in the class and ask you to provide and explain your answers to the questions. On Applying days, we will pull your code up on the projector and ask you to walk through it, potentially line by line, explaining what it is doing and justifying and interpreting the impact of your technical decisions on the substantive outcomes.
To help with writing your code, you are welcome to use AI tools and coding agents to write most or even all of your code. In fact, I recommend you use these tools. I have started using Claude Code extensively on my own. Given that I expect you to use these tools, I am budgeting much less time for you to spend puzzling through coding challenges than I would have if I taught this class even one or two years ago. More details are available in the #Use of AI Tools section of this syllabus.
Cold Calling
If you will not be in class or do not want to be called upon that day for any reason, please fill out the discussion opt-out form at least one hour before class begins.
Because I understand that cold calling can be terrifying for some students, I will be circulating a list of questions (labeled "Problem Set" in the syllabus) that describe the kinds of questions I am likely to ask each session along with the weekly announcements (i.e., at least 6 days in advance). Although it is a very good idea to take notes guided by these questions or to write out answers to these questions in advance, we will not be collecting these answers. You are welcome to work with other students to brainstorm possible answers. Although I will also ask questions I do not distribute ahead of time, these questions will give you a good sense of where to focus your reading and note-taking.
I have written a computer program that will generate a random list of students each day, and I will use this list to randomly cold call students in the class. To try to maintain participation balance, the program will try to ensure that everybody is cold called a similar number of times during the quarter. Although there is always some chance that you will be called upon next, you will become less likely to be called upon relative to your classmates each time you are called upon.
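The balancing behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the instructor's actual program: the function name `pick_student`, the roster names, and the rule that each prior call halves a student's relative weight are all hypothetical. The sketch keeps the two properties the paragraph names: everybody who has not opted out always has some chance of being called, and that chance shrinks relative to classmates after each call.

```python
import random
from collections import Counter


def pick_student(roster, times_called, opted_out=frozenset(), rng=random):
    """Pick one student to cold call and record the call.

    Assumption (not from the syllabus): each prior call halves a
    student's relative weight, so nobody's chance ever reaches zero.
    """
    # Students who filled out the opt-out form are skipped for the day.
    eligible = [s for s in roster if s not in opted_out]
    if not eligible:
        raise ValueError("no eligible students today")
    # A Counter returns 0 for students who have never been called.
    weights = [0.5 ** times_called[s] for s in eligible]
    chosen = rng.choices(eligible, weights=weights, k=1)[0]
    times_called[chosen] += 1
    return chosen


# Hypothetical roster; "Ben" opted out of today's discussion.
roster = ["Ana", "Ben", "Chen", "Dee"]
calls = Counter()
todays_calls = [pick_student(roster, calls, opted_out={"Ben"}) for _ in range(6)]
print(todays_calls, dict(calls))
```

Halving is just one possible weighting rule. Any weight that decreases with the number of prior calls but stays positive would match the description equally well.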
For coding challenges, I will randomly select a student's code and display it on the screen, and ask them to walk through and explain their solution.
Assessment
Assessment will be based on the familiarity and fluency you show with the readings. On Applying days, it will be based on the degree to which you can explain your code and interpret the results it produces. I have placed detailed information in the case discussion section of my assessment page, which describes both the rubric I will use to assess your daily problem sets and how I will compute final grades. The assessment page mentions "case discussions." We will not be running formal cases, but we will be assessing and grading your work in the same way.
Final Project
For the final project, I want you to take what you've learned in the class and apply it to an original research project. As a demonstration of your learning in this course, you will design a plan for a text-as-data research project and will, if possible, also collect and analyze (at least) an initial sample of a dataset that you will use to complete the project.
The paper you produce can take one of several forms, including:
- A draft of a manuscript for submission to a conference or journal.
- A proposal for funding (e.g., for submission to the NSF for a graduate student fellowship).
- A draft of the methods chapter of your dissertation.
Whichever of the three paths you choose, I expect you to take this opportunity to produce a document that will further your academic career outside of the class.
If none of these approaches works for you, I am open to a fourth path and willing to discuss other possible deliverables. In any case, I will want a clear set of deliverables articulated in writing as part of the #Final Project Identification assignment.
Final Project Identification
- Due Date: April 17
- Maximum paper length: 800 words (~3 pages)
- Deliverables: Turn in via the appropriate Canvas dropbox
Early on, I want you to identify your final project. Your proposal should be short and can be either paragraphs or bullets. It should include the following things:
- The genre of the project and a short description of how it fits into your career trajectory. This will serve as a description of the deliverables. For ideas about genre, see the description in the #Final Project section above.
- A one-paragraph abstract of the proposed study and research question, theory, community, and/or groups you plan to study.
- A short description of the dataset you plan to collect or use, including its source and approximate size.
Final Project Check-ins
- Grading: Complete / Incomplete (5% of final grade)
As part of the final project, you are required to meet with the TA twice during the quarter for an approximately 20-minute check-in on your project progress. These meetings can take place during the TA's regular office hours (Tuesdays, 1:30–3:00pm, SAV 216-D) or at another time arranged directly.
- Check-in 1 must take place before the outline is due on May 15.
- Check-in 2 must take place between May 15 and June 9.
Both check-ins are graded complete/incomplete. You will receive credit for each meeting simply by attending and discussing your project.
Final Paper
- Paper Due Date: June 12
- Maximum final paper length: 8000 words (~27 double-spaced pages)
- All Deliverables: Turn in via the appropriate Canvas dropbox
Because the emphasis in this class is on methods and because I'm not an expert in each of your areas or fields, I'm happy to assume that your paper, proposal, or thesis has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is so important. Instead of providing all of these details, feel free to start with a brief summary of the purpose and importance of this research, and an introduction to your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts.
Whatever you choose to turn in for your final project should include:
- a statement of the purpose, central focus, relevance and significance of your project;
- a description of the dataset to be analyzed — its source, scope, and any known limitations;
- key research questions or hypotheses;
- operationalization of key concepts;
- a description and rationale of the specific method(s) (if more than one method will be used, explain how the methods will produce complementary findings);
- a description of the step-by-step plan for data collection and text preprocessing;
- description and rationale of the level(s), unit(s), and process of analysis (if more than one kind of data are generated, explain how each kind will be analyzed individually and/or comparatively);
- an explanation of how these analyses will enable you to answer the RQs;
- a description of how you validated your text analysis approach;
- a sample dataset and description of a formative analysis you have completed;
- a description of actual or anticipated results and any potential problems with their interpretation;
- a plan for publishing/disseminating the findings from this research;
- a summary of technical, ethical, human subjects, and legal issues that may be encountered in this research, and how you will address them;
- a schedule (using specific dates) and proposed budget if applicable.
I also expect you to begin data collection and analysis for your project and to describe your progress in your paper. If collecting data for a proposed project is impractical (e.g., because of IRB applications, funding, etc.), let's talk.
I prefer that you write this paper individually, but I'm open to the idea that you may want to work with others in the class.
Outline / Draft
- Due Date: May 15
- All Deliverables: Turn in via the appropriate Canvas dropbox
I want you all to turn in an outline or draft several weeks before the final project is due. Although the specific format will vary based on the nature of your project and your progress on it, it should demonstrate major progress on your final deliverables for the class and provide an answer—in outline form—to every applicable item on the list in the #Final Paper section above.
If you're looking for an outline format that is useful for writing papers, I typically use what my group calls Matsuzaki outlines (described in detail on our wiki). The Matsuzaki outline is particularly well-suited to quantitative social scientific work.
If you're looking for information on how to organize a quantitative academic paper in the social sciences, check out my page on the structure of a quantitative empirical research paper.
Final Presentation
- Presentation Dates: June 1 and June 3
Your projects are at different stages, so there will be variation in what is presented. That said, I expect nearly everyone will give one of two kinds of presentations:
- An overview and summary of your final project in its current state so that your classmates and I can give you helpful feedback for your final written project, which is due on June 12. Present your research questions and context, and walk us through the key deliverables and your current progress. Emphasize your text analysis methods since this is what we will be best positioned to provide you feedback on. If you have specific things you want feedback on, please communicate this during your talk and/or on our group chat.
- If your project is a complete paper, you might want to give a full research presentation instead, like the one you would give at a conference. This would be fine as well.
The class will be run as panels. We've sorted people into four groups over two days and will meet in two different rooms. Everybody will have 20 minutes in total for their presentation and the Q&A. The target length for your presentation is 12 minutes, and it should absolutely not run longer than 15 minutes. I expect most people will use slides. Treat this as you would a presentation at your discipline's annual meeting.
Each panel will have a chair (one of the instructors) who will keep time, take notes, and give detailed written feedback on your presentation. Because feedback from the chair is given in writing, the Q&A time is for questions and feedback from the other student attendees.
If you are not presenting, you are welcome to attend either panel in either room. We expect you to listen, engage, and give feedback.
Specific room assignments, panel groupings, and presentation order are posted on a Canvas page.
Grading and Assessment
The writing rubric section of the detailed page on assessment gives the rubric I will use to evaluate both your #Daily Problem Sets and your #Final Project.
Your participation in the course will be assessed using my detailed participation rubric. Please also pay close attention to the section on maintaining participation balance.
I have put together a very detailed page that describes the way I approach assessment and grading — both in general and in this course. Please read it carefully. I will assign grades for each of the following items on the UW 4.0 grade scale according to the weights below:
- Daily Problem Sets / Discussion: 45%
- Final project identification: 5%
- Final project check-ins: 5%
- Final project outline: 10%
- Final project presentation: 10%
- Final project paper: 25%
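As a rough illustration of how the weights above combine, here is a minimal Python sketch that computes a weighted average on the 4.0 scale. The weights are copied from the list above. The dictionary keys, the function name `weighted_grade`, the example scores, and the idea of recording a completed check-in as 4.0 are all assumptions made for illustration. The detailed assessment page, not this sketch, describes how final grades are actually computed.

```python
# Weights copied from the syllabus list (percent of the final grade).
WEIGHTS = {
    "problem_sets_discussion": 45,
    "project_identification": 5,
    "project_check_ins": 5,
    "project_outline": 10,
    "project_presentation": 10,
    "project_paper": 25,
}


def weighted_grade(scores):
    """Return the weighted average of component grades on the 4.0 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the graded components")
    total_weight = sum(WEIGHTS.values())  # the six weights sum to 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical scores for one student (check-ins assumed to count as 4.0).
example = {
    "problem_sets_discussion": 3.8,
    "project_identification": 4.0,
    "project_check_ins": 4.0,
    "project_outline": 3.5,
    "project_presentation": 3.7,
    "project_paper": 3.6,
}
print(round(weighted_grade(example), 2))  # 3.73
```

Because the daily problem sets and discussion carry 45% of the weight, a change of 0.2 on that component moves the overall result by 0.09, almost twice the effect of the same change on the final paper (0.05).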
Schedule
Monday March 30: Course Overview
There are no readings or assignments for the first day. If you have time, please read the syllabus and come to class with any questions you have about the course or expectations.
Tuesday March 31: Student Information Sheet Due
Please fill out the student information sheet by 11:59pm. Completing this form is required. If you do not submit this form, we'll assume you are not interested in taking the class and will drop you.
Wednesday April 1: Introducing Text as Data
Resources
Required Readings:
- Grimmer, Justin, and Brandon M. Stewart. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028. [Available from UW libraries]
- Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, et al. 2011. "Quantitative Analysis of Culture Using Millions of Digitized Books." Science 331 (6014): 176–82. https://doi.org/10.1126/science.1199644. [Available from UW libraries]
- Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. "Twitter Mood Predicts the Stock Market." Journal of Computational Science 2 (1): 1–8. https://doi.org/10.1016/j.jocs.2010.12.007. [Available free online]
Optional Readings:
- Grimmer, Roberts, and Stewart: Ch. 1–2 (pp. 3–32)
- Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. "Text as Data." Journal of Economic Literature 57 (3): 535–74. https://doi.org/10.1257/jel.20181020. [Available from UW libraries]
Monday April 6: Representing Text — Understanding
Resources
Required Readings:
- Grimmer, Roberts, and Stewart: Ch. 5–7, 9 (pp. 48–77, 90–98)
- Sparck Jones, Karen. 1972. "A Statistical Interpretation of Term Specificity and Its Application in Retrieval." Journal of Documentation 28 (1): 11–21. https://doi.org/10.1108/eb026526. [Available free online]
- Argamon, Shlomo, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. 2003. "Gender, Genre, and Writing Style in Formal Written Texts." Text & Talk 23 (3): 321–346. https://doi.org/10.1515/text.2003.014. [Available from UW libraries]
- Jaros, Kyle, and Jennifer Pan. 2018. "China's Newsmakers: Official Media Coverage and Political Shifts in the Xi Jinping Era." The China Quarterly 233: 111–136. https://doi.org/10.1017/S0305741017001679. [Available from UW libraries]
Optional Readings:
- Voigt, Rob, Nicholas P. Camp, Vinodkumar Prabhakaran, William L. Hamilton, Rebecca C. Hetey, Camilla M. Griffiths, David Jurgens, Dan Jurafsky, and Jennifer L. Eberhardt. 2017. "Language from Police Body Camera Footage Shows Racial Disparities in Officer Respect." Proceedings of the National Academy of Sciences 114 (25): 6521–6526. https://doi.org/10.1073/pnas.1702413114. [Available from UW libraries]
- Hughes, James M., Nicholas J. Foti, David C. Krakauer, and Daniel N. Rockmore. 2012. "Quantitative Patterns of Stylistic Influence in the Evolution of Literature." Proceedings of the National Academy of Sciences 109 (20): 7682–7686. https://doi.org/10.1073/pnas.1115407109. [Available from UW libraries]
Wednesday April 8: Representing Text — Applying
Resources
Monday April 13: Dictionary Methods & Sentiment Analysis — Understanding
Resources
Required Readings:
- Grimmer, Roberts, and Stewart: Ch. 15–16 (pp. 173–183)
- Tausczik, Yla R., and James W. Pennebaker. 2010. "The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods." Journal of Language and Social Psychology 29 (1): 24–54. https://doi.org/10.1177/0261927X09351676. [Available from UW libraries]
- Hancock, Jeffrey T., Lauren E. Curry, Saurabh Goorha, and Michael Woodworth. 2007. "On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication." Discourse Processes 45 (1): 1–23. https://doi.org/10.1080/01638530701739181. [Available from UW libraries]
- Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. 2011. "Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter." PLOS ONE 6 (12): e26752. https://doi.org/10.1371/journal.pone.0026752. [Available free online]
Optional Readings:
- Loughran, Tim, and Bill McDonald. 2011. "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks." The Journal of Finance 66 (1): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x. [Available from UW libraries]
- Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. "Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks." Proceedings of the National Academy of Sciences 111 (24): 8788–90. [Available from UW libraries]
Wednesday April 15: Dictionary Methods & Sentiment Analysis — Applying
Resources
Friday April 17: #Final Project Identification Due
See #Final Project Identification for details and the Canvas dropbox link.
Monday April 20: Word Embeddings — Understanding
Resources
Required Readings:
- Grimmer, Roberts, and Stewart: Ch. 8 (pp. 78–89)
- Rodriguez, Pedro L., and Arthur Spirling. 2022. "Word Embeddings: What Works, What Doesn't, and How to Tell the Difference for Applied Research." The Journal of Politics 84 (1): 101–15. https://doi.org/10.1086/715162. [Available from UW libraries]
- Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. "The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings." American Sociological Review 84 (5): 905–49. https://doi.org/10.1177/0003122419877135. [Available from UW libraries]
- Nelson, Laura K. 2021. "Leveraging the Alignment between Machine Learning and Intersectionality: Using Word Embeddings to Measure Intersectional Experiences of the Nineteenth Century U.S. South." Poetics 88 (October): 101539. https://doi.org/10.1016/j.poetic.2021.101539. [Available free online]
- Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. "Semantics Derived Automatically from Language Corpora Contain Human-like Biases." Science 356 (6334): 183–86. https://doi.org/10.1126/science.aal4230. [Available from UW libraries]
Optional Readings:
- Stoltz, Dustin S., and Marshall A. Taylor. 2021. "Cultural Cartography with Word Embeddings." Poetics 88 (October): 101567. https://doi.org/10.1016/j.poetic.2021.101567. [Available from UW libraries]
- Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1489–1501. https://aclanthology.org/P16-1141.pdf. [Available free online]
- Best, Rachel Kahn, and Alina Arseniev-Koehler. 2023. "The Stigma of Diseases: Unequal Burden, Uneven Decline." American Sociological Review 88 (5): 938–69. https://doi.org/10.1177/00031224231197436. [Available from UW libraries]
- Waller, Isaac, and Ashton Anderson. 2021. "Quantifying Social Organization and Political Polarization in Online Platforms." Nature 600 (7888): 264–68. https://doi.org/10.1038/s41586-021-04167-x. [Available from UW libraries]
Wednesday April 22: Word Embeddings — Applying
Resources
Monday April 27: Unsupervised and Inductive Methods — Understanding
Resources
Required Readings:
- Grimmer, Roberts, and Stewart: Ch. 10, 12–13 (pp. 103–109, 123–160)
- DiMaggio, Paul, Manish Nag, and David Blei. 2013. "Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding." Poetics 41 (6): 570–606. https://doi.org/10.1016/j.poetic.2013.08.004. [Available free online]
- Hansen, Stephen, Michael McMahon, and Andrea Prat. 2018. "Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach." The Quarterly Journal of Economics 133 (2): 801–870. https://doi.org/10.1093/qje/qjx045. Required: §I and §IV (pp. 801–807, 815–831); reading the full paper is encouraged. [Available from UW libraries]
- Tvinnereim, Endre, and Kjersti Fløttum. 2015. "Explaining Topic Prevalence in Answers to Open-Ended Survey Questions about Climate Change." Nature Climate Change 5 (8): 744–747. https://doi.org/10.1038/nclimate2663. [Available from UW libraries]
Optional Readings:
- Grimmer, Roberts, and Stewart: Ch. 11, 14 (pp. 111–121, 162–169)
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3: 993–1022. https://jmlr.csail.mit.edu/papers/v3/blei03a.html. [Available free online]
- Baumer, Eric P. S., David Mimno, Shion Guha, Emily Quan, and Geri K. Gay. 2017. "Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence?" Journal of the Association for Information Science and Technology 68 (6): 1397–1410. https://doi.org/10.1002/asi.23786. [Available from UW libraries]
- Jockers, Matthew L., and David Mimno. 2013. "Significant Themes in 19th-Century Literature." Poetics 41 (6): 750–769. https://doi.org/10.1016/j.poetic.2013.08.005. [Available free online]
Wednesday April 29: Unsupervised and Inductive Methods — Applying
Resources
Monday May 4: Supervised Classification — Understanding
Resources
Required Readings:
- Grimmer, Roberts, and Stewart: Ch. 17–20 (pp. 189–218)
- Voigt, Rob, Nicholas P. Camp, Vinodkumar Prabhakaran, William L. Hamilton, Rebecca C. Hetey, Camilla M. Griffiths, David Jurgens, Dan Jurafsky, and Jennifer L. Eberhardt. 2017. "Language from Police Body Camera Footage Shows Racial Disparities in Officer Respect." Proceedings of the National Academy of Sciences 114 (25): 6521–26. https://doi.org/10.1073/pnas.1702413114. [Available free online]
- Vosoughi, Soroush, Deb Roy, and Sinan Aral. 2018. "The Spread of True and False News Online." Science 359 (6380): 1146–51. https://doi.org/10.1126/science.aap9559. [Available from UW libraries]
- TeBlunthuis, Nathan, Valerie Hase, and Chung-Hong Chan. 2024. "Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!" Communication Methods and Measures 18 (3): 278–99. https://doi.org/10.1080/19312458.2023.2293713. [Available from UW libraries]
Optional Readings:
- Hopkins, Daniel J., and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science 54 (1): 229–47. https://doi.org/10.1111/j.1540-5907.2009.00428.x. [Available from UW libraries]
- Wulczyn, Ellery, Nithum Thain, and Lucas Dixon. 2017. "Ex Machina: Personal Attacks Seen at Scale." In Proceedings of the 26th International Conference on World Wide Web, 1391–99. https://doi.org/10.1145/3038912.3052591. [Available free online]
- Davidson, Thomas, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." Proceedings of the International AAAI Conference on Web and Social Media 11 (1): 512–15. https://doi.org/10.1609/icwsm.v11i1.14955. [Available free online]
Wednesday May 6: Supervised Classification — Applying
Resources
Monday May 11: Large Language Models & Transfer Learning — Understanding
Resources
Required Readings:
- Wankmüller, Sandra. 2024. "Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis." Sociological Methods & Research 53 (4): 1676–1752. https://doi.org/10.1177/00491241221134527. [Available free online]
- Ziems, Caleb, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2024. "Can Large Language Models Transform Computational Social Science?" Computational Linguistics 50 (1): 237–91. https://doi.org/10.1162/coli_a_00502. [Available free online]
- Hanley, Hans W. A., Yingdan Lu, and Jennifer Pan. 2025. "Across the Firewall: Foreign Media's Role in Shaping Chinese Social Media Narratives on the Russo-Ukrainian War." Proceedings of the National Academy of Sciences 122 (1): e2420607122. https://doi.org/10.1073/pnas.2420607122. [Available free online] [Note: Please read both the paper and the full supplementary materials.]
Optional Readings:
- Bail, Christopher A. 2024. "Can Generative AI Improve Social Science?" Proceedings of the National Academy of Sciences 121 (21): e2314021121. https://doi.org/10.1073/pnas.2314021121. [Available free online]
- Spirling, Arthur. 2023. "Why Open-Source Generative AI Models Are an Ethical Way Forward for Science." Nature 616 (7957): 413. https://doi.org/10.1038/d41586-023-01295-4. [Available from UW libraries]
- Abdurahman, Suhaib, Alireza Salkhordeh Ziabari, Alexander K. Moore, Daniel M. Bartels, and Morteza Dehghani. 2025. "A Primer for Evaluating Large Language Models in Social-Science Research." Advances in Methods and Practices in Psychological Science 8 (2). https://doi.org/10.1177/25152459251325174. [Available from UW libraries]
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423. [Available free online]
- Grootendorst, Maarten. 2022. "BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure." arXiv:2203.05794. [Available free online]
Wednesday May 13: Large Language Models & Transfer Learning — Applying
Resources
Friday May 15: #Outline / Draft Due
See #Outline / Draft for details and the Canvas dropbox link.
Monday May 18: No Class — Instructor Traveling
Instructor traveling. No class.
Wednesday May 20: Causal Inference — Understanding
Resources
Required Readings:
- Kleinberg, Jon, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. 2015. "Prediction Policy Problems." The American Economic Review 105 (5): 491–495. [Available free online]
- Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2022. "How to Make Causal Inferences Using Texts." Science Advances 8 (42): eabg2652. https://doi.org/10.1126/sciadv.abg2652. [Available free online]
- Keith, Katherine, David Jensen, and Brendan O'Connor. 2020. "Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.474. [Available free online]
- Pryzant, Reid, Dallas Card, Dan Jurafsky, Victor Veitch, and Dhanya Sridhar. 2021. "Causal Effects of Linguistic Properties." In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.323. [Available free online]
- Betti, Lorenzo, Paolo Bajardi, and Gianmarco De Francisci Morales. 2025. "Moral Judgments in Online Discourse Are Not Biased by Gender." Scientific Reports 15 (1): 21555. https://doi.org/10.1038/s41598-025-08749-x. [Available free online]
Optional Readings:
- Veitch, Victor, Dhanya Sridhar, and David Blei. 2020. "Adapting Text Embeddings for Causal Inference." In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), 919–928. [Available free online]
- Roberts, Margaret E., Brandon M. Stewart, and Richard A. Nielsen. 2020. "Adjusting for Confounding with Text Matching." American Journal of Political Science 64 (4): 887–903. https://doi.org/10.1111/ajps.12526. [Available from UW libraries]
- Modarressi, Iman, Jann Spiess, and Amar Venugopal. 2025. "Causal Inference on Outcomes Learned from Text." arXiv:2503.00725. [Available free online]
- Wood-Doughty, Zach, Ilya Shpitser, and Mark Dredze. 2018. "Challenges of Using Text Classifiers for Causal Inference." In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1488. [Available free online]
- Mullainathan, Sendhil, and Jann Spiess. 2017. "Machine Learning: An Applied Econometric Approach." Journal of Economic Perspectives 31 (2): 87–106. https://doi.org/10.1257/jep.31.2.87. [Available free online]
Monday May 25: No Class — Memorial Day
University holiday. No class.
Wednesday May 27: Causal Inference — Applying
Resources
Monday June 1: Final Presentations
The entire class will be devoted to final presentations.
See #Final Presentation for expectations.
Wednesday June 3: Final Presentations
The entire class will be devoted to final presentations.
See #Final Presentation for expectations.
Friday June 12: #Final Paper Due
See #Final Paper for details and the Canvas dropbox link.
Administrative Notes
Office Hours
- Eddie Hock (TA)
- Regular office hours are Tuesdays, 1:30–3:00pm, Savery Hall (SAV) 216-D. Eddie is particularly focused on R for coding questions, but can field general questions. Eddie is also available for meetings outside of regular office hours — reach out to him by email to arrange a time.
- Benjamin Mako Hill (Instructor)
- Regular office hours are Thursdays, 4:30–6:00pm, Communications Building (CMU) 333. Mako is particularly focused on Python for coding questions, but can help with R as well. Mako is also available for 30-minute meetings outside of regular office hours. You can view his calendar and/or schedule a meeting directly. If you schedule a meeting, we'll meet in the Jitsi room (makooffice). You will get a link to the room via the scheduling system, but you can also find it at https://meet.jit.si.
Both Eddie and Mako will also idle in the "Office Hours" voice channel on Discord (which also supports video and screen sharing). You're welcome to join that way remotely if it's more convenient.
- NOTE
- For Memorial Day week, Mako will hold office hours during Eddie's usual slot (Tuesday, 1:30–3:00pm), and Eddie will hold office hours on Friday of that week, 1:30–3:00pm.
Religious Accommodations
Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW's policy, including more information about how to request an accommodation, is available at Religious Accommodations Policy. Accommodations must be requested within the first two weeks of this course using the Religious Accommodations Request form.
Student Conduct
The University of Washington Student Conduct Code (WAC 478-121) defines prohibited academic and behavioral conduct and describes how the University holds students accountable as they pursue their academic goals. Allegations of misconduct by students may be referred to the appropriate campus office for investigation and resolution. More information can be found online at https://www.washington.edu/studentconduct/
Call SafeCampus at 206-685-7233 anytime, no matter where you work or study, to anonymously discuss safety and well-being concerns for yourself or others. SafeCampus's team of caring professionals will provide individualized support, discuss short- and long-term solutions, and connect you with additional resources when requested.
Use of AI Tools
For the Applying days on the syllabus, you will be asked to write code in Python or R. As I said in the section on #Daily Problem Sets, you are welcome, and in fact encouraged, to use AI coding tools for these assignments. There is a range of these out there, including Claude Code, Microsoft Copilot, and so on. You can also simply paste things into chatbots like Google Gemini or OpenAI's ChatGPT. There is no requirement to use these tools, but I strongly suspect they are the future of coding, and I think it's probably a good idea to start building familiarity with them.
That said, your coding agents are making fundamental decisions that will not only affect your analysis—they will be your analysis. As the person who will publish work done by coding agents in the future, you will be responsible for every single line that goes into your analysis. The same will be true for your class projects.
As part of ensuring that you understand your code, I will ask you to explain it in front of the class. That means you need to be able to read and understand every single line of code produced. This fluency comes from experience reading code. Luckily, coding agents and AI chatbots are pretty good at explaining code. If an AI tool gives you code you do not understand, you can prompt it to rewrite the code in a different or simpler way or you can ask it what the code does and why it works the way it does. Remember, I'm going to ask you to explain your code in class as part of the #Daily Problem Sets. If you have any doubts about your ability to do this, you need to spend more time studying "your" code to build your confidence.
Unless otherwise noted, everything other than code that you produce for this course must be your own work. Using generative AI outside of coding tasks for anything turned in to the course will be considered academic misconduct and subject to investigation.
If you have any questions about what constitutes academic integrity in this course or at the University of Washington, please contact me to discuss your concerns.
Please note that I do not consider grammar/spellchecking a prohibited use of AI.
- Text adapted from: UW sample syllabus statements.
Academic Dishonesty
Academic dishonesty includes cheating on assignments, plagiarizing (misrepresenting work by another author as your own, paraphrasing or quoting sources without acknowledging the original author, or using information from the internet without proper citation), and submitting the same or similar paper to meet the requirements of more than one course without instructor approval. Academic dishonesty in any part of this course is grounds for failure and further disciplinary action. The first incident of plagiarism will result in the student receiving a zero on the plagiarized assignment. The second incident of plagiarism will result in the student receiving a zero in the class.
Disability Resources
If you have already established accommodations with Disability Resources for Students (DRS), please communicate your approved accommodations through their processes at your earliest convenience so we can discuss your needs in this course.
If you have not yet established services through DRS, but have a temporary health condition or permanent disability that requires accommodations (conditions include, but are not limited to, mental health, attention-related, learning, vision, hearing, physical, or health impacts), you are welcome to contact DRS at 206-543-8924, uwdrs@uw.edu, or disability.uw.edu. DRS offers resources and coordinates reasonable accommodations for students with disabilities and/or temporary health conditions. Reasonable accommodations are established through an interactive process between you, your instructor(s), and DRS. It is the policy and practice of the University of Washington to create inclusive and accessible learning environments consistent with federal and state law.
Mental Health
Your mental health is important. If you are feeling distressed, anxious, depressed, or in any way struggling with your emotional and psychological wellness, please know that you are not alone. Graduate school can be a profoundly difficult time for many of us.
Resources are available for you:
- UW 24/7 Help Line 1.866.775.0608
- https://wellbeing.uw.edu/topic/mental-health/
- https://www.crisistextline.org/
Other Student Support
Any student who has difficulty affording groceries or accessing sufficient food to eat every day, or who lacks a safe and stable place to live, and believes this may affect their performance in the course is urged to contact the graduate program advisor for support. Furthermore, please notify me if you are comfortable doing so. This will enable me to provide any resources that I may possess (adapted from Sara Goldrick-Rab). Please also note the student food pantry, Any Hungry Husky, at the ECC.
Credit and Notes
This is the first time I have taught this course at UW. The structure and some content of this syllabus are adapted from a previous version of this course taught by John D. Wilkerson at the University of Washington. The structure, assignment design, administrative boilerplate, and other elements draw from several of my previous courses:
- Designing Internet Research (Spring 2025)
- Building Successful Online Communities (Fall 2025)
- Statistics and Statistical Programming (Winter 2021)
