Statistics and Statistical Programming (Winter 2017): Difference between revisions

From CommunityData
(copy material over from COM528)
 
 
(239 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<div style="float:right;" class="toclimit-2">__TOC__</div>
<div style="float:right;" class="toclimit-2">__TOC__</div>


:'''Designing Internet Research'''
:'''Advanced Statistical Methods in Communication: Statistics and Statistical Programming'''
:'''COM521''' - Department of Communication, University of Washington
:'''COM521''' - Department of Communication, University of Washington
:'''Instructor:''' [http://mako.cc/academic/ Benjamin Mako Hill] ([http://www.com.washington.edu/hill/ University of Washington])
:'''Instructor:''' [http://mako.cc/academic/ Benjamin Mako Hill] ([http://www.com.washington.edu/hill/ University of Washington])
:'''Course Websites''':
:'''Course Websites''':
:* We will use Canvas for [https://canvas.uw.edu/courses/1124086/announcements announcements], [https://canvas.uw.edu/courses/1124086/assignments turning in assignments], and [https://canvas.uw.edu/courses/1124086/discussion_topics discussion] (if you choose to use them)
:* We will use Canvas for [https://canvas.uw.edu/courses/1098035/announcements announcements], [https://canvas.uw.edu/courses/1098035/assignments turning in assignments], and [https://canvas.uw.edu/courses/1098035/discussion_topics discussion] (if you choose to use them)
:* Everything else will be linked on this page.
:* Everything else will be linked on this page.
:'''Course Catalog Description:'''
:* [[Statistics and Statistical Programming (Winter 2017)/List of student git repositories]]
:'''Course Catalog Description:[https://www.washington.edu/students/crscat/com.html#com521]'''


::Discusses complexities in quantitative research on communication. Focus on multivariate data design and analysis, including multiple and logistic regression, ANOVA and MANOVA, and factor analysis.
::Discusses complexities in quantitative research on communication. Focus on multivariate data design and analysis, including multiple and logistic regression, ANOVA and MANOVA, and factor analysis.
Line 13: Line 14:
== Overview and Learning Objectives ==
== Overview and Learning Objectives ==


What new lines of inquiry and approaches to social research are made possible and necessary by the Internet? In what ways have established research methods been affected by the Internet? How does the Internet challenge established methods of social research? How are researchers responding to these challenges?
This course is the second course in a two-quarter quantitative methods sequence in the University of Washington's Department of Communication MA/PhD program. The first course (COM 520) is an introduction to quantitative social science in communication and focuses primarily on what you might think of as the "soft skills" associated with doing social science: the conceptualization and operationalization of quantifiable variables and the design of quantitative analyses. That course introduces some univariate and bivariate statistics at the end and briefly touches on linear regression. That said, all of the statistical work in that course is done using tools that students already know (e.g., spreadsheet software like LibreOffice, Google Sheets, or Microsoft Excel). This class assumes that students have taken COM 520, that they understand what is involved in describing and testing social scientific theories with data, and that the basic terminology of quantitative social science is familiar.


These are some of the key questions we will explore in this course. The course will focus on assessing the incorporation of Internet tools in established and emergent methods of social research, the adaptation of social research methods to study online phenomena, and the development of new methods and tools that correspond with the particular capacities and characteristics of the Internet. The readings will include both descriptions of Internet-related research methods with an eye to introducing skills and examples of studies that use them. The legal and ethical aspects of Internet research will receive ongoing consideration throughout the course. The purpose of this course is to help prepare students to design high quality research projects that use the Internet to study online communicative, social, cultural, and political phenomena.
This course (COM 521) is focused on technical skill-building and aims to be a get-your-hands-dirty introduction to statistics and statistical programming. The point of the course is to give you the mathematical and technical tools to carry out your own statistical analyses. Through the process, we're going to try to help you become more sophisticated consumers of quantitative research.
 
Although we'll be doing some math in the course, this is not a math class. I am going to assume you're familiar with basic algebra and arithmetic. This course will not require knowledge of calculus. In general, we're not going to cover the math behind the techniques we'll be covering. Unlike many statistics classes, I'm definitely not going to be doing proofs on the board. Instead, the class is unapologetically focused on ''the application of statistical methodology''. In that sense, the goal of this course is to create ''informed consumers'' of quantitative methodology, not producers of new types of methods. My goal is to train producers of social scientific research that use statistics as a means toward an end.
 
This course does not seek to be the last stats class you take. I started grad school having not taken a math class since high school (basically) and took 12 different statistics and math courses over the course of my time in graduate school. Honestly, I wish I had done more. What this class seeks to do is give you a solid basis on which to build statistical knowledge. Anyone who finishes this class should feel comfortable moving on to take advanced classes in CSSS (classes above 510 on [https://www.csss.washington.edu/academics/courses this list]) and to start building toward a [https://www.csss.washington.edu/academics/phd-tracks/communication Statistics Concentration in the Department of Communication MA/PhD Program] or a [https://www.csss.washington.edu/academics/phd-tracks similar CSSS certificate/track] in another department.
 
We'll cover these basic statistical techniques: t-tests; chi-squared tests; ANOVA, MANOVA, and related methods; linear regression; and, at the end, logistic regression.


I will consider the course a complete success if every student is able to do all of these things at the end of the quarter:
I will consider the course a complete success if every student is able to do all of these things at the end of the quarter:


* Discuss and compare distinct types of Internet research including: web archiving; textual analysis; ethnography; interviews; network analyses of social and hyperlink networks; analysis of digital trace data; traditional, natural, and field experiments; design research; survey research; and narrative and visual analyses.
* Carry out a complete analysis of a quantitative research project, start to finish.
* Describe particular challenges and threats to research validity associated with each method.
* Read, modify, and create short programs in the GNU R statistical programming language.
* For at least one method, be able to provide a detailed description of a research project and feel comfortable embarking on a formative study using this methodology.
* Feel comfortable reading papers that use basic statistical techniques.
* Given a manuscript (e.g., in the context of a request for peer review), be able to evaluate an Internet-based study in terms of its methodological choices.
* Feel comfortable and prepared enrolling in future statistics courses in CSSS.
* Use a modern programming language (e.g., Python) to collect a dataset from a web API like the APIs from Twitter and Wikipedia.


== Note About This Syllabus ==
== Why Statistical Programming? ==


You should expect this syllabus to be a dynamic document and you will notice that there are a few places marked "To Be Determined." Although the core expectations for this class are fixed, the details of readings and assignments will shift. As a result, there are three important things to keep in mind:
This class will focus much more on statistical programming in R than most similar classes. Most similar classes in communication focus on an easier-to-use statistical package like SPSS.


# Although details on this syllabus will change, I will not change readings or assignments less than one week before they are due. If I don't fill in a "To Be Determined" one week before it's due, it is dropped. If you plan to read more than one week ahead, contact me first.
We're focusing on programming instead of a package like SPSS for several reasons:
# Closely monitor your email or [https://canvas.uw.edu/courses/1124086/announcements the announcements section on the course website on Canvas]. When I make changes, these changes will be recorded in [http://wiki.communitydata.cc/index.php?title=Internet_Research_Methods_%28Spring_2016%29&action=history the history of this page] so that you can track what has changed and I will summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
# I will ask the class for voluntary anonymous feedback frequently — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback.


== Books ==
* Students who understand a programming language won't be limited to the "canned" functions in off-the-shelf packages.
* Pedagogically, programming supports students in building a deeper understanding of the mathematics and assumptions behind the canned functions by both allowing them to read the code "behind" the canned functions and by allowing the students to implement the functions themselves in assignments.
* Analyses composed of code instead of clicks support reproducible research: code can document every step of an analysis, including data cleaning and conversion, where errors are common and very difficult to detect.
* Programming is a skill that is in demand in our department and in our discipline more generally, and I strongly believe it is generally useful.


This class has no textbook and I am not requiring you to buy any books for this class. That said, several required readings, and many suggested readings, will come from several excellent books which you might want to consider adding to your library:
Of course, there are other programming languages well suited to statistics, including Stata and Python. Ultimately, I'm teaching R because the faculty in the department who seemed most likely to teach in this sequence going forward got together and reached a consensus that R made the most sense.


These books include:
Our reasoning was that:


== Assignments ==
* R is freely available and open source
* R is becoming the most widely used package in statistical fields and is (by our estimate) used by most academics in my cohort or later in statistics, political science, and economics already.
* R is the system (along with Stata) that will be used in the other advanced CSSS stats classes we hope students will continue to take after COM521.
* R is a better general-purpose programming language than software like Stata, which means that R programming skills will let students solve non-statistical problems like collecting data from the web and will make it easier to learn other programming languages.


The assignments in this class are designed to give you an opportunity to try your hand at using the conceptual material taught in the class. There will be no exams or quizzes. Unless otherwise noted, all assignments are due at the end of the day (i.e., 11:59pm on the day they are due).
For students with a strong psychometric focus, or whose research will be limited to linear and logistic regression or ANOVA on small pre-collected datasets and similar, SPSS will likely be fine. R has a higher barrier to entry than SPSS but its ceiling is ''much'' higher.


=== Research Project ===
== Note About This Syllabus ==


As a demonstration of your learning in this course, you will design a plan for an internet research project and will, if possible, also collect (at least) an initial sample of a dataset that you will use to complete the project.
You should expect this syllabus to be a dynamic document and you will notice that there are a few places marked "To Be Determined." Although the core expectations for this class are fixed, the details of readings and assignments will shift. As a result, there are three important things to keep in mind:


The genre of the paper you produce can be one of the following three things:
# Although details on this syllabus will change, I will not change readings or assignments less than one week before they are due. If I don't fill in a "To Be Determined" one week before it's due, it is dropped. If you plan to read more than one week ahead, contact me first.
# Closely monitor your email or [https://canvas.uw.edu/courses/1098035/announcements the announcements section on the course website on Canvas]. When I make changes, these changes will be recorded in [http://wiki.communitydata.cc/index.php?title=Statistics_and_Statistical_Programming_(Winter_2017)&action=history the history of this page] so that you can track what has changed and I will summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
# I will ask the class for voluntary anonymous feedback frequently — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback.


# A draft of a manuscript for submission to a conference or journal.
== Books and Resources ==
# A proposal for funding (e.g., for submission for the NSF for a graduate student fellowship).
# A draft of the methods chapter of your dissertation.


In any of the three paths, I expect you to take this opportunity to produce a document that will further your academic career outside of the class.
Although I've never taught with a textbook in a proper sense, statistics is very well covered terrain and, as a result, there is an enormous amount of excellent curricular material out there I think we would be wise to build from. As a result, this class is going to use two textbooks:


==== Project Identification ====
* Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. 2015. ''OpenIntro Statistics''. 3rd edition. OpenIntro, Inc. ([https://www.openintro.org/download.php?file=os3&referrer=/stat/textbook.php PDF]; [https://www.openintro.org/download.php?file=os3_tablet&referrer=/stat/textbook.php Tablet-friendly PDF]; [https://www.openintro.org/stat/textbook.php Other])
* Verzani, John. 2014. ''Using R for Introductory Statistics, Second Edition''. 2 edition. Boca Raton: Chapman and Hall/CRC. ([https://en.wikipedia.org/wiki/Special:BookSources/978-1-4665-9073-1 Various Sources]; [https://www.amazon.com/Using-Introductory-Statistics-Second-Chapman/dp/1466590734/ref=mt_hardcover?_encoding=UTF8&me= Amazon])


;Due Date: April 10
Diez, Barr, and Çetinkaya-Rundel's book is a free, and freely-licensed, online statistics textbook. Over the last seven years, the book has also developed a large online community of students and teachers who have shared other resources. The book, lecture notes, and more are all freely licensed, which has allowed the text to be adapted in a series of different fields. The book is excellent and it has been adopted extraordinarily widely. You can buy versions from Amazon in either [https://www.openintro.org/redirect.php?go=amazon_os3_hardcover&referrer=/stat/textbook.php full color hardcover] ($19.99) or in [https://www.openintro.org/redirect.php?go=createspace_os3&referrer=/stat/textbook.php black and white paperback] ($7.60). I haven't purchased a paper copy so I can't speak to the quality of either.
;Maximum paper length: 500 words (~1-2 page)
;Deliverables: Turn in in Canvas


Early on, I want you to identify your final project. Your proposal should be short and can be either paragraphs or bullets. It should include the following things:
Verzani's book is an introduction to the R programming language. It's designed to be used as a companion to a basic introductory statistics textbook (like OpenIntro). It's a poor stand-alone text but it will provide good resources for the material we're covering in the course and it should act as a good reference going forward. The book is available online for about $50.


* The genre of the project and a short description of how it fits into your career trajectory.
Although they're not required for the course, I want to point you to these two books. When I was learning R, both were very useful references:
* A one paragraph abstract of the proposed study and research question, theory, community, and/or groups you plan to study.
* A short description of the type of data you plan to collect as part of your final project.


==== Final Project ====
* Teetor, Paul. 2011. ''R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics''. 1 edition. Sebastopol, CA: O’Reilly Media. ([http://proquest.safaribooksonline.com/9780596809287 Safari Proquest/UW Libraries]; [https://en.wikipedia.org/wiki/Special:BookSources/978-0-596-80915-7 Various Sources]; [https://www.amazon.com/Cookbook-Analysis-Statistics-Graphics-Cookbooks/dp/0596809158/ref=sr_1_1?ie=UTF8&qid=1482802812&sr=8-1&keywords=r+cookbook Amazon])
* Wickham, Hadley. 2010. ''ggplot2: Elegant Graphics for Data Analysis''. 1st ed. 2009. Corr. 3rd printing 2010 edition. New York: Springer. ([https://link.springer.com/book/10.1007%2F978-3-319-24277-4 Springer/UW Libraries]; [https://en.wikipedia.org/wiki/Special:BookSources/978-0-596-80915-7 Various Sources])


;Outline Due Date: May 8
There are also two non-textbook resources I wanted to point you to that are invaluable:
;Maximum outline length: 2 pages
;Paper Due Date: June 12
;Maximum paper length: 6000 words (~20 pages)
;Presentation Date: June 2
;All Deliverables: Turn in in Canvas


Because the emphasis in this class is on methods and because I'm not an expert in each of your areas or fields, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is so important. Instead of providing all of these details, feel free to start with a brief summary of the purpose and importance of this research, and an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts.
* [ftp://cran.r-project.org/pub/R/doc/contrib/Baggott-refcard-v2.pdf Baggott's R Reference Card v2] — When I was learning R, I ''literally'' took a similar reference card with me everywhere and looked at it dozens of times a day.
* [https://stackoverflow.com/questions/tagged/r StackOverflow R Tag] — Somebody already had your question about how to do ''X'' in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise.
* [http://rseek.org/ Rseek] — Rseek is a modified version of Google that searches only R websites. Sometimes, R is hard to search for because R is a common letter. This has become much easier over time as R has become more popular but it might still be the case sometimes and Rseek is a good solution.


The final paper should include:
== Assignments ==


* a statement of the purpose, central focus, relevance and significance of this research;
The assignments in this class are designed to give you an opportunity to try your hand at using the conceptual material taught in the class. There will be no exams or quizzes. Unless otherwise noted, all assignments are due at the end of the day (i.e., 11:59pm on the day they are due).
* a description of the specific Internet application(s) and/or environment(s) and/or objects to be studied and employed in the research;
* key research questions or hypotheses;
* operationalization of key concepts;
* a description and rationale of the specific method(s), (if more than one method will be used, explain how the methods will produce complementary findings);
* a description of the step-by-step plan for data collection;
* description and rationale of the level(s), unit(s) and process of analysis (if more than one kind of data are generated, explain how each kind will be analyzed individually and/or comparatively);
* an explanation of how these analyses will enable you to answer the RQs
* a sample instrument (as appropriate);
* a sample dataset and description of a formative analysis you have completed;
* a description of actual or anticipated results and any potential problems with their interpretation;
* a plan for publishing/disseminating the findings from this research
* a summary of technical, ethical, human subjects and legal issues that may be encountered in this research, and how you will address them;
* a schedule (using specific dates) and proposed budget.


I also expect each student to begin collecting data for your project (i.e., using the technical skills you learn in the class) and to describe your progress in this regard in your paper. If collecting data for a proposed project is impractical (e.g., because of IRB applications, funding, etc.), I would love for you to engage in the collection of a public dataset as part of a pilot or formative study. If this is not feasible or useful, we can discuss other options.
=== Weekly Problem Sets and Participation ===


I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.
Each week I will post a problem set with a list of questions. Some of these will be drawn from the textbooks and some will be ones I design or write. The questions will cover:


=== Participation ===
* '''Statistics questions''' — These will be questions about statistics from the OpenIntro sections as well as any empirical papers that are listed as required for that day.
* '''Programming challenges''' — These will be R programming problems that cover material from the Verzani text that was listed as required from the previous session.


The course relies heavily on participation and discussion. It is important to realize that we will not summarize reading in class and I will not cover it in lecture. I expect you all to have read it and we will jump in and start discussing it. The "Participation Rubric" section of [https://mako.cc/teaching/assessment.html my detailed page on assessment] gives the rubric I will use in evaluating participation.
I won't be grading these assignments and I won't be asking you to turn in anything for the ''statistics questions'' portion of the weekly assignment. That said, we will spend a good chunk of class each day going through the answers to the questions due on that day.


=== Grading ===
Because randomness is an extremely important concept in statistics, I will use a small R program to '''randomly cold call''' on students in the class to walk through your "answer" to each question and explain your reasoning to the class. We'll then have an opportunity to discuss the different approaches as a group. I don't promise to ask all of these questions in class (especially if it's clear that folks get the point). Although I might ask them, I won't cold call for questions that are not on the list.
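
As a rough illustration of what such a cold-calling program could look like (this is a minimal sketch, not the actual program used in class), the R code below draws one student at random for each question. The roster names and question labels are hypothetical placeholders.

```r
# Hypothetical roster and question list; these are placeholders,
# not the real class list or problem set.
roster <- c("Student A", "Student B", "Student C", "Student D")
questions <- c("Q1", "Q2", "Q3")

# Fixing the seed only makes this example repeatable; a real
# cold-calling script would probably leave it out.
set.seed(521)

# Draw one student per question. Sampling with replacement means
# the same student can be called on more than once.
called <- sample(roster, size = length(questions), replace = TRUE)

cold.calls <- data.frame(question = questions, student = called,
                         stringsAsFactors = FALSE)
print(cold.calls)
```

Sampling with replacement is an assumption here: it keeps every student eligible for every question. Using <code>replace = FALSE</code> would instead guarantee that nobody is called twice, as long as there are at least as many students as questions.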


I have put together a very detailed page that describes the [https://mako.cc/teaching/assessment.html grading rubric] I will be using in this course. Please read it carefully. I will assign grades for each of the following items on the UW 4.0 grade scale according to the weights below:
For the programming challenges, I will ask that everybody shares code for any solutions to programming problems before class so we can walk through in class. If you get completely stuck on a problem and cannot "solve" it, that's OK, but share the code that you do have so that you can walk us through what you did and what you were thinking.


* Participation: 25%
Although the problem sets are not going to be graded, it is critical that you be at class and that you be able to discuss your answers to each of the questions. Your ability to do these latter two things will be reflected in your participation grade for the course, which makes up a full 40% of your grade. I can't emphasize enough how important it will be to be in class.
* Presentation of method/approach: 15%
* Proposal identification: 5%
* Final paper outline: 5%
* Final Presentation: 10%
* Final Paper: 40%


== Schedule ==
I'm not going to form groups for you but it's totally fine with me if you want to work on these problem sets in small groups.
=== Week 1: Monday March 28: Introduction and Framing ===


'''Resources:'''
The "Participation Rubric" section of [https://mako.cc/teaching/assessment.html my page on assessment] gives the details on how I evaluate participation in my classes. If you sense a conflict between material in this section and material on that page, you can safely assume that the syllabus takes precedence.


* [https://canvas.uw.edu/files/35861026/download?download_frd=1 Week 1 Reading Note] — Read this first!
=== Research Project ===


'''Required Readings:'''
As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all:


* Agre, Philip, “[http://polaris.gseis.ucla.edu/pagre/research.html Internet Research: For and Against.]” in Mia Consalvo, Nancy Baym, Jeremy Hunsinger, Klaus Bruhn Jensen, John Logie, Monica Murero, and Leslie Regan Shade, eds, Internet Research Annual, Volume 1: Selected Papers from the Association of Internet Researchers Conferences 2000-2002, New York: Peter Lang, 2004. ''[Free Online]''
* '''Design and describe a social scientific study''' — You should all have experience doing this at least once in COM520. The study you design should involve quantitative analysis and should be something you can complete at least a first pass at over the course of this quarter.
* Sandvig, Christian, 2010, "[http://blogs.law.harvard.edu/niftyc/archives/277 Why the Internet is On the Verge of Blowing Up All Our Methods Courses]." ''[Free Online]''
* '''Find a dataset''' — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or a re-analysis of a previously collected dataset.
* Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., … Van Alstyne, M. (2009). [http://doi.org/10.1126/science.1167742 Computational Social Science.] Science, 323(5915), 721–723. ''[Available through UW Libraries]''
* '''Engage in descriptive data analysis''' — Use R to create descriptive statistics and visualization to describe your data.
* '''Test a hypothesis about relationships between two or more variables'''
* '''Report your findings''' — I'll expect you all to report your findings in both a short paper and a short presentation.
* '''Ensure replicability''' — I'll expect you all to provide code and data for your analysis in a way that makes your work replicable by other researchers.
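
To give a sense of scale, the descriptive-analysis and hypothesis-testing steps in the list above might look something like the following in R. This is only a sketch under stated assumptions: it uses <code>mtcars</code>, a dataset that ships with base R, as a stand-in for whatever dataset you choose, and the choice of a t-test is illustrative rather than a recommendation for your project.

```r
# Illustrative only: mtcars ships with base R and is not a course dataset.
data(mtcars)

# Descriptive statistics for the outcome variable (miles per gallon).
print(summary(mtcars$mpg))

# Group means by transmission type (am: 0 = automatic, 1 = manual).
group.means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group.means)

# A simple visualization of the outcome's distribution.
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Welch two-sample t-test (R's default for t.test) of whether mean mpg
# differs between the two transmission types.
result <- t.test(mpg ~ am, data = mtcars)
print(result)
```

Keeping every step in a script like this, and sharing it alongside the data, is one way to approach the replicability expectation: another researcher can rerun the file and see each step from raw data to reported test.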


'''Optional Reading:'''
Although it's not required, I ''strongly urge each of you'' to take this opportunity to produce a document that will further your academic career outside of the class. There are many ways that this can happen but the obvious ones are that the paper is something you can submit for publication to a journal or conference, that it provides preliminary analysis for, or acts as a pilot analysis that you can report in, a grant proposal or thesis proposal, and/or that it serves as part of your master's thesis or dissertation.


* Gane, Nicholas, and Beer, David, 2008, "[https://canvas.uw.edu/files/35860995/download?download_frd=1 Introduction: Concepts and Media]." from New Media: The Key Concepts, Berg, pp. 1-13. ''[Available in Canvas]''
==== Project and Dataset Identification ====
* Bruhn Jensen, Klaus, 2011, "[https://canvas.uw.edu/files/35860992/download?download_frd=1 New media, old methods — Internet methodologies and the online/offline divide]," in Consalvo & Ess (Eds.), The Handbook of Internet Studies, Blackwell, pp. 43-58. ''[Available in Canvas]''
* Hesse-Biber, Sharlene Nagy, "[https://canvas.uw.edu/files/35860993/download?download_frd=1 Emergent Technologies in Social Research: Pushing Against the Boundaries of Research Praxis]," [HET], pp. 3-24. ''[Available in Canvas]''
* December, John. (March, 1996). "[http://onlinelibrary.wiley.com/enhanced/doi/10.1111/j.1083-6101.1996.tb00173.x/ Units of Analysis for Internet Communication.]" Journal of Computer-Mediated Communication, V.1, N.4. ''[Available through UW libraries]''
* Steven M. Schneider & Kirsten A. Foot, "[http://people.sunyit.edu/~steve/papers/schneider-foot-webasobject-20030826.pdf The Web as an Object of Study]." New Media and Society, V. 6, N.1, 114-122, 2004. ''[Free Online]''
* Gunkel, David, "[https://canvas.uw.edu/files/35860991/download?download_frd=1 To Tell the Truth: The Internet and Emergent Epistemological Challenges in Social Research]," [HET], pp. 47-64. ''[Available in Canvas]''
* Baym, Nancy. (2006). "[https://canvas.uw.edu/files/35860990/download?download_frd=1 Finding the Quality in Qualitative Internet Research]," in Critical Cyberculture Studies, David Silver and Adrienne Massanari, eds., New York University Press, NY. pp. 79-87. ''[Available in Canvas]''
* Hackett, Edward, "[https://canvas.uw.edu/files/35861007/download?download_frd=1 Possible dreams: Research technologies and the transformation of the human sciences]," Ch 1 in HET. ''[Available in Canvas]''


=== Week 1: Wednesday March 30: Ethics ===
;Due Date: January 17
;Maximum paper length: 500 words (~1-2 page)
;Deliverables: Turn in in Canvas


'''Resources:'''
Early on, I want you to identify and describe your final project. Your proposal should be short and can be either paragraphs or bullets. It should include the following things:


* [https://canvas.uw.edu/files/35861026/download?download_frd=1 Week 1 Reading Note] — Read this first!
* A one paragraph abstract of the proposed study and research question, theory, community, and/or groups you plan to study.
* A short description of how the project will fit into your career trajectory.
* An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain when you will have access to the data.


'''Required Readings:'''
==== Final Project Outline ====


* Association of Internet Researchers, Ethics Working Committee, 2011, “[http://aoirethics.ijire.net/aoirethicsprintablecopy.pdf Ethics Guidelines Review Draft].” ''[Free Online]'' ([http://aoirethics.ijire.net/ Browseable Web Version])
;Outline Due Date: February 21
* Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). [http://doi.org/10.1073/pnas.1320040111 Experimental evidence of massive-scale emotional contagion through social networks.] Proceedings of the National Academy of Sciences, 111(24), 8788–8790. ''[Available through UW Libraries]''
;Maximum outline length: 5 pages
* [Look Over Briefly] Grimmelmann, James. (2014) [http://laboratorium.net/archive/2014/06/30/the_facebook_emotional_manipulation_study_source The Facebook Emotional Manipulation Study: Sources]. ''[Free Online]''
;Deliverables: Turn in in Canvas
* Carr, N. (2014, September 14). [http://lareviewofbooks.org/essay/manipulators-facebooks-social-engineering-project/ The Manipulators: Facebook’s Social Engineering Project]. Retrieved March 26, 2015. ''[Free Online]''
* Bernstein, M. (2014, July 7). [https://medium.com/@msbernst/the-destructive-silence-of-social-computing-researchers-9155cdff659 The Destructive Silence of Social Computing Researchers]. Retrieved March 26, 2015. ''[Free Online]''
* Lampe, C. (2014, July 8). [http://chronicle.com/blogs/conversation/2014/07/08/facebook-is-good-for-science/ Facebook Is Good for Science]. ''[Free Online]''
* Monroy-Hernández, Andrés, and Benjamin Mako Hill. (2016) "[https://canvas.uw.edu/files/35914734/download?download_frd=1 The Scratch Online Community Dataset.]" Working Paper. Seattle, Washington. — Read at least the ''Background & Summary'', ''Setting'', ''Defining Public Data'', and ''Research Ethics'' sections and skim the rest. Please also read the
* [https://canvas.uw.edu/files/35914781/download?download_frd=1 Scratch Data Sharing Agreement] (Draft as of 2016-03-28)


'''Optional Readings:'''
The outline should have the following sections: (a) Rationale; (b) Objectives; (b.1) General Objectives; (b.2) Specific Objectives; (c) Null hypotheses; (d) Conceptual Diagram; (e) Measures; (f) Dummy Tables.


* [http://www.hhs.gov/ohrp/policy/belmont.html The Belmont Report]. (1979).
An excellent example from my partner Mika Matsuzaki is [https://canvas.uw.edu/courses/1098035/files/40388318/download?wrap=1 online in Canvas]. Your diagram will likely be much less complicated than Matsuzaki's. Also, please don't be distracted by the fact that Mika does public health. It's the basic form I want you all to emulate, not the content. You can read [http://ajcn.nutrition.org/content/99/6/1450.full the published paper] to compare.
* American Association for the Advancement of Science, 1999, “[http://www.aaas.org/page/ethical-and-legal-aspects-human-subjects-research-cyberspace Ethical and Legal Aspects of Human Subjects Research in Cyberspace].” ''[Free Online]''
* [http://www.copyright.gov/legislation/dmca.pdf Digital Millennium Copyright Act] and these explanatory/commentary essays & sites:
** The [https://www.eff.org/ Electronic Frontier Foundation's] [https://www.eff.org/issues/dmca page on the DMCA].
** Templeton, Brad's [http://www.templetons.com/brad/copyright.html A Brief Intro to Copyright] & [http://www.templetons.com/brad/copymyths.html 10 Big Myths about Copyright Explained]
** Sections on Copyright, Privacy, and Social Media in the “Internet Case Digest” of the [http://www.perkinscoie.com/casedigest/ Perkins Coie LLP “Case Digest” site].
* Amy Bruckman's two 2016 blog posts about researchers violating Terms of Service (TOS) while doing academic research: [https://nextbison.wordpress.com/2016/02/26/tos/ Do Researchers Need to Abide by Terms of Service (TOS)? An Answer.] and [https://nextbison.wordpress.com/2016/02/29/tos2/ More on TOS: Maybe Documenting Intent Is Not So Smart]
* Narayanan, A., & Shmatikov, V. (2008). [http://doi.org/10.1109/SP.2008.33 Robust De-anonymization of Large Sparse Datasets.] In IEEE Symposium on Security and Privacy, 2008. SP 2008 (pp. 111–125). http://doi.org/10.1109/SP.2008.33 ''[Available through UW Libraries]''
* Markham, A. (2012). [http://doi.org/10.1080/1369118X.2011.641993 Fabrication as Ethical Practice.] Information, Communication & Society, 15(3), 334–353. ''[Available through UW Libraries]''
* Trevisan, F., & Reilly, P. (2014). [http://doi.org/10.1080/1369118X.2014.889188 Ethical dilemmas in researching sensitive issues online: lessons from the study of British disability dissent networks.] Information, Communication & Society, 17(9), 1131–1146. ''[Available through UW Libraries]''
* Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). [http://doi.org/10.1038/nature11421 A 61-million-person experiment in social influence and political mobilization.] Nature, 489(7415), 295–298. ''[Available through UW Libraries]''
* Bruckman, A., Luther, K., & Fiesler, C. (2015). [http://blog.kurtluther.com/pdf/bruckman_real_names.pdf When Should We Use Real Names in Published Accounts of Internet Research?] In Digital Research Confidential: The Secrets of Studying Behavior Online (pp. 243–259). Cambridge, Massachusetts: MIT Press.


=== Week 2: Monday April 4: NO CLASS ===
The example includes everything except a "Measures" section. Your Measures section only needs to include a two-column table where column 1 is the name of each variable in your analysis and column 2 is the specific operationalization of that measure and a description of how you will create it.
=== Week 2: Wednesday April 6: Web Archiving ===


<!-- NOTE: probably drop rogers; probably add chapter 2 in eszter and christian's book -->
==== Final Project ====
'''Facilitator:''' Janny


'''Required Readings:'''
;Paper Due Date: March 19
;Maximum length: 6000 words (~20 pages)
;Presentation Date: March 14
;All Deliverables: Turn in in Canvas


* Shumate, M., & Weber, M. S. (2015). [https://canvas.uw.edu/files/35982884/download?download_frd=1 The Art of Web Crawling for Social Science Research.] In E. Hargittai & C. Sandvig (Eds.), Digital Research Confidential: The Secrets of Studying Behavior Online (1 edition). The MIT Press. ''[Available in Canvas]''
I'm expecting you to produce a draft of a short research paper that, after some additional work, you could consider submitting for publication. I'm also very open to the project being a part of a dissertation. I don't expect the papers to be ''publication ready'' but I do expect them to have well-considered drafts of all the necessary pieces in terms of quantitative methodology.
* Schneider, Steven, Kirsten Foot, and Paul Wouters, 2009, “[https://canvas.uw.edu/files/35982270/download?download_frd=1 Web Archiving as E-Research],” in e-Research: Transformation in Scholarly Practice, Nicholas Jankowski (Ed.), Routledge, pp. 205-221. ''[Available in Canvas]''
* Brügger, N. (2011). [https://canvas.uw.edu/files/35981768/download?download_frd=1 Web archiving—Between past, present, and future.] In M. Consalvo & C. Ess (Eds.), The Handbook of Internet Studies (pp. 24–42). Chichester, West Sussex: Blackwell. ''[Available in Canvas]''
* Rogers, Richard, Chapter 3 "[https://canvas.uw.edu/files/35982101/download?download_frd=1 The Website as Archived Object]" from Digital Methods, pp. 61-82. ''[Available through Canvas]''<!-- REMOVE NEXT TIME -->
* Graeff, E., Stempeck, M., & Zuckerman, E. (2014). [http://firstmonday.org/ojs/index.php/fm/article/view/4947 The battle for “Trayvon Martin”: Mapping a media controversy online and off-line.] First Monday, 19(2). ''[Free Online]''


'''Optional Readings:'''
Because the emphasis in this class is on statistics and methodology and because I'm not an expert in each of your areas or fields, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is so important. Instead of providing all of these details, feel free to start with a brief summary of the purpose and importance of this research, and an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts.


* Gherab-Martin, Karim, "[https://canvas.uw.edu/files/35982369/download?download_frd=1 Digital repositories, folksonomies, and interdisciplinary research: New social epistemology tools]," Ch. 10 in HET. ''[Available in Canvas]''
I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.
* [https://www.digitalmethods.net/Digitalmethods/TheSpheres Digital Methods Initiative]. (2009). The Spheres. ''[Free Online]''
* Rogers, Richard, Chapter 4 "[https://canvas.uw.edu/files/35982103/download?download_frd=1 Googlization and the Inculpable Search Engine]" from Digital Methods. ''[Available through Canvas]''
* Schneider, S. M., & Foot, K. A. (2004). [http://doi.org/10.1177/1461444804039912 The Web as an Object of Study.] New Media & Society, 6(1), 114–122. ''[Available through UW Libraries]''
* Spaniol, M., Denev, D., Mazeika, A., Weikum, G., & Senellart, P. (2009). [http://doi.org/10.1145/1526993.1526999 Data Quality in Web Archiving.] In Proceedings of the 3rd Workshop on Information Credibility on the Web (pp. 19–26). New York, NY, USA: ACM.  ''[Available through UW Libraries]''
* [http://www.archiveteam.org/index.php?title=Main_Page Archive Team] is an online community that archives websites. They are a fantastic resource and include many pieces of detailed technical documentation on the practice of engaging in web archiving. For example, here are detailed explanations of [http://www.archiveteam.org/index.php?title=Wget#Mirroring_a_website mirroring a website with GNU wget] which is the piece of free software I usually use to archive websites.
* Weber, M. S. (2014). [https://dl.acm.org/ft_gateway.cfm?id=2579213&ftid=1444311&dwn=1&CFID=766805565&CFTOKEN=99522210 Observing the Web by Understanding the Past: Archival Internet Research.] In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion (pp. 1031–1036). Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. ''[Available through UW Libraries]''


=== Week 2: Friday April 8: CDSW Session 0 ===
In terms of content:


As described in the section on technical skills above, I expect everybody who is not comfortable with at least basic programming and data collection to attend the [[Community Data Science Workshops (Spring 2016)]] which I am running concurrently with this class.
* In terms of the structure of the paper, please see the page that I've written on the [[structure of a quantitative empirical research paper]].
* In terms of the structure of your presentation, you've got some latitude but this document on [https://canvas.uw.edu/files/40848246/download?download_frd=1 Creating a Successful Scholarly Presentation] (link is in Canvas) will likely be useful.


This session will run from 6-9pm and is the only session which can probably be missed. Please do contact me, however, if you will not be able to attend it.
=== Grading ===


=== Week 2: Saturday April 9: CDSW Session 1 ===
I have put together a very detailed page that describes the [https://mako.cc/teaching/assessment.html grading rubric] I will be using in this course. Please read it carefully. I will assign grades for each of the following items on the UW 4.0 grade scale according to the weights below:


As described in the section on technical skills above, I expect everybody who is not comfortable with at least basic programming and data collection to attend the [[Community Data Science Workshops (Spring 2016)]] which I am running concurrently with this class.
* Participation: 40%
* Proposal identification: 5%
* Final paper outline: 5%
* Final Presentation: 10%
* Final Paper: 40%


This session will run from 9am-3pm. Details on the [[CDSW Spring 2016]] page.
== Finding a Dataset ==


=== Week 3: Monday April 11: Textual Analyses ===
In order to complete your project, you will each need a dataset. If you are at the stage of your career where you already have a dataset, great! If not, there are many datasets to draw from. Here are some ideas:


'''Facilitator:''' Adam
* Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use?
* If there's an author of a study you loved, you can send a polite email asking if they are able or willing to share an archival or replication version of the dataset used in their paper. Be very polite and make it clear that this is starting as a class project but that it might turn into a paper for publication. Make your timeline clear. In communication, replication datasets are still very rare, so be prepared for a negative answer.
* Do some Google Scholar and normal Google searching for datasets in your research area. You'd be surprised at what's available.
* Take a look at datasets available in the [https://dataverse.harvard.edu/ Harvard Dataverse] (the largest collection of social science research data) or one of the other members of the [http://dataverse.org/ Dataverse network].
* Look at the collection of social scientific datasets at [https://www.icpsr.umich.edu/icpsrweb/ICPSR/ ICPSR] (UW is a member). There are an enormous number of very rich datasets.
* Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
* Set up a meeting with Jennifer Muilenburg, the Data Curriculum and Communications Librarian who runs [https://www.lib.washington.edu/digitalscholarship/services/data research data services at the UW libraries]. Her email is libdata@uw.edu. I have talked to her about this course and she is excited about meeting with you to help.
* [http://fivethirtyeight.com FiveThirtyEight.com] has published a [https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html GitHub repository and an R package] with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.


'''Required Readings:'''
In general, you're responsible for making sure that you're on the right side of the human subjects rules and that your work is ethical. Class projects generally do not need IRB approval but I hope that each of your projects will turn into something more. If your study involves human subjects research, ''that'' work will need IRB oversight of some sort. In general, you can't do a class project without IRB approval and then get approval retroactively. Secondary analysis of anonymized data is generally not considered human subjects research but I strongly suggest that you get a determination from [https://www.washington.edu/research/hsd UW's Human Subjects Division] before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need a faculty sponsor, that should ideally be your advisor. If that doesn't make sense for any of you, I'm happy to talk about serving as the faculty supervisor for the work.


* McMillan, S. J. (2000). [http://jmq.sagepub.com/content/77/1/80.short The microscope and the moving target: The challenge of applying content analysis to the World Wide Web.] Journalism and Mass Communication Quarterly, 77(1), 80-98. ''[Available through UW Libraries]''
== Structure of Class ==
* Mishne, Gilad and Natalie Glance (2006), “[http://www.ambuehler.ethz.ch/CDstore/www2006/www.blogpulse.com/www2006-workshop/papers/wwe2006-blogcomments.pdf Leave a reply: An analysis of weblog comments].” Third Annual Conference on the Weblogging Ecosystem, at WWW 2006. ''[Free Online]''
* Grimmer, J., & Stewart, B. M. (2013). [https://pan.oxfordjournals.org/content/21/3/267.full Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.] Political Analysis. ''[Available through UW Libraries]''
* DiMaggio, P., Nag, M., & Blei, D. (2013). [http://doi.org/10.1016/j.poetic.2013.08.004 Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding.] Poetics, 41(6), 570–606. ''[Available through UW Libraries]''


'''Optional Readings:'''
I expect everybody to come to class, every week, with their laptop and a power cord, ready to answer any question on the problem set and having uploaded and shared code for the code-related questions. The class is listed as nearly 4 hours long and, with the exception of a few short breaks, I intend to use the entire period. Be in class on time and be plugged in and ready to go.


I'm assuming you have at least a rough familiarity with [https://en.wikipedia.org/wiki/Content_analysis content analysis] as a methodology. If you're not as comfortable with this, check out the Wikipedia article to start. These help provide more of a background into content analysis (in general, and online):
When it comes to the statistics part of this material, this will be a primarily "flipped" classroom. What this means is that we'll be relying on the textbook and other resources to introduce the material and we'll be using the class to discuss it and answer questions that come up.


* Van Selm, Martine & Jankowski, Nick, (2005) "[https://canvas.uw.edu/files/36066292/download?download_frd=1 Content Analysis of Internet-Based Documents.]" Unpublished Manuscript. ''[Available in Canvas]''
Although the structure of class will vary, it will generally include the following parts.
* Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, Calif.: Sage Publications. ''[Available from Instructor]''
* Krippendorff, K. (2005). Content analysis: an introduction to its methodology. Thousand Oaks; London; New Delhi: Sage. ''[Available from Instructor]''


Examples of more traditional content analysis using online content:
# Quick updates about assignments, projects, and a meta-discussion about the class.
# Discussion of '''programming challenges''' due that day.
# [''Possibly/Sometimes''] Short lecture and/or Q&A about new material in Diez, Barr, and Çetinkaya-Rundel
# Discussion of  '''statistics questions''' related to new material in Diez, Barr, and Çetinkaya-Rundel and any exemplary empirical paper we have read to discuss.
# Interactive lecture introducing new statistical programming concepts.
# [''Possibly/Sometimes''] Time to begin work on next week's programming assignments.


* Trammell, K. D., Tarkowski, A., Hofmokl, J., & Sapp, A. M. (2006). [http://doi.org/10.1111/j.1083-6101.2006.00032.x Rzeczpospolita blogów (Republic of Blog): Examining Polish Bloggers Through Content Analysis.] Journal of Computer-Mediated Communication, 11(3), 702–722. ''[Free Online]''
== Schedule ==
* Woolley, J. K., Limperos, A. M., & Oliver, M. B. (2010). [http://doi.org/10.1080/15205436.2010.516864 The 2008 Presidential Election, 2.0: A Content Analysis of User-Generated Political Facebook Groups.] Mass Communication and Society, 13(5), 631–652. ''[Available from UW Libraries]''


Another example of topic modeling, but from political science:
When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y of chapter n.


* Barberá, P., Bonneau, R., Egan, P., Jost, J. T., Nagler, J., & Tucker, J. (2014). [http://smapp.nyu.edu/SMAPP_Website_Papers_Articles/leadersAndFollowersMeasuringPolitical.pdf Leaders or Followers? Measuring Political Responsiveness in the US Congress Using Social Media Data.] Presented at the Annual Meeting of the American Political Science Association. ''[Free Online]''
=== Week 1: Tuesday January 3: Introduction, Setup, and Data and Variables ===


=== Week 3: Wednesday April 13: Digital Ethnography & Trace Ethnography ===
Hopefully, the material in OpenIntro feels very familiar from COM520. The programming material will be new but I want you to read it before you come to class so we can work through the examples as a group.


'''Session Coordinator:''' Nate
'''Required Readings:'''


'''Required Readings:'''
* Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data)
* Verzani: §1 (Getting Started), §2 (Univariate data) [[https://faculty.washington.edu/makohill/com521/verzani-usingr-ch1_ch2.pdf Available with UW NetID]]
* Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Available through UW libraries]]


* Robinson, Laura and Jeremy Schulz, "[https://canvas.uw.edu/files/36067312/download?download_frd=1 New fieldsites, new methods: New ethnographic opportunities]," Ch. 8 in HET. ''[Available in Canvas]''
* [Selections] Jemielniak, D. (2014). Common Knowledge?: An Ethnography of Wikipedia. Stanford, California: Stanford University Press. [https://canvas.uw.edu/files/36067463/download?download_frd=1 "Introduction" and "Appendix A: Methodology."] ''[Available in Canvas]''
* Geiger, R. S., & Ribes, D. (2010). [http://doi.org/10.1145/1718918.1718941 The Work of Sustaining Order in Wikipedia: The Banning of a Vandal.] In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (pp. 117–126). New York, NY, USA: ACM. ''[Available through UW Libraries]''
* Geiger, R. S., & Ribes, D. (2011). [http://doi.org/10.1109/HICSS.2011.455 Trace Ethnography: Following Coordination Through Documentary Practices.] In Proceedings of the 2011 44th Hawaii International Conference on System Sciences (pp. 1–10). Washington, DC, USA: IEEE Computer Society. ''[Available through UW Libraries]''
'''Optional Readings:'''
'''Optional Readings:'''


* Coleman, E. G. (2010). [http://doi.org/10.1146/annurev.anthro.012809.104945 Ethnographic Approaches to Digital Media.] Annual Review of Anthropology, 39(1), 487–505. ''[Available through UW Libraries]''
* Verzani: §A (Programming)
* [https://canvas.uw.edu/files/36079509/download?download_frd=1 Response by danah boyd To Hine's "Question One: How Can Qualitative Internet Researchers Define the Boundaries of Their Projects?"] from Internet Inquiry: Conversations About Method, Annette Markham and Nancy Baym (Eds.), Sage, 2009, pp. 1-32. ''[Available in Canvas]''
:Note: You may also be interested in reading [https://canvas.uw.edu/files/36079510/download?download_frd=1 the essay by Hine that boyd is responding to]. ''[Available in Canvas]''


This is the canonical book-length account and ''the'' main citation in this space:
'''Assignment (Complete Before Class):'''


* Hine, C. (2000). Virtual ethnography. London, UK: SAGE Publications. ''[Available from Instructor]''
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 1]]


These are all other interesting and/or frequently cited examples of Internet-based ethnographies:
'''Lectures:'''


* Humphreys, L. (2007). [http://doi.org/10.1111/j.1083-6101.2007.00399.x Mobile Social Networks and Social Practice: A Case Study of Dodgeball.] Journal of Computer-Mediated Communication, 13(1), 341–360. ''[Available through UW Libraries]''
* [https://communitydata.cc/~mako/2017-COM521/com521-week_01-r_programming_intro-20170103.ogv Week 1 R lecture screencast (Part I): Introduction to R and univariate statistics] (~1 hour 47 minutes)
: Note: Dodgeball is a mobile social network system (MSNS) that allows groups of friends to connect and meet up via mobile phone. The author employed participant observation in order to understand norms of interaction in the MSNS "space".
* [https://communitydata.cc/~mako/2017-COM521/com521-week_01-github_rscripts-20170104.ogv Week 1 R lecture screencast (Part II): Setting up git/GitHub and saving files in RStudio] (~40 minutes)
* Brotsky, S. R., & Giles, D. (2007). [http://doi.org/10.1080/10640260701190600 Inside the “Pro-ana” Community: A Covert Online Participant Observation.] Eating Disorders, 15(2), 93–109. ''[Available through UW Libraries]''
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 1]]
:Note: To conduct the study reported in this paper the authors created and used a fake profile in order to observe the psychological support offered to participants.
* Williams, M. (2007). [http://doi.org/10.1177/1468794107071408 Avatar watching: participant observation in graphical online environments.] Qualitative Research, 7(1), 5–24. ''[Available through UW Libraries]''
: Note: Fantastic, more general introduction, but with takeaways that are more specifically targeted toward people studying virtual-reality-type environments with virtual physicality.


'''Apropos of class discussion:'''
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_01&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §1 Lecture Notes]
* Borges, J. L. (1998). [https://canvas.uw.edu/files/36195543/download?download_frd=1 Pierre Menard, author of the Quixote]. In A. Hurley (Trans.), Collected Fictions (pp. 88–95). New York, N.Y., U.S.A: Viking Press. ''[Available in Canvas]''
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including some for §1
* Maxwell, J. A. (2002). [https://canvas.uw.edu/files/36195592/download?download_frd=1 Understanding and validity in qualitative research]. In A. M. Huberman & M. B. Miles (Eds.), The Qualitative Researcher’s Companion (pp. 37–64). SAGE. ''[Available in Canvas]''
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 1]]
 
=== Week 4: Monday April 18: Online Interviews ===


'''Facilitator:''' Julia
=== Week 2: Tuesday January 10: Probability and Visualization ===


'''Required Readings:'''
'''Required Readings:'''


* O’Connor, H., Madge, C., Shaw, R., & Wellens, J. (2008). [http://srmo.sagepub.com/view/the-sage-handbook-of-online-research-methods/n15.xml Internet-based Interviewing]. In N. G. Fielding, R. M. Lee, & G. Blank (Eds.), The SAGE Handbook of Online Research Methods (pp. 271–289). London, UK: SAGE Publications, Ltd. ''[Available through UW Libraries]''
* Diez, Barr, and Çetinkaya-Rundel: §2 (Probability)
* Stewart, K., & Williams, M. (2005). [http://doi.org/10.1177/1468794105056916 Researching online populations: the use of online focus groups for social research.] Qualitative Research, 5(4), 395–416.
* Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics) [[https://faculty.washington.edu/makohill/com521/verzani-usingr-ch3.1-2_ch4_ch5.pdf Available with UW NetID]]
* Hanna, P. (2012). [http://doi.org/10.1177/1468794111426607 Using internet technologies (such as Skype) as a research medium: a research note.] Qualitative Research, 12(2), 239–242. ''[Available through UW Libraries]''
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]
: Note: Short article you can basically skim. Read it quickly so you can cite it later.
* Dowling, S. (2012). [http://srmo.sagepub.com/view/cases-in-online-interview-research/n11.xml Online Asynchronous and Face-to-Face Interviewing: Comparing Methods for Exploring Women’s Experiences of Breastfeeding Long Term]. In Salmons, J. (Ed.), Cases in Online Interview Research (pp. 277–303). Thousand Oaks, California: SAGE Publications, Inc. ''[Available through UW Libraries]''


'''Optional Readings:'''
'''Assignment (Complete Before Class):'''


* boyd,  danah. (2015). [https://canvas.uw.edu/files/36133652/download?download_frd=1 Making sense of teen life: Strategies for capturing ethnographic data in a networked era.] In E. Hargittai & C. Sandvig (Eds.), Digital Research Confidential: The Secrets of Studying Behavior Online. Cambridge, Massachusetts: The MIT Press. ''[Available in Canvas]''
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 2]]
: Note: Strongly focused on ethnographic interviews with tons of very specific details. Fantastic article on interviewing, although perhaps a bit weak on Internet-specific advice.
* Markham, Annette (1998), "[https://canvas.uw.edu/files/36133806/download?download_frd=1 The Shifting Project, the Shifting Self]," from Life Online, Altamira Press, 1998, pp. 61-84. ''[Available in Canvas]''
: Note: One of the earliest books on online life and one of the earliest attempts to do online interviewing. This is dated, but it highlights some important challenges.
* Stromer-Galley, Jennifer (2003), "[https://canvas.uw.edu/files/36133838/download?download_frd=1 Depth Interviews for the Study of Motives and Perceptions of Internet Use]," International Communication Association, San Diego, May. ''[Available in Canvas]''
: Note: Start reading on page 8 on "The Internet and the Interview". The beginning is a theoretical argument that's not really relevant to this class.
* Chou, C. (2001). [http://online.liebertpub.com/doi/abs/10.1089/109493101753235160 Internet heavy use and addiction among Taiwanese college students: an online interview study.] CyberPsychology & Behavior, 4(5), 573-585. ''[Available through UW Libraries]''


'''Alternate Accounts:'''
'''Lectures:'''


These texts are largely redundant to the required texts above but do provide a different perspective and examples:
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 2]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_02-lists_dataframes_graphing-20170111.ogv Week 2 R lecture screencast: lists, matrixes, data frames, and beginning graphing] (~1 hour 8 minutes)


* Salmons, J. (2014). Qualitative Online Interviews: Strategies, Design, and Skills. SAGE Publications.
'''Resources:'''
: This is a book that lays out what claims to be a comprehensive account of online interviewing. Take a quick look through the [https://canvas.uw.edu/files/36133582/download?download_frd=1 preface and table of contents] and read [https://canvas.uw.edu/files/36133581/download?download_frd=1 Chapter 1]. ''[Both Available in Canvas.]''
: I have the book and am happy to loan my copy to anybody in the class that thinks this will be a core part of their research.
* Morgan, David L. and Bojana Lobe, "[https://canvas.uw.edu/files/36133861/download?download_frd=1 Online focus groups]," Ch. 9 in HET. ''[Available in Canvas]''
* Gaiser, T. J. (2008). [http://srmo.sagepub.com/view/the-sage-handbook-of-online-research-methods/n16.xml Online Focus Groups]. In N. G. Fielding, R. M. Lee, & G. Blank (Eds.), The SAGE Handbook of Online Research Methods (pp. 290–307). London, UK: SAGE Publications, Ltd. ''[Available through UW Libraries]''


=== Week 4: Wednesday April 20: Social Network Analysis ===
* [https://www.openintro.org/download.php?file=os3_slides_02&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §2 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 short videos for §2
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 2]]


'''Facilitator:''' Mengjun
=== Week 3: Tuesday January 17: Distributions ===


'''Required Readings:'''
'''Required Readings:'''


* Garton, Laura, Caroline Haythornthwaite, and Barry Wellman, "[http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.1997.tb00062.x/abstract Studying Online Social Networks]," Journal of Computer-Mediated Communication, V. 3, N. 1, June, 1997. ''[Free Online]''
* Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4: You should read the rest of the chapter (§3.3 and §3.5). I won't assign problem set questions about it but it's still important to be familiar with.
* Mislove, Alan, et al (2007), "[https://dl.acm.org/citation.cfm?id=1298311 Measurement and Analysis of Online Social Networks]," IMC 2007, October 24-27, San Diego, CA ''[Available through UW Libraries]''
* Verzani: §6 (Populations)
* Howard, Phil, "[http://nms.sagepub.com/content/4/4/550.short Network Ethnography and Hypermedia Organization: New Organizations, New Media, New Myths]," New Media and Society, December 2002, 4(4), pp. 550-574. ''[Available through UW Libraries]''
* Keegan, B., Gergle, D., & Contractor, N. (2013). [http://abs.sagepub.com/content/57/5/595 Hot Off the Wiki Structures and Dynamics of Wikipedia’s Coverage of Breaking News Events.] American Behavioral Scientist, 57(5), 595–622. ''[Available through UW Libraries]''  <!-- REMOVE NEXT TIME -->


=== Week 4: Saturday April 23: CDSW Session 2 ===
'''Assignment (Complete Before Class):'''


As described in the section on technical skills above, I expect everybody who is not comfortable with at least basic programming and data collection to attend the [[Community Data Science Workshops (Spring 2016)]], which I am running concurrently with this class.
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 3]]


This session will run from 10am-4pm. Details on the [[CDSW Spring 2016]] page.
'''Lectures:'''


=== Week 5: Monday April 25: Experiments ===
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 3]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-loading_data_functions_apply_misc.ogv Week 3 R lecture screencast: Loading data, functions; apply(), lapply(), sapply(); several miscellaneous functions] (~34 minutes) — This is the same material I covered in class. If you followed it, there's no reason you need to go back to this.
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-dates_tapply_merge.ogv Week 3 R lecture screencast: Dates; tapply(); and merge()] (~38 minutes) [The audio seems to be broken for the last 10 minutes. Sorry about that! I've rerecorded that below.]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-merge.ogv Week 3 R lecture screencast: merge()] (~13 minutes) [Rerecording of the last few minutes of the previous video.]


'''Facilitator:''' Emma
'''Resources:'''
 
'''Required Readings:'''
 
* Reips, U.-D. (2002). [http://doi.org/10.1026//1618-3169.49.4.243 Standards for Internet-based experimenting]. Experimental Psychology, 49(4), 243–256. [[http://iscience.deusto.es/wp-content/uploads/2010/04/ulf27.pdf Alternate Link]]
* Hergueux, J., & Jacquemet, N. (2014). [http://doi.org/10.1007/s10683-014-9400-5 Social preferences in the online laboratory: a randomized experiment]. Experimental Economics, 18(2), 251–283. ''[Available through UW Libraries]''
* Salganik, M. J., Dodds, P. S., & Watts, D. J. (2006). [http://doi.org/10.1126/science.1121066 Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market]. Science, 311(5762), 854–856. ''[Available through UW Libraries]''
* Rijt, A. van de, Kang, S. M., Restivo, M., & Patil, A. (2014). [http://doi.org/10.1073/pnas.1316836111 Field experiments of success-breeds-success dynamics]. Proceedings of the National Academy of Sciences, 111(19), 6934–6939. ''[Available through UW Libraries]'' [[http://www.akshaynpatil.com/papers/success.pdf Alternative Link]]
* Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). [http://doi.org/10.1038/nature11421 A 61-million-person experiment in social influence and political mobilization]. Nature, 489(7415), 295–298. ''[Available through UW Libraries]''


'''Optional Readings:'''
* [https://www.openintro.org/download.php?file=os3_slides_03&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §3 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 videos for §3.1 and §3.2
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 3]]


* Zhu, H., Zhang, A., He, J., Kraut, R., & Kittur, A. (2013). [http://doi.org/10.1145/2470654.2481311 Effects of Peer Feedback on Contribution: A Field Experiment in Wikipedia]. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Paris, France: ACM. ''[Available through UW Libraries]''
=== Week 4: Tuesday January 24: Statistical significance and hypothesis testing ===
* Restivo, M., & van de Rijt, A. (2012). [http://dx.doi.org/10.1371/journal.pone.0034358 Experimental Study of Informal Rewards in Peer Production]. PLoS ONE, 7(3), e34358. ''[Free Online]''
: This is really just a more in-depth version of the experiments in the Restivo and van de Rijt article described above.
* Restivo, M., & van de Rijt, A. (0). [http://doi.org/10.1080/1369118X.2014.888459 No praise without effort: experimental evidence on how rewards affect Wikipedia’s contributor community]. Information, Communication & Society, 0(0), 1–12. ''[Available through UW Libraries]''
: Note: This piece is, more or less, a continuation of the Restivo and van de Rijt piece included above but it is longer and goes into much more depth on at least one of the important theoretical issues.
* Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). [http://doi.org/10.1073/pnas.1320040111 Experimental evidence of massive-scale emotional contagion through social networks]. Proceedings of the National Academy of Sciences, 111(24), 8788–8790. ''[Available through UW Libraries]''
: Note: We've already read this piece, but I'd like to discuss it again.
* Cosley, D., Frankowski, D., Terveen, L., & Riedl, J. (2007). [http://doi.org/10.1145/1216295.1216309 SuggestBot: Using Intelligent Task Routing to Help People Find Work in Wikipedia]. In Proceedings of the 12th International Conference on Intelligent User Interfaces (pp. 32–41). New York, NY, USA: ACM. ''[Available through UW Libraries]''
* Reinecke, K., & Gajos, K. Z. (2015). [http://doi.org/10.1145/2675133.2675246 LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples]. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 1364–1378). New York, NY, USA: ACM. ''[Available through UW Libraries]''
 
=== Week 5: Wednesday April 27: Surveys ===
 
'''Facilitator:''' Ben


'''Required Readings:'''
'''Required Readings:'''


* Van Selm, Martine & Nicholas Jankowski (2006), "[http://doi.org/10.1007/s11135-005-8081-8 Conducting Online Surveys]," Quality and Quantity, 40: 435-456. ''[Available through UW Libraries]''
* Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference)
* Walejko, Gina, "[https://canvas.uw.edu/files/36334690/download?download_frd=1 Online survey: Instant publication, instant mistake, all of the above]," from Research Confidential: Solutions to Problems Most Research Scientists Pretend They Never Have, University of Michigan Press, 2009, pp. 101-121. ''[Available in Canvas]''
* Verzani: §7 (Statistical inference), §8 (Confidence intervals)
* Joseph A. Konstan, B. R. Simon Rosser, Michael W. Ross, Jeffrey Stanton, & Weston M. Edwards, “[http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2005.tb00248.x/full The Story of Subject Naught: A Cautionary but Optimistic Tale of Internet Survey Research],” Journal of Computer-Mediated Communication, V.10, N. 2, January 2005. ''[Free Online]''
* Hill, B. M., & Shaw, A. (2013). [http://dx.doi.org/10.1371/journal.pone.0065782 The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation]. PLoS ONE, 8(6), e65782. ''[Free Online]''
* Salganik, M. J., & Levy, K. E. C. (2015). [http://doi.org/10.1371/journal.pone.0123483 Wiki Surveys: Open and Quantifiable Social Data Collection]. PLOS ONE, 10(5), e0123483. ''[Free Online]''
: Note: [http://www.technologyreview.com/view/531696/inspired-by-wikipedia-social-scientists-create-a-revolution-in-online-surveys/ This journalistic account of the research] may also be useful.


'''Optional Readings:'''
'''Assignment (Complete Before Class):'''


If you don't have a background in survey design, these two have been recommended by our guest speaker as good basic things to read:
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 4]]


* Krosnick, J. A. (1999). [http://doi.org/10.1146/annurev.psych.50.1.537 Survey Research]. Annual Review of Psychology, 50(1), 537–567. ''[Available through UW Libraries]''
'''Lectures:'''
* Krosnick, J. A. (1999). Maximizing measurement quality: Principles of good questionnaire design. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of Political Attitudes. New York: Academic Press.


These are other texts on the subject that you might find useful:
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 4]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_04-misc_confint_simulation-20170125.ogv Week 4 R lecture screencast: order(); confidence intervals; simulations drawn from repeated random samples] (~27 minutes)


* Dal, Michael, "[https://canvas.uw.edu/files/36335114/download?download_frd=1 Online data collection and data analysis using emergent technologies]," Ch. 12 in HET. ''[Available in Canvas]''
'''Resources:'''
* Smith, Tom W. and John Sokolowski, "[https://canvas.uw.edu/files/36335113/download?download_frd=1 The use of audiovisuals in surveys]," Ch. 19 in HET. ''[Available in Canvas]''
* Kellock, Anne, et. al. "[https://canvas.uw.edu/files/36335148/download?download_frd=1 Using technology and the experience sampling method to understand real life]," Ch. 24 from HET. ''[Available in Canvas]''
* Yun, Gi Woong and Craig Trumbo, "[http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2000.tb00112.x/abstract Comparative Response to a Survey Executed by Post, E-mail and Web Form]," Journal of Computer-Mediated Communication, V.6, N.1, September, 2000. ''[Free Online]''
* Hargittai, Eszter, and Chris Karr, "[https://canvas.uw.edu/files/36334928/download?download_frd=1 WAT R U DOIN? Studying the Thumb Generation Using Text Messaging]," from Research Confidential: Solutions to Problems Most Research Scientists Pretend They Never Have, University of Michigan Press, 2009, pp. 192-216. ''[Available in Canvas]''
* Wright, Kevin, "[http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2005.tb00259.x/abstract Researching Internet-Based Populations: Advantages and Disadvantages of Online Survey Research, Online Questionnaire Authoring Software Packages, and Web Survey Services]," Journal of Computer-Mediated Communication, V. 10, N. 3, April 2005. ''[Free Online]''


=== Week 6: Monday May 2: Narrative, Discourse and Visual Analysis ===
* [https://www.openintro.org/download.php?file=os3_slides_04&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §4 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 7 videos for nearly all of §4
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 4]]


'''Facilitator:''' Liang
=== Week 5: Tuesday January 31: Continuous Numeric Data & ANOVA ===


'''Required Readings:'''
'''Required Readings:'''


Narrative Analysis:
* Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data)
* Verzani: §9 (significance tests), §12 (Analysis of variance)
* Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” ''The American Statistician'' 60(4):328–31. [[http://dx.doi.org/10.1198/000313006X152649 Available through UW Libraries]]
* Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. ''Public Relations Review'', 33(3), 340–342. https://doi.org/10.1016/j.pubrev.2007.05.016 [Available through UW Libraries]
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]


* Mitra, A. (1999). [http://doi.org/10.1111/j.1083-6101.1999.tb00330.x Characteristics of the WWW Text: Tracing Discursive Strategies]. Journal of Computer-Mediated Communication, 5(1), 0–0.  ''[Free Online]''
'''Assignment (Complete Before Class):'''
* Kaun, Anne (2010), "[http://ejournals.library.ualberta.ca/index.php/IJQM/article/view/7165 Open-Ended Online Diaries: Capturing Life as it is Narrated]," International Journal of Qualitative Methods, Vol. 9 Issue 2, p133-148. ''[Free Online]''


Visual Analysis:
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 5]]


* Hochman, N., & Schwartz, R. (2012). [https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4782 Visualizing Instagram: Tracing Cultural Visual Rhythms]. In Sixth International AAAI Conference on Weblogs and Social Media. ''[Available through UW Libraries]''
'''Lectures:'''
* Hochman, N., & Manovich, L. (2013). [http://firstmonday.org/ojs/index.php/fm/article/viewArticle/4711/ Zooming into an Instagram City: Reading the local through social media]. First Monday, 18(7). ''[Free Online]''


'''Optional Readings:'''
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 5]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-ttests_and_anova.ogv Week 5 R lecture screencast: t-tests] (~22 minutes)
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-for_if.ogv Week 5 R lecture screencast: for loops and if statements] (~12 minutes)


Narrative Analysis:
'''Resources:'''


* Gubrium, Aline and K.C. Nat Turner, "[https://canvas.uw.edu/files/36418703/download?download_frd=1 Digital storytelling as an emergent method for social research and practice]," Ch. 21 in HET.
* [https://www.openintro.org/download.php?file=os3_slides_05&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §5 Lecture Notes]


Visual Analysis:
=== Week 6: Tuesday February 7: Categorical data ===


* Newbold, Curtis, 2013, "[http://thevisualcommunicationguy.com/2015/01/12/how-to-do-a-visual-analysis-a-five-step-process/ How to Do a Visual Analysis (A 5-Step Process)]". ''[Free Online]''
'''Required Readings:'''
: Note: Although I'm not a fan of infographics as a genre, I suppose it makes sense that visual communication people would put together a pretty good one! If you're already familiar with visual analysis from the rhetorical tradition, there's not going to be a lot new here. If this is new for you, this will help you frame and understand the other readings.
* Torralba, A. (2009). [http://videolectures.net/nips09_torralba_uvs/ Understanding Visual Scenes]. Tutorial presented at the NIPS, Vancouver, BC, Canada. Part I. ''[Free Online]''
: Note: This is a two-part (each part is one hour) lecture and tutorial by an expert in computer vision. I strongly recommend watching Part I. I think this gives you a good sense of the kinds of challenges that were (and still are) facing the field of computer vision and anybody trying to have their computer look at images.


These five papers all describe technical approaches to image classification using Internet-based datasets of images like Flickr, Google Image Search, Google Street View, or Instagram. Each of them describes interesting and challenging technical issues. If you're interested, it would be a great idea to read these to get a sense for the state of the art and what is and isn't possible:
* Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
* Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science: Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” ''American Scientist'' 102(6):460. [[https://www.americanscientist.org/issues/pub/2014/6/the-statistical-crisis-in-science/1 Available through UW Libraries]] (This is a reworked version of [http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf this unpublished manuscript] which provides more detailed examples.)
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]


* Jaffe, A., Naaman, M., Tassa, T., & Davis, M. (2006). [http://doi.org/10.1145/1178677.1178692 Generating Summaries and Visualization for Large Collections of Geo-referenced Photographs]. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (pp. 89–98). New York, NY, USA: ACM. ''[Available through UW Libraries]''
'''Assignment (Complete Before Class):'''
* Simon, I., Snavely, N., & Seitz, S. M. (2007). [http://doi.org/10.1109/ICCV.2007.4408863 Scene Summarization for Online Image Collections]. In Computer Vision, IEEE International Conference on (Vol. 0, pp. 1–8). Los Alamitos, CA, USA: IEEE Computer Society. ''[Free Online]''
* Crandall, D. J., Backstrom, L., Huttenlocher, D., & Kleinberg, J. (2009). [http://doi.org/10.1145/1526709.1526812 Mapping the World’s Photos]. In Proceedings of the 18th International Conference on World Wide Web (pp. 761–770). New York, NY, USA: ACM. ''[Available through UW Libraries]''
* San Pedro, J., & Siersdorfer, S. (2009). [http://doi.org/10.1145/1526709.1526813 Ranking and Classifying Attractiveness of Photos in Folksonomies]. In Proceedings of the 18th International Conference on World Wide Web (pp. 771–780). New York, NY, USA: ACM. ''[Available through UW Libraries]''
* Doersch, C., Singh, S., Gupta, A., Sivic, J., & Efros, A. A. (2012). [http://doi.org/10.1145/2185520.2185597 What Makes Paris Look Like Paris?] ACM Trans. Graph., 31(4), 101:1–101:9. ''[Available through UW Libraries]''


Discourse Analysis:
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 6]]


* Honeycutt, Courtenay (2005), “[http://onlinelibrary.wiley.com/enhanced/doi/10.1111/j.1083-6101.2005.tb00240.x Hazing as a process of boundary maintenance in an online community]”, Journal of Computer-Mediated Communication, 10(2). [Available through UW Libraries]
'''Lectures:'''
:Note: Combines quantitative and qualitative computer-mediated discourse analysis methods.


=== Week 6: Wednesday May 4: Crowdsourced Data Analysis and Experimentation ===
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 6]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_06-tables_chisq_debugging.ogv Week 6 R lecture screencast: Tables, <math>\chi^2</math>-tests, and debugging.] (~40 minutes)


'''Assignment:'''
'''Resources:'''


* Find and complete at least 2 "hits" as a worker on [http://mturk.com Amazon Mechanical Turk]. Note that to do this you will need to create a ''worker'' account on Mturk.
* [https://www.openintro.org/download.php?file=os3_slides_06&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §6 Lecture Notes]
** Record (write down) details and notes about your tasks: What did you do? Who was the requester? What was the purpose of the task (as best you could tell)? What was the experience like? What research applications can you (not) imagine for this kind of system?
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7
* Design and deploy a small-scale research task on Mturk. Note that to do this, you will need to create a ''requester'' account on Mturk. Be sure to allow some time to get the task design the way you want it! Some ideas for study designs you might do:
** A small survey.
** Classification of texts or images (e.g., label tweets, pictures, or comments from a discussion thread).
** A small experiment (e.g., you can do a survey where you insert ''different'' images and ask the same set of questions). Check out the [https://requester.mturk.com/help/getting_started.html Mturk requester getting started guide].
* Prepare to share details of your small-scale research task in class, including results (they will come fast).


''Note:'' In terms of running your task, it will cost real money and you have to put money on your Amazon account yourself. You've each got a $3 budget. Please use your credit card to put $3 on your account right away. I will pay each of you $3 in cash next week to reimburse you for the cost of running the experiment.
=== Week 7: Tuesday February 14: Linear Regression ===


'''Required Readings:'''
'''Required Readings:'''


* [https://docs.aws.amazon.com/AWSMechTurk/latest/RequesterUI/Introduction.html Amazon Mechanical Turk Requester UI Guide] ''[Free Online]''
* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression); §8.1-8.3 (Multiple regression)
* [https://mturkpublic.s3.amazonaws.com/docs/MTURK_BP.pdf Amazon Mechanical Turk Best Practices Guide]. ''[Free Online]''
* OpenIntro eschews a mathematical introduction to correlation. Please look over [https://en.wikipedia.org/wiki/Correlation_and_dependence the Wikipedia article on correlation and dependence] and pay attention to the formulas. It's tedious to compute, but I'd like you to at least see what goes into it.
* Weinberg, J., Freese, J., & McElhattan, D. (2014). [https://www.sociologicalscience.com/articles-vol1-19-292/ Comparing Data Characteristics and Results of an Online Factorial Survey between a Population-Based and a Crowdsource-Recruited Sample]. Sociological Science, 1, 292–310. ''[Free Online]''
* Verzani: §11.1-2 (Linear regression)
* Shaw, A. (2015). [https://canvas.uw.edu/files/36419326/download?download_frd=1 Hired Hands and Dubious Guesses: Adventures in Crowdsourced Data Collection]. In E. Hargittai & C. Sandvig (Eds.), Digital Research Confidential: The Secrets of Studying Behavior Online. The MIT Press. ''[Available in Canvas]''
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available in UW libraries]]


'''Optional Readings:'''
'''Assignment (Complete Before Class):'''


* Gray, M. L., Suri, S., Ali, S. S., & Kulkarni, D. (2016). [http://sidsuri.com/Publications_files/collab_paper21.pdf The Crowd is a Collaborative Network]. Proceedings of Computer-Supported Cooperative Work. ''[Free Online]''
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 7]]
* Kittur et al. (2013). [http://hci.stanford.edu/publications/2013/CrowdWork/futureofcrowdwork-cscw2013.pdf The Future of Crowd Work]. Proceedings of Computer-Supported Cooperative Work. ''[Free Online]''


'''Resources:'''
'''Lectures:'''
* [http://www.mturk-tracker.com/ Mturk Tracker]


=== Week 6: Saturday May 7: CDSW Session 3 ===
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 7]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_07-linear_regression.ogv Week 7 R lecture screencast: linear regression] (~42 minutes)


As described in the section on technical skills above, I expect everybody who is not comfortable with at least basic programming and data collection to attend the [[Community Data Science Workshops (Spring 2016)]], which I am running concurrently with this class.
'''Resources:'''


This session will run from 9am-3pm. Details on the [[CDSW Spring 2016]] page.
* [https://www.openintro.org/download.php?file=os3_slides_07&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §7 Lecture Notes]
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7 and 3 videos on the sections §8.1-8.3


=== Week 7: Monday May 9: Consulting Week (i.e., no group meeting) ===
=== Week 8: Tuesday February 21: Polynomial Terms, Interactions, and Logistic Regression ===


During this week, we will not meet together. Instead, I will schedule one-on-one in-person meetings of an hour with each student individually to catch up with you about your project and to work directly with you to resolve any technical issues you have run into with data collection, etc.
'''Required Readings:'''


=== Week 7: Wednesday May 11: Consulting Week (i.e., no group meeting) ===
* [https://onlinecourses.science.psu.edu/stat501/node/301 Lesson 8: Categorical Predictors] and [https://onlinecourses.science.psu.edu/stat501/node/318 Lesson 9: Data Transformations] from the PennState Eberly College of Science STAT 501 Regression Methods Course. There are several subparts (many quite short), please read them all carefully.
* Diez, Barr, and Çetinkaya-Rundel: §8.4 (Multiple and logistic regression)
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” ''PLoS Medicine'' 2(8):e124. [[http://dx.doi.org/10.1371%2Fjournal.pmed.0020124 Open Access]]
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available in UW libraries]]


=== Week 8: Monday May 16: Consulting Week (i.e., no group meeting) ===
'''Optional Readings:'''


During this week, we will not meet together. Instead, I will schedule one-on-one in-person meetings of an hour with each student individually to catch up with you about your project and to work directly with you to resolve any technical issues you have run into with data collect
* Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” ''PLOS Biology'' 13(3):e1002106. [[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 Open Access]]


=== Week 8: Wednesday May 18: Consulting Week (i.e., no group meeting) ===
'''Assignment (Complete Before Class):'''


=== Week 9: Monday May 23: Design Research ===
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 8]]


Today we'll have a guest visitor: [http://www.andresmh.com/ Andrés Monroy-Hernández] from [http://fuse.microsoft.com/ Microsoft Research's FUSE Labs], who is also affiliate faculty in the Department of Communication and the Department of Human-Centered Design and Engineering at UW. Monroy-Hernández's research involves studying people by designing and building systems. He's built a number of very large and successful socio-technical systems as part of his research. In his graduate work, he built the [http://scratch.mit.edu/ Scratch Online Community], which is now used by more than 10 million people.
'''Lectures:'''


I've asked him to come and talk to us about design research as a process. As a result, it will be helpful to read about two projects he has worked on recently that he will talk to us about. Those projects are called NewsPad and Eventful.
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 8]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_08-more_regression_anova_redux.ogv Week 8 R lecture screencast: more on linear regression, including interactions, polynomials, log transformations; anova] (~28 minutes)


'''Required Readings:'''
'''Resources:'''


* Olsen, D. R., Jr. (2007). [http://doi.org/10.1145/1294211.1294256 Evaluating User Interface Systems Research.] In Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology (pp. 251–258). New York, NY, USA: ACM. [Available through UW Libraries]
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* J. Nathan Matias and Andres Monroy-Hernandez, [http://research.microsoft.com/apps/pubs/default.aspx?id=208886 NewsPad: Designing for Collaborative Storytelling in Neighborhoods]. CHI Work in Progress Paper. ACM, March 2014.
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including a video on §8.4
* Elena Agapie, Jaime Teevan, and Andrés Monroy-Hernández, [http://research.microsoft.com/apps/pubs/default.aspx?id=252315 Crowdsourcing in the Field: A Case Study Using Local Crowds for Event Reporting], in Human Computation (HCOMP), AAAI - Association for the Advancement of Artificial Intelligence, August 2015.
* I've written this document which will likely be useful for many of you: [https://communitydata.cc/~mako/2017-COM521/logistic_regression_interpretation.html Interpreting Logistic Regression Coefficients with Examples in R]
* Two very short videos describing the systems: [http://research.microsoft.com/en-us/projects/newspad/ NewsPad by FUSE Labs] and [http://research.microsoft.com/en-us/projects/eventful/ Eventful by FUSE Labs]
 
=== Week 9: Wednesday May 25: Digital Trace and Sensor Data ===
 
'''Required Readings:'''


Read any 2 of these 4 chapters from the [https://global.oup.com/academic/product/the-handbook-of-emergent-technologies-in-social-research-9780195373592 Handbook of Emerging Technology in Social Research]:
=== Week 9: Tuesday February 28: Consulting Meetings ===


* Eagle, Nathan, "[https://canvas.uw.edu/files/36870285/download?download_frd=1 Mobile phones as sensors for social research]," Ch. 22 in HET.
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.
* Visser, Albertine and Ingrid Mulder, "[https://canvas.uw.edu/files/36870283/download?download_frd=1 Emergent technologies for assessing social feelings and experiences]," Ch. 16 in HET.
* de Haan, Geert, et. al., "[https://canvas.uw.edu/files/36870284/download?download_frd=1 Bringing the research lab into everyday life: Exploiting sensitive environments to acquire data for social research]," Ch. 23 in HET.
* Fowler, Chris, et. al., "[https://canvas.uw.edu/files/36870282/download?download_frd=1 Living laboratories: Social research applications and evaluations]," Ch. 27 in HET.
* Holohan, Anne, et. al., "[https://canvas.uw.edu/files/36870280/download?download_frd=1 The digital home: A new locus of social science research]," Ch. 28 in HET.


=== Week 10: Monday May 30: Final Presentations  ===
=== Week 10: Tuesday March 7: Consulting Meetings ===
=== Week 10: Wednesday June 1: Final Presentations  ===


=== Not Covered: Hyperlink Networks ===
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.


* Jackson, Michele, (1997), "Assessing the Structure of the Communication on the World Wide Web," Journal of Computer-Mediated Communication, V. 3, N. 1, June, 1997.
=== Week 11: March 14: Final Presentations ===
* Jackson, Michele, (2011), "What Should Researchers Infer From Links on an Organization's Site?", blog post at http://assett.colorado.edu/jackson/what-should-researchers-infer-from-links-on-an-organizations-site/
* Olesen, Thomas (2004), "The Transnational Zapatista Solidarity Network: An Infrastructure Analysis," Global Networks, 4(1):89-107 [Although this article uses the term infrastructure analysis, the method employed is best described as a hyperlink network analysis.]


== Administrative Notes ==
== Administrative Notes ==
Line 517: Line 438:
=== Credit and Notes ===
=== Credit and Notes ===


This syllabus was inspired by, and borrows with permission from, a syllabus from an earlier version of this class taught by [http://www.com.washington.edu/foot/ Kirsten Foot] in Spring 2014.
This syllabus has, in ways that should be obvious, borrowed and built on the [https://www.openintro.org/stat/index.php OpenIntro Statistics curriculum]. In the sense that he used the same two textbooks, I also drew some inspiration and confidence from Tom S. Clark's [http://www.tomclarkphd.com/teaching/POLS508F14.pdf syllabus for POLS 508: Data Analysis in Fall 2014].

Latest revision as of 05:19, 11 March 2017

Advanced Statistical Methods in Communication: Statistics and Statistical Programming
COM521 - Department of Communication, University of Washington
Instructor: Benjamin Mako Hill (University of Washington)
Course Websites:
Course Catalog Description:[1]
Discusses complexities in quantitative research on communication. Focus on multivariate data design and analysis, including multiple and logistic regression, ANOVA and MANOVA, and factor analysis.

Overview and Learning Objectives

This course is the second course in a two-quarter quantitative methods sequence in the University of Washington's Department of Communication MA/PhD program. The first course (COM 520) is an introduction to quantitative social science in communication and focuses primarily on what you might think of as the "soft skills" associated with doing social science: conceptualization, the operationalization of quantifiable variables, and the design of quantitative analyses. That course introduces some univariate and bivariate statistics at the end and briefly touches on linear regression. That said, all of the statistical work in that course is done using tools that students already know (e.g. spreadsheet software like LibreOffice, Google Sheets, or Microsoft Excel). This class assumes that students have taken COM 520, that they understand what is involved in describing and testing social scientific theories with data, and that the basic terminology of quantitative social science is familiar.

This course (COM 521) is focused on technical skill-building and aims to be a get-your-hands-dirty introduction to statistics and statistical programming. The point of the course is to give you the mathematical and technical tools to carry out your own statistical analyses. Through the process, we're going to try to help you become more sophisticated consumers of quantitative research.

Although we'll be doing some math in the course, this is not a math class. I am going to assume you're familiar with basic algebra and arithmetic. This course will not require knowledge of calculus. In general, we're not going to cover the math behind the techniques we'll be covering. Unlike many statistics classes, I'm definitely not going to be doing proofs on the board. Instead, the class is unapologetically focused on the application of statistical methodology. In that sense, the goal of this course is to create informed consumers of quantitative methodology, not producers of new types of methods. My goal is to train producers of social scientific research who use statistics as a means toward an end.

This course does not seek to be the last stats class you take. I started grad school having not taken a math class since high school (basically) and took 12 different statistics and math courses over the course of my time in graduate school. Honestly, I wish I had done more. What this class seeks to do is give you a solid basis on which to build statistical knowledge. Anyone who finishes this class should feel comfortable moving on to take advanced classes in CSSS (classes above 510 on this list) and to start building toward a Statistics Concentration in the Department of Communication MA/PhD Program or a similar CSSS certificate/track in another department.

We'll cover these basic statistical techniques: t-tests; chi-squared tests; ANOVA, MANOVA, and related methods; linear regression; and, at the end, logistic regression.

I will consider the course a complete success if every student is able to do all of these things at the end of the quarter:

  • Carry out a complete analysis of a quantitative research project, start to finish.
  • Read, modify, and create short programs in the GNU R statistical programming language.
  • Feel comfortable reading papers that use basic statistical techniques.
  • Feel comfortable and prepared enrolling in future statistics courses in CSSS.

Why Statistical Programming?

This class will focus much more on statistical programming in R than most similar classes. Most similar classes in communication focus on an easier-to-use statistical package like SPSS.

We're focusing on programming instead of a package like SPSS for several reasons:

  • Students who understand a programming language won't be limited to the "canned" functions in the off-the-shelf packages.
  • Pedagogically, programming supports students in building a deeper understanding of the mathematics and assumptions behind the canned functions, both by allowing them to read the code "behind" the canned functions and by allowing them to implement the functions themselves in assignments.
  • Analyses composed of code instead of clicks support reproducible research: the code documents every step of an analysis, including data cleaning and conversion, where errors are common and very difficult to detect.
  • Programming is a skill that is in demand in our department and in our discipline more generally, and I strongly believe it is generally useful.

Of course, there are other programming languages well suited to statistics, including Stata and Python. Ultimately, I'm teaching R because the faculty in the department who are likely to teach statistics classes in this sequence going forward got together and reached a consensus that R made the most sense.

Our reasoning was that:

  • R is freely available and open source
  • R is becoming the most widely used package in statistical fields and is (by our estimate) used by most academics in my cohort or later in statistics, political science, and economics already.
  • R is the system (along with Stata) that will be in other CSSS advanced stats classes we hope students will continue to take after COM521.
  • R is a better general-purpose programming language than software like Stata, which means that R programming skills will let students solve non-statistical problems like collecting data from the web and will make it easier to learn other programming languages.

For students with a strong psychometric focus, or whose research will be limited to linear and logistic regression or ANOVA on small pre-collected datasets and similar, SPSS will likely be fine. R has a higher barrier to entry than SPSS but its ceiling is much higher.

Note About This Syllabus

You should expect this syllabus to be a dynamic document and you will notice that there are a few places marked "To Be Determined." Although the core expectations for this class are fixed, the details of readings and assignments will shift. As a result, there are three important things to keep in mind:

  1. Although details on this syllabus will change, I will not change readings or assignments less than one week before they are due. If I don't fill in a "To Be Determined" one week before it's due, it is dropped. If you plan to read more than one week ahead, contact me first.
  2. Closely monitor your email or the announcements section on the course website on Canvas. When I make changes, these changes will be recorded in the history of this page so that you can track what has changed and I will summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
  3. I will ask the class for voluntary anonymous feedback frequently — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback.

Books and Resources

Although I've never taught with a textbook in a proper sense, statistics is very well covered terrain and, as a result, there is an enormous amount of excellent curricular material out there that I think we would be wise to build from. As a result, this class is going to use two textbooks:

  • Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. 2015. OpenIntro Statistics. 3rd edition. OpenIntro, Inc. (PDF; Tablet-friendly PDF; Other)
  • Verzani, John. 2014. Using R for Introductory Statistics. 2nd edition. Boca Raton: Chapman and Hall/CRC. (Various Sources; Amazon)

Diez, Barr, and Çetinkaya-Rundel's book is a free, and freely-licensed, online statistics textbook. Over the last seven years, the book has also developed a large online community of students and teachers who have shared other resources. The book, lecture notes, and more are all freely licensed, which has allowed the text to be adapted in a series of different fields. The book is excellent and it has been adopted extraordinarily widely. You can buy versions from Amazon in either full color hardcover ($19.99) or in black and white paperback ($7.60). I haven't purchased a paper copy so I can't speak to the quality of either.

Verzani's book is an introduction to the R programming language. It's designed to be used as a companion to a basic introductory statistics textbook (like OpenIntro). It's a poor stand-alone text but it will provide good resources for the material we're covering in the course and it should act as a good reference going forward. The book is available online for about $50.

Although it's not required for the course, I want to point you to these two books. When I was learning R, these both were very useful references:

There are also three non-textbook resources I wanted to point you to that are invaluable:

  • Baggott's R Reference Card v2 — When I was learning R, I literally took a similar reference card with me everywhere and looked at it dozens of times a day.
  • StackOverflow R Tag — Somebody already had your question about how to do X in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise.
  • Rseek — Rseek is a modified version of Google that searches only R websites. Sometimes, R is hard to search for because R is a common letter. This has become much easier over time as R has become more popular, but it might still be the case sometimes and Rseek is a good solution.

Assignments

The assignments in this class are designed to give you an opportunity to try your hand at using the conceptual material taught in the class. There will be no exams or quizzes. Unless otherwise noted, all assignments are due at the end of the day (i.e., 11:59pm on the day they are due).

Weekly Problem Sets and Participation

Each week I will post a problem set with a list of questions. Some of these will be drawn from the textbooks and some will be ones I design or write. The questions will cover:

  • Statistics questions — These will be questions about statistics from the OpenIntro sections as well as any empirical papers that are listed as required for that day.
  • Programming challenges — These will be R programming problems that cover material from the Verzani text that was listed as required from the previous session.

I won't be grading these assignments and I won't be asking you to turn in anything for the statistics questions portion of the weekly assignment. That said, we will spend a good chunk of class each day going through the answers to the questions due on that day.

Because randomness is an extremely important concept in statistics, I will use a small R program to randomly cold call on students in the class to walk through your "answer" to each question and explain your reasoning to the class. We'll then have an opportunity to discuss the different approaches as a group. I don't promise to ask all of these questions in class (especially if it's clear that folks get the point). Although I might ask them, I won't cold call for questions that are not on the list.
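A minimal sketch of what a cold-calling program like this could look like in R. The roster and question labels below are hypothetical placeholders, and this is not necessarily the program used in class; it only relies on base R's sample() function.

```r
# Hypothetical roster and question list; neither comes from the syllabus.
students <- c("Ada", "Grace", "Karl", "Florence", "Ronald")
questions <- c("Q1", "Q2", "Q3")

# Draw one student per question. Sampling with replacement means the
# same person can be called on more than once in a session.
cold_call <- function(students, questions) {
  picks <- sample(students, size = length(questions), replace = TRUE)
  data.frame(question = questions, student = picks,
             stringsAsFactors = FALSE)
}

set.seed(521)  # optional: fixing the seed makes the draw repeatable
calls <- cold_call(students, questions)
print(calls)
```

Sampling with replacement is an assumption here; setting replace = FALSE would instead guarantee that nobody is called twice, as long as there are at least as many students as questions.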

For the programming challenges, I will ask that everybody share code for any solutions to programming problems before class so we can walk through them in class. If you get completely stuck on a problem and cannot "solve" it, that's OK, but share the code that you do have so that you can walk us through what you did and what you were thinking.

Although the problem sets are not going to be graded, it is critical that you be at class and that you be able to discuss your answers to each of the questions. Your ability to do these latter two things will be reflected in your participation grade for the course, which makes up a full 40% of your grade. I can't emphasize enough how important it will be to be in class.

I'm not going to form groups for you but it's totally fine with me if you want to work on these problem sets in small groups.

The "Participation Rubric" section of my page on assessment gives the details on how I evaluate participation in my classes. If you sense a conflict between material in this section and material on that page, you can safely assume that the syllabus takes precedence.

Research Project

As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all:

  • Design and describe a social scientific study — You should all have experience doing this at least once in COM520. The study you design should involve quantitative analysis and should be something you can complete at least a first pass at over the course of this quarter.
  • Find a dataset — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or a re-analysis of a previously collected dataset.
  • Engage in descriptive data analysis — Use R to create descriptive statistics and visualizations to describe your data.
  • Test hypotheses about relationships between two or more variables
  • Report your findings — I'll expect you all to report your findings in both a short paper and a short presentation.
  • Ensure replicability — I'll expect you all to provide code and data for your analysis in a way that makes your work replicable by other researchers.
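To make the descriptive and hypothesis-testing steps above concrete, here is a small sketch in R. It uses the mtcars dataset that ships with R purely as a stand-in for your own data; the choice of variables (mpg and am) and of a two-sample t-test are illustrative assumptions, not requirements for the project.

```r
# mtcars ships with base R; it stands in here for a project dataset.
data(mtcars)

# Descriptive statistics for one numeric variable.
summary(mtcars$mpg)
sd(mtcars$mpg)

# Frequency table for a categorical variable
# (am: 0 = automatic, 1 = manual transmission).
table(mtcars$am)

# A simple visualization of the relationship between the two.
boxplot(mpg ~ am, data = mtcars,
        xlab = "Transmission (0 = automatic, 1 = manual)",
        ylab = "Miles per gallon")

# Test a hypothesis about the relationship between two variables:
# does mean mpg differ between the two transmission types?
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)
```

Keeping every step in a script like this, from loading the data through the final test, is also what makes the analysis replicable by someone else.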

Although it's not required, I strongly urge each of you to take this opportunity to produce a document that will further your academic career outside of the class. There are many ways that this can happen, but the obvious ones are that the paper is something you can submit for publication to a journal or conference, that it provides preliminary analysis for or acts as a pilot analysis that you can report in a grant proposal or thesis proposal, and/or that it serves as part of your master's thesis or dissertation.

Project and Dataset Identification

Due Date: January 17
Maximum paper length: 500 words (~1-2 pages)
Deliverables: Turn in on Canvas

Early on, I want you to identify and describe your final project. Your proposal should be short and can be either paragraphs or bullets. It should include the following things:

  • A one paragraph abstract of the proposed study and research question, theory, community, and/or groups you plan to study.
  • A short description of how the project will fit into your career trajectory.
  • An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain when you will have access to the data.

Final Project Outline

Outline Due Date: February 21
Maximum outline length: 5 pages
Deliverables: Turn in on Canvas

The outline should have the following sections: (a) Rationale; (b) Objectives, including (b.1) General Objectives and (b.2) Specific Objectives; (c) Null hypotheses; (d) Conceptual Diagram; (e) Measures; (f) Dummy Tables.

An excellent example from my partner Mika Matsuzaki is online in Canvas. Your diagram will likely be much less complicated than Matsuzaki's. Also, please don't be distracted by the fact that Mika does public health. It's the basic form I want you all to emulate, not the content. You can read the published paper to compare.

The example includes everything except a "Measures" section. Your Measures section only needs to include a two-column table where column 1 is the name of each variable in your analysis and column 2 is the specific operationalization of that measure and a description of how you will create it.

Final Project

Paper Due Date: March 19
Maximum length: 6000 words (~20 pages)
Presentation Date: March 14
All Deliverables: Turn in on Canvas

I'm expecting you to produce a draft of a short research paper that, after some additional work, you could consider submitting for publication. I'm also very open to the project being a part of a dissertation. I don't expect the papers to be publication ready but I do expect them to have well considered drafts of all the necessary pieces in terms of quantitative methodology.

Because the emphasis in this class is on statistics and methodology and because I'm not an expert in each of your areas or fields, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is so important. Instead of providing all of these details, feel free to start with a brief summary of the purpose and importance of this research, and an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts.

I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.

In terms of content:

Grading

I have put together a very detailed page that describes the grading rubric I will be using in this course. Please read it carefully. I will assign grades for each of the following items on the UW 4.0 grade scale according to the weights below:

  • Participation: 40%
  • Proposal identification: 5%
  • Final paper outline: 5%
  • Final Presentation: 10%
  • Final Paper: 40%
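As a worked illustration of how these weights combine, here is a short R sketch. The weights are the ones listed above; the component grades are made-up values on the 4.0 scale, and treating the final grade as a simple weighted sum (with no rounding rule) is an assumption on my part rather than something stated in the rubric.

```r
# Weights from the grading list above; they should sum to 1.
weights <- c(participation = 0.40, proposal = 0.05, outline = 0.05,
             presentation = 0.10, paper = 0.40)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical component grades on the UW 4.0 scale.
grades <- c(participation = 3.8, proposal = 4.0, outline = 3.5,
            presentation = 3.7, paper = 3.6)

# Weighted sum; indexing by name keeps the components aligned
# even if the two vectors are written in a different order.
final_grade <- sum(weights * grades[names(weights)])
print(final_grade)
```

With these made-up numbers the weighted sum is 3.705. Participation and the final paper together account for 80% of it, so those two components dominate the result.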

Finding a Dataset

In order to complete your project, you will each need a dataset. If you are at the stage of your career where you already have a dataset, great! If not, there are many datasets to draw from. Here are some ideas:

  • Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use?
  • If there's an author of a study you loved, you can send a polite email asking if they are able or willing to share an archival or replication version of the dataset used in their paper. Be very polite and make it clear that this is starting as a class project but that it might turn into a paper for publication. Make your timeline clear. In communication, replication datasets are still very rare, so be prepared for a negative answer.
  • Do some Google Scholar and normal Google searching for datasets in your research area. You'd be surprised at what's available.
  • Take a look at datasets available in the Harvard Dataverse (the largest collection of social science research data) or one of the other members of the Dataverse network.
  • Look at the collection of social scientific datasets at ICPSR (UW is a member). There are an enormous number of very rich datasets.
  • Use the ISA Explorer to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
  • Set up a meeting with Jennifer Muilenburg, the Data Curriculum and Communications Librarian who runs research data services at the UW libraries. Her email is: libdata@uw.edu. I have talked to her about this course and she is excited about meeting with you to help.
  • FiveThirtyEight.com has published a GitHub repository and an R package with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.

In general, you're responsible for making sure that you're on the right side of the human subjects rules and that your work is ethical. Class projects generally do not need IRB approval but I hope that each of your projects will turn into something more. If your study involves human subjects research, that work will need IRB oversight of some sort. In general, you can't do a class project without IRB approval and then get approval retroactively. Secondary analysis of anonymized data is generally not considered human subjects research but I strongly suggest that you get a determination from UW's Human Subjects Division before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need a faculty sponsor, that should ideally be your advisor. If that doesn't make sense for any of you, I'm happy to talk about serving as the faculty supervisor for the work.

Structure of Class

I expect everybody to come to class, every week, with their laptop and a power cord, ready to answer any question on the problem set and having uploaded and shared code for the code-related questions. The class is listed as nearly 4 hours long and, with the exception of a few short breaks, I intend to use the entire period. Be in class on time and be plugged in and ready to go.

When it comes to the statistics part of this material, this will be a primarily "flipped" classroom. What this means is that we'll be relying on the textbook and other resources to introduce the material and we'll be using the class to discuss it and answer questions that come up.

Although the structure of class will vary, it will generally include the following parts.

  1. Quick updates about assignments, projects, and a meta-discussion about the class.
  2. Discussion of programming challenges due that day.
  3. [Possibly/Sometimes] Short lecture and/or Q&A about new material in Diez, Barr, and Çetinkaya-Rundel
  4. Discussion of statistics questions related to new material in Diez, Barr, and Çetinkaya-Rundel and any exemplary empirical paper we have read to discuss.
  5. Interactive lecture introducing new statistical programming concepts.
  6. [Possibly/Sometimes] Time to begin work on next week's programming assignments.

Schedule

When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y of chapter n.

Week 1: Tuesday January 3: Introduction, Setup, and Data and Variables

Hopefully, the material in OpenIntro feels very familiar from COM520. The programming material will be new but I want you to read it before you come to class so we can work through the examples as a group.

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data)
  • Verzani: §1 (Getting Started), §2 (Univariate data) [Available with UW NetID]
  • Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111(24):8788–90. [Available through UW libraries]

Optional Readings:

  • Verzani: §A (Programming)

Assignment (Complete Before Class):

Lectures:

Resources:

Week 2: Tuesday January 10: Probability and Visualization

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §2 (Probability)
  • Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics) [Available with UW NetID]
  • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on my personal website]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 3: Tuesday January 17: Distributions

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4. You should also read the rest of the chapter (§3.3 and §3.5). I won't assign problem set questions about it, but it's still important to be familiar with.
  • Verzani: §6 (Populations)

Assignment (Complete Before Class):

Lectures:

Resources:

Week 4: Tuesday January 24: Statistical significance and hypothesis testing

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference)
  • Verzani: §7 (Statistical inference), §8 (Confidence intervals)

Assignment (Complete Before Class):

Lectures:

Resources:

Week 5: Tuesday January 31: Continuous Numeric Data & ANOVA

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data)
  • Verzani: §9 (Significance tests), §12 (Analysis of variance)
  • Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60(4):328–31. [Available through UW Libraries]
  • Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. Public Relations Review, 33(3), 340–342. https://doi.org/10.1016/j.pubrev.2007.05.016 [Available through UW Libraries]
  • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on my personal website]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 6: Tuesday February 7: Categorical data

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
  • Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
  • Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science: Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” American Scientist 102(6):460. [Available through UW Libraries] (This is a reworked version of this unpublished manuscript, which provides more detailed examples.)
  • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on my personal website]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 7: Tuesday February 14: Linear Regression

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression); §8.1-8.3 (Multiple regression)
  • OpenIntro eschews a mathematical introduction to correlation. Please look over the Wikipedia article on correlation and dependence and pay attention to the formulas. Correlation is tedious to compute by hand but I'd like you to at least see what goes into it.
  • Verzani: §11.1-2 (Linear regression)
  • Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04), 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [Available in UW libraries]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 8: Tuesday February 21: Polynomial Terms, Interactions, and Logistic Regression

Required Readings:

  • Lesson 8: Categorical Predictors and Lesson 9: Data Transformations from the PennState Eberly College of Science STAT 501 Regression Methods Course. There are several subparts (many quite short), please read them all carefully.
  • Diez, Barr, and Çetinkaya-Rundel: §8.4 (Multiple and logistic regression)
  • Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
  • Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2(8):e124. [Open Access]
  • Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04), 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [Available in UW libraries]

Optional Readings:

  • Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” PLOS Biology 13(3):e1002106. [Open Access]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 9: Tuesday February 28: Consulting Meetings

We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.

Week 10: Tuesday March 7: Consulting Meetings

We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.

Week 11: March 14: Final Presentations

Administrative Notes

Attendance

As detailed in my page on assessment, attendance in class is expected of all participants. If you need to miss class for any reason, please contact me ahead of time (email is best). Multiple unexplained absences will likely result in a lower grade or (in extreme circumstances) a failing grade. In the event of an absence, you are responsible for obtaining class notes, handouts, assignments, etc.

Office Hours

I will not hold regular office hours. In general, I will be available to meet after class. Please contact me by email to arrange a meeting then or at another time.

Accommodations

In general, if you have an issue, such as needing an accommodation for a religious obligation or learning disability, speak with me before it affects your performance; afterward it is too late. Do not ask for favors; instead, offer proposals that show initiative and a willingness to work.

To request academic accommodations due to a disability please contact Disability Resources for Students, 448 Schmitz, 206-543-8924/V, 206-543-8925/TTY. If you have a letter from Disability Resources for Students indicating that you have a disability that requires academic accommodations, please present the letter to me so we can discuss the accommodations that you might need for the class. I am happy to work with you to maximize your learning experience.

Academic Misconduct

I am committed to upholding the academic standards of the University of Washington’s Student Conduct Code. If I suspect a student violation of that code, I will first engage in a conversation with that student about my concerns.

If we cannot successfully resolve a suspected case of academic misconduct through our conversations, I will refer the situation to the Department of Communication advising office, which can then work with the COM Chair to seek further input and, if necessary, move the case up through the College.

While evidence of academic misconduct may result in a lower grade, I will not unilaterally lower a grade without addressing the issue with you first through the process outlined above.

Credit and Notes

This syllabus has, in ways that should be obvious, borrowed and built on the OpenIntro Statistics curriculum. In the sense that he used the same two textbooks, I also drew some inspiration and confidence from Tom S. Clark's syllabus for POLS 508: Data Analysis in Fall 2014.