User:Aaronshaw/Stats course

Statistics and Statistical Programming
MTS 525 Media, Technology & Society, Northwestern University
Instructor: Aaron Shaw (Northwestern University)
Course Websites:


Overview and learning objectives

This course provides a get-your-hands-dirty introduction to statistics and statistical programming, mostly for applications in the social sciences and social computing. My main objectives are for all participants to acquire the conceptual, technical, and practical skills to conduct their own statistical analyses and to become more sophisticated consumers of quantitative research in communication, HCI, and adjacent disciplines.

I will consider the course a complete success if every student is able to do all of the following things at the end of the quarter:

  • Design and execute a complete quantitative research project, start to finish.
  • Read, modify, and create short programs in the R statistical programming language.
  • Feel comfortable reading and interpreting papers that use basic statistical techniques.
  • Feel comfortable and prepared to enroll in more specialized and advanced statistics courses.

The course will cover the following techniques: t-tests; chi-squared tests; ANOVA, MANOVA, and related methods; linear regression; and logistic regression. We will also consider salient issues in quantitative research such as reproducibility and "the statistical crisis in science." We may cover other topics as time and interest allow.

The course materials will consist of readings, problem sets, and recorded lectures and screencasts (some created by me, some created by other people). The course requirements will emphasize active participation and self-evaluation and will include a final project focused on the design and execution of an original piece of quantitative research. We will use the R programming language for all examples and assignments.

You are not required to know much about statistics or statistical programming to take this class. I will assume some (very little!) knowledge of the basics of empirical research methods and design, basic algebra and arithmetic, and a willingness to work to learn the rest. In general we are not going to cover the math behind the techniques we'll be learning. Although we may do some math, this is not a math class. This course will also not require knowledge of calculus or matrix algebra. I will *not* do proofs on the board. Instead, the class is unapologetically focused on the application of statistical methods. Likewise, while some exposure to R, other programming languages, or other statistical computing resources will be helpful, it is absolutely not assumed.

Why this course? Why statistical programming? Why R?

Many comparable courses in statistics and quantitative methods do not focus on statistical programming and use easier-to-learn statistical software than R. So why bother? By learning statistical programming you will gain a deeper understanding of both the principles behind your analysis techniques and the tools you use to apply them. In addition, a solid grasp of statistical programming will prepare you to create reproducible research, avoid common errors, and make your work more durable and valid.

Other programming languages are also well suited to statistics, including Stata and Python. Ultimately, I teach (and use) R for a few reasons:

  • R is freely available and open source.
  • R is becoming the most widely used package in statistics and many social science fields.
  • R (along with Stata) will be used in most other advanced stats classes I hope you will take after this course.
  • R is a better general-purpose programming language than software like Stata, which means that R programming skills will let you solve non-statistical problems and will make it easier to learn other programming languages like Python.

A note about this syllabus

This syllabus will be a dynamic document that will evolve throughout the quarter. Although the core expectations are fixed, the details will shift. As a result, please keep in mind the following:

  1. I will not add readings or assignments less than one week before they are due. If I don't fill in a "To Be Determined" one week before it's due, it is dropped. If you plan to read more than one week ahead, contact me first.
  2. Closely monitor your email and/or the announcements section on the course website on Canvas. When I make changes, these changes will be recorded in the history of this page so that you can track what has changed. I will also do my best to summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
  3. I will ask the class for voluntary anonymous feedback — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback.

Books and resources

This class will use a freely-licensed textbook:

  • Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. 2015. OpenIntro Statistics. 3rd edition. OpenIntro, Inc. (PDF; Tablet-friendly PDF; Other)

The textbook (in any format) is required material for the course. You can download it at no cost and/or buy (affordable!) hard copy versions in either full color hardcover or black and white paperback. The book is excellent and has been adopted widely. It has also developed a large online community of students and teachers who have shared other resources. Lecture slides, videos, notes, and more are all freely licensed (many through the website and others elsewhere).

I am also assigning several chapters from the following book:

  • <TODO> Reinhardt book.

This book provides a conceptual introduction to some common failures in statistical analysis that you need to learn to recognize and avoid. It was also written by a Ph.D. student.

A few other (optional) books may be useful resources while you're learning to analyze, visualize, and interpret statistical data with R:

There are also some non-textbook resources that are invaluable:

  • Baggott's R Reference Card v2 — Print this out. Take it with you everywhere and look at it dozens of times a day. You will learn the language faster!
  • StackOverflow R Tag — Somebody already had your question about how to do X in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time, but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise.
  • Rseek — Rseek is a modified version of Google that searches only R-related websites. R can be hard to search for because "R" is a single, common letter. This has become much easier as R has grown more popular, but when a regular search fails, Rseek is a good solution.
  • <TODO> ggplot2 documentation — ggplot2 is a powerful data visualization package for R that I recommend highly. The documentation is indispensable for learning how to use it.

Assignments

The assignments in this class focus on applied statistical research design, analysis, and interpretation. There will be no graded exams or quizzes. Unless otherwise noted, all assignments are due at the end of the day (i.e., 11:59pm on the day they are due).

Weekly problem sets and participation

Each week I will post a problem set. Some of these will be taken from the textbooks and some will not. They will include:

  • Statistics questions about statistical concepts, principles, and interpretation.
  • Programming challenges that you must solve using R.
  • Empirical paper questions about other assigned readings.

You should submit your solutions to the programming challenges ahead of each class session. While I will not grade them, we will spend a good chunk of class going through the answers to the assignment due on that day.

Because randomness is extremely important in statistics, I will use a small R program to randomly call on students to walk through your answer to statistics questions and empirical paper questions in class. We'll then discuss the answers, address points of confusion, and consider alternative approaches as a group.
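As an illustration of what such a program might look like, here is a minimal sketch in R. The roster, the helper name `call_order`, and the seed value are all hypothetical and are not the actual program used in class. The sketch shuffles the roster so that everybody is called once before anybody is called twice.

```r
# Hypothetical roster; in practice this would be the real class list.
students <- c("Student A", "Student B", "Student C", "Student D", "Student E")

# Hypothetical helper: return the roster in a random order.
# Setting a seed is optional and only makes the shuffle reproducible.
call_order <- function(roster, seed = NULL) {
  if (!is.null(seed)) {
    set.seed(seed)
  }
  sample(roster)  # sample() without a size argument returns a random permutation
}

order_today <- call_order(students, seed = 525)
print(order_today)
```

Calling `call_order(students)` without a seed gives a fresh random order each session.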

For the programming challenges, you should submit code for your solutions before class (more on how in a moment) so we can walk through the material together. If you get completely stuck on a problem, that's okay, but please share whatever code you have so that you can tell us what you did and what you were thinking.

Coming to class will be profoundly important to learning the material and to your final grade. Although the problem sets will not be graded, it is critical that you be present and able to discuss your answers to each of the questions. Your ability to do so will figure prominently in your participation grade for the course (40% of your final grade). More on

I encourage you to form groups to work on the problem sets if you find that helpful; however, you must still submit your work individually to help ensure that you learn and understand the material.

<TODO create rubric?> The "Participation Rubric" section of my page on assessment gives the details on how I evaluate participation in my classes. If you sense a conflict between material in this section and material on that page, you can safely assume that the syllabus takes precedence.

Research project

As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all:

  • Design and describe a plan for a study — The study you design should involve quantitative analysis and should be something you can complete at least a first pass on during this quarter.
  • Find a dataset — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or an analysis of a previously collected dataset.
  • Engage in descriptive data analysis — Use R to calculate descriptive statistics and visualizations to describe your data.
  • Test at least one hypothesis about relationships between two or more variables.
  • Report and interpret your findings — You will do this in both a short paper and a short presentation.
  • Ensure that your work is replicable — You will need to provide code and data for your analysis in a way that makes your work replicable by other researchers.
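To make the descriptive analysis and hypothesis testing steps concrete, here is a minimal sketch in R. It uses the `mtcars` dataset that ships with R purely as a stand-in for your own data, and the choice of variables (`mpg` and `am`) and of a two-sample t-test is an assumption made for illustration, not a requirement of the project.

```r
# Stand-in data: mtcars is bundled with base R. Your project will use your own dataset.
data(mtcars)

# Descriptive statistics for one numeric variable (miles per gallon).
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Group means of mpg by transmission type (am: 0 = automatic, 1 = manual).
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)

# Hypothesis test: do mean mpg values differ between the two transmission types?
# t.test() with a formula runs Welch's two-sample t-test by default.
result <- t.test(mpg ~ am, data = mtcars)
print(result)
```

Keeping every step in a script like this, and sharing the script along with the data, is also what makes the analysis replicable. A call such as `boxplot(mpg ~ am, data = mtcars)` would supply a matching visualization.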

I strongly urge you to produce a project that will further your academic career outside of the class. There are many ways that this can happen. Some obvious options are to prepare a project that you can submit for publication, that you can use as a pilot analysis to report in a grant or thesis proposal, and/or that fulfills a degree requirement.

There are several intermediate milestones and deadlines to help you accomplish a successful research project. Unless otherwise noted, all deliverables should be submitted via Canvas.

Project plan and dataset identification

Due date: <TBA>
Maximum length: 500 words (~1-2 pages)

Early on, I want you to identify and describe your final project. Your description should be short and can be either paragraphs or bullets. It should include the following:

  • An abstract of the proposed study including the topic, research question, theoretical motivation, object(s) of study, and anticipated research contribution.
  • An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain why and when you will.
  • A short (several sentences?) description of how the project will fit into your career trajectory.

Project planning document

Due date: <TBA>
Maximum length: 5 pages

The project planning document is a basic shell/outline of an empirical quantitative research paper. Your planning document should have the following sections: (a) Rationale; (b) Objectives; (b.1) General objectives; (b.2) Specific objectives; (c) Null hypotheses; (d) Conceptual diagram and/or explanation of the relationship you plan to test; (e) Measures; (f) Dummy tables. Descriptions of each of these planning document sections are available on this wiki page.

An exemplary planning document from public health researcher Mika Matsuzaki is online in Canvas. Your diagram will likely be much less complicated than Matsuzaki's. Also, please don't be distracted by the fact that Matsuzaki does public health research. You can (and should!) emulate the form rather than the content. You can also check out the published paper to see how the project wound up.

Please note that the Matsuzaki planning document includes everything except a "Measures" section. Your Measures section should include a two-column table where column 1 is the name of each variable in your analysis and column 2 describes the operationalization of each measure and (if necessary) how you will create it.

Project presentation and paper

Paper due date: <TBA>
Paper maximum length: 6000 words (~20 pages)
Presentation due date: <TBA>
Presentation maximum length: <TBA> minutes


The paper: Ideally, I expect you to produce a high quality short research paper that you might revise and submit for publication and/or a dissertation milestone. I do not expect the paper to be ready for publication, but it should contain polished drafts of all the necessary components of a scholarly quantitative empirical research study. In terms of the structure, please see the page on the structure of a quantitative empirical research paper.

As noted above, you should also provide data, code, and any documentation sufficient to enable the replication of all analysis and visualizations. This can happen through GitHub. If that is not possible or appropriate for some reason, please talk to me so that we can find another solution.

Because the emphasis in this class is on statistics and methods and because I'm not an expert in each of your fields, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is important. As a result, you need not focus on these elements of the work in your written submission. Instead, feel free to start with a brief summary of the purpose and importance of this research followed by an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts.

I have a strong preference for you to write the paper individually, but I'm open to the idea that you may want to work with others in the class.

I do not have strong preferences about the style or formatting guidelines you follow for the paper and its bibliography. However, your paper must follow a standard format (e.g., <TODO link> ACM SIGCHI CSCW format or <TODO link> APA 6th edition) that is applicable for the journal or conference in which you aim to publish the work (they all have formatting or submission guidelines published online and you can follow them). This includes the references. I also strongly recommend that you use reference management software to handle your bibliographic sources.

The presentation: The presentation will provide an opportunity to share a brief summary of your project and findings with the other members of the class. Since you will all give other research presentations throughout your career, I strongly encourage you to take the opportunity to refine your academic presentation skills. The document Creating a Successful Scholarly Presentation (link is in Canvas) will likely be useful.

Grading

<TODO decide/update?> I have put together a very detailed page that describes the grading rubric I will be using in this course. Please read it carefully.

I will assign grades (usually a numeric value ranging from 0-10) for each of the following aspects of your performance. The percentage value listed next to each item is the weight that will be applied to calculate your overall grade for the course.

  • Participation: 40%
  • Proposal identification: 5%
  • Final project planning document: 5%
  • Final project presentation: 10%
  • Final project paper: 40%
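As a rough illustration of how these weights combine, here is a minimal sketch in R. It assumes the overall grade is a simple weighted sum of the component scores on the 0-10 scale. The component scores and the variable names are hypothetical.

```r
# Weights taken from the list above, expressed as proportions.
weights <- c(
  participation = 0.40,
  proposal      = 0.05,
  planning      = 0.05,
  presentation  = 0.10,
  paper         = 0.40
)

# Sanity check: the weights should account for the whole grade.
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical component scores on the 0-10 scale.
scores <- c(
  participation = 9,
  proposal      = 8,
  planning      = 10,
  presentation  = 7,
  paper         = 8.5
)

# Weighted sum (an assumption about how the weights are applied).
# Indexing by name keeps each score aligned with its weight.
overall <- sum(weights * scores[names(weights)])
print(overall)
```

With these made-up scores the weighted sum works out to 8.6 out of 10. Because participation and the final paper each carry 40%, they dominate the result.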

Note on finding a dataset

In order to complete your project, you will each need a dataset. If you already have a dataset for the project you plan to conduct, great! If not, there are many datasets to draw from. Here are some ideas:

  • Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use? Other relationships you could analyze?
  • If there's an important study you loved, you can send a polite email to the author(s) asking if they are willing and able to share an archival or replication version of the dataset used in their paper. Be very polite and make it clear that this is starting as a class project, but that it might turn into a paper for publication. Make your timeline clear. In Communication and HCI, replication datasets are still very rare, so be prepared for a negative answer and/or questions about your motives in conducting the analysis.
  • Do some Google Scholar and normal internet searching for datasets in your research area. You'll probably be surprised at what's available.
  • Take a look at datasets available in the Harvard Dataverse (a very large collection of social science research data) or one of the other members of the Dataverse network.
  • Look at the collection of social scientific datasets at ICPSR at the University of Michigan (NU is a member). There are an enormous number of very rich datasets.
  • Use the ISA Explorer to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
  • <TODO fix/update accordingly> Set up a meeting with Jennifer Muilenburg, the Data Curriculum and Communications Librarian, who runs research data services at the UW Libraries. Her email is libdata@uw.edu. I have talked to her about this course and she is excited about meeting with you to help.
  • FiveThirtyEight.com has published a GitHub repository and an R package with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.

Human subjects research, IRB, and ethics

In general, you are responsible for making sure that you're on the right side of the IRB requirements and that your work meets applicable ethical norms and standards.

Class projects generally do not need IRB approval, but research for publications, dissertations, and sometimes even pilot studies generally falls under IRB purview. You should not plan to seek IRB approval/determination retroactively. If your study may involve human subjects and you may ever publish it in any form, you will need IRB oversight of some sort.

Secondary analysis of anonymized data is generally not considered human subjects research, but I strongly suggest that you get a determination from [LINK the Northwestern IRB] before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need to list a faculty sponsor or Principal Investigator, that should ideally be your advisor. If that doesn't make sense for some reason, please talk to me.

Structure of Class

I expect everybody to come to class, every week, with a laptop and a power cord, ready to answer any question on the problem set and having uploaded code related to the programming questions. The class is listed as nearly 3 hours long and, with the exception of short breaks, I intend to use the entire period. Please be in class on time, plugged in, and ready to go.

When it comes to the statistics material, this will mostly be a so-called "flipped" classroom. This means we will rely on the textbook and other resources to introduce the material and we will use the class sessions to discuss questions as they come up.

Although the day-to-day routine will vary, each class session will generally include the following:

  • Quick updates about assignments, projects, and meta-discussion about the class.
  • Discussion of programming challenges due that day.
  • [Sometimes] Short lecture and/or Q&A about new material in Diez, Barr, and Çetinkaya-Rundel.
  • Discussion of statistics questions related to new material in Diez, Barr, and Çetinkaya-Rundel.
  • Discussion of any exemplary empirical paper we have read.
  • [Sometimes] Interactive lecture introducing new statistical programming concepts.

Schedule

When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y of chapter n.

Week 1: Tuesday January 3: Introduction, Setup, and Data and Variables

Please complete the readings prior to class so that we can discuss them and start talking through some of the examples in R together.

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data)
  • Verzani: §1 (Getting Started), §2 (Univariate data) [Available with UW NetID]
  • Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111(24):8788–90. [Available through UW libraries]

Optional Readings:

  • Verzani: §A (Programming)

Assignment (Complete Before Class):

Lectures:

Resources:

Week 2: Tuesday January 10: Probability and Visualization

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §2 (Probability)
  • Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics) [Available with UW NetID]
  • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on my personal website]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 3: Tuesday January 17: Distributions

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4. You should also read the rest of the chapter (§3.3 and §3.5). I won't assign problem set questions about those sections, but the material is still important to be familiar with.
  • Verzani: §6 (Populations)

Assignment (Complete Before Class):

Lectures:

Resources:

Week 4: Tuesday January 24: Statistical significance and hypothesis testing

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference)
  • Verzani: §7 (Statistical inference), §8 (Confidence intervals)

Assignment (Complete Before Class):

Lectures:

Resources:

Week 5: Tuesday January 31: Continuous Numeric Data & ANOVA

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data)
  • Verzani: §9 (Significance tests), §12 (Analysis of variance)
  • Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60(4):328–31. [Available through UW Libraries]
  • Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. Public Relations Review, 33(3), 340–342. https://doi.org/10.1016/j.pubrev.2007.05.016 [Available through UW Libraries]
  • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on my personal website]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 6: Tuesday February 7: Categorical data

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
  • Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
  • Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science: Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” American Scientist 102(6):460. [Available through UW Libraries] (This is a reworked version of this unpublished manuscript, which provides more detailed examples.)
  • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on my personal website]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 7: Tuesday February 14: Linear Regression

Required Readings:

  • Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression); §8.1-8.3 (Multiple regression)
  • OpenIntro eschews a mathematical introduction to correlation. Please look over the Wikipedia article on correlation and dependence and pay attention to the formulas. Correlation is tedious to compute by hand, but I'd like you to at least see what goes into it.
  • Verzani: §11.1-2 (Linear regression)
  • Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04), 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [Available in UW libraries]
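To see what goes into a correlation coefficient, here is a minimal sketch in R that computes Pearson's r directly from its standard definition (the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations) and compares it against R's built-in `cor()`. The use of `mtcars` and of the `wt` and `mpg` variables is an arbitrary choice for illustration.

```r
# Stand-in data from base R: car weight and miles per gallon.
x <- mtcars$wt
y <- mtcars$mpg

# Deviations of each observation from its variable's mean.
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products over the square root of the
# product of the sums of squared deviations.
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function uses Pearson's method by default.
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree up to floating-point error, which is a handy check that the formula has been transcribed correctly.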

Assignment (Complete Before Class):

Lectures:

Resources:

Week 8: Tuesday February 21: Polynomial Terms, Interactions, and Logistic Regression

Required Readings:

  • Lesson 8: Categorical Predictors and Lesson 9: Data Transformations from the PennState Eberly College of Science STAT 501 Regression Methods Course. There are several subparts (many quite short), please read them all carefully.
  • Diez, Barr, and Çetinkaya-Rundel: §8.4 (Multiple and logistic regression)
  • Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
  • Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2(8):e124. [Open Access]
  • Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04), 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [Available in UW libraries]

Optional Readings:

  • Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” PLOS Biology 13(3):e1002106. [Open Access]

Assignment (Complete Before Class):

Lectures:

Resources:

Week 9: Tuesday February 28: Consulting Meetings

We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.

Week 10: Tuesday March 7: Consulting Meetings

We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.

Week 11: March 14: Final Presentations

Administrative Notes

Attendance

As detailed in my page on assessment, attendance in class is expected of all participants. If you need to miss class for any reason, please contact me ahead of time (email is best). Multiple unexplained absences will likely result in a lower grade or (in extreme circumstances) a failing grade. In the event of an absence, you are responsible for obtaining class notes, handouts, assignments, etc.

Office Hours

I will not hold regular office hours. In general, I will be available to meet after class. Please contact me by email to arrange a meeting then or at another time.

Accommodations

In general, if you have an issue, such as needing an accommodation for a religious obligation or learning disability, speak with me before it affects your performance; afterward it is too late. Do not ask for favors; instead, offer proposals that show initiative and a willingness to work.

To request academic accommodations due to a disability, please contact Disability Resources for Students, 448 Schmitz, 206-543-8924/V, 206-543-8925/TTY. If you have a letter from Disability Resources for Students indicating that you have a disability that requires academic accommodations, please present the letter to me so we can discuss the accommodations that you might need for the class. I am happy to work with you to maximize your learning experience.

Academic Misconduct

I am committed to upholding the academic standards of the University of Washington’s Student Conduct Code. If I suspect a student violation of that code, I will first engage in a conversation with that student about my concerns.

If we cannot successfully resolve a suspected case of academic misconduct through our conversations, I will refer the situation to the Department of Communication advising office, which can then work with the COM Chair to seek further input and, if necessary, move the case up through the College.

While evidence of academic misconduct may result in a lower grade, I will not unilaterally lower a grade without addressing the issue with you first through the process outlined above.

Credit and Notes

This syllabus has, in ways that should be obvious, borrowed and built on the OpenIntro Statistics curriculum. I also drew some inspiration and confidence from Tom S. Clark's syllabus for POLS 508: Data Analysis in Fall 2014, which used the same two textbooks.