Statistics and Statistical Programming (Fall 2020): Difference between revisions

<div style="float:right;" width=30%; class="toclimit-3">__TOC__</div>


;Statistics and Statistical Programming
:Media, Technology & Society (MTS) 525 and Communication Studies 395
:Tuesdays & Thursdays 1-2:50pm CT
:Fall 2020
:Northwestern University

;Course websites
: [https://canvas.northwestern.edu/courses/122522 Canvas] for [https://canvas.northwestern.edu/courses/122522/announcements announcements], [https://canvas.northwestern.edu/courses/122522/assignments assignments], and some [https://canvas.northwestern.edu/courses/122522/files files].
: [https://northwestern.zoom.us Zoom] for synchronous course meetings.
: [https://discord.com Discord] for discussions and chat.
: [https://wiki.communitydata.science/Statistics_and_Statistical_Programming_(Fall_2020) This wiki page] for nearly everything else.

;'''Instructor:''' [http://aaronshaw.org Aaron Shaw] ([mailto:aaronshaw@northwestern.edu aaronshaw@northwestern.edu])
:Office Hours: Thursday 10am-12pm and by appointment
:Please use [[User:Aaronshaw/OH|office hours signups (with location information)]]
:Also usually available via chat during "business hours."

;'''Teaching Assistant:''' [http://nickmvincent.com Nick Vincent] ([mailto:nickvincent@u.northwestern.edu nickvincent@u.northwestern.edu])
:Office Hours: Monday 10am-12pm and by appointment. I'll try to respond to asynchronous questions in a timely fashion during "business hours" (9am-5pm Central Time). I respond best to email (above), but am also happy to use Discord for quicker back-and-forth.
:I am happy to try out alternative communication software for OH!


<br>
[[File:Datasaurus.gif|left|450px|frame|Image from [https://www.autodeskresearch.com/publications/samestats Matejka and Fitzmaurice, ''CHI'', 2017]|link=https://www.autodeskresearch.com/publications/samestats]]
<br clear=all>


== Course information ==
=== Overview and learning objectives ===


This course provides a get-your-hands-dirty introduction to inferential statistics and statistical programming, mostly for applications in the social sciences and social computing. My main objectives are for all participants to acquire the conceptual, technical, and practical skills to conduct your own statistical analyses and become more sophisticated consumers of quantitative research in communication, human-computer interaction (HCI), and adjacent disciplines.

You are not required to know much about statistics or statistical programming to take this class. I will assume some (very little!) knowledge of the basics of empirical research methods and design, basic algebra and arithmetic, and a willingness to work to learn the rest. In general we are not going to cover most of the math behind the techniques we'll be learning. Although we may do some math, this is not a math class. This course will also not require knowledge of calculus or matrix algebra. I will ''not'' do proofs on the board. Instead, the class is unapologetically focused on the application of statistical methods. Likewise, while some exposure to R, other programming languages, or other statistical computing resources will be helpful, it is not assumed.


'''Why this course? Why statistical programming? Why R?'''


Many comparable courses in statistics and quantitative methods do not emphasize statistical programming. So why bother? By learning statistical programming you will gain a deeper understanding of both the principles behind your analysis techniques and the tools you use to apply those techniques. In addition, a solid grasp of statistical programming will prepare you to create reproducible research, avoid common errors, and improve both the durability and validity of your work.


Other programming languages are also well suited to statistics, including Stata and Python. I do most of my work with R, so that guides my choice for the course. That said, I opt to use and teach with R for a few reasons:
* R is freely available and open source.
* R is the most widely used package in statistics and several social scientific fields.
* R (along with Stata) will be used in most of the advanced stats classes I hope you will take after this course.
* R is a better general-purpose programming language than Stata, which means that R programming skills will let you solve non-statistical problems and may make it easier to learn other programming languages like Python.


=== Format and structure ===
<!---
I expect everybody to come to class, every week, with a laptop and a power cord, ready to answer any question on the problem set and having uploaded code related to the programming questions. The class is listed as nearly 3 hours long and, with the exception of short breaks, I intend to use the entire period. Please be in class on time, plugged in, and ready to go.
--->
 
This course will proceed in a '''remote''' format that includes ''asynchronous'' and ''synchronous'' elements (more on those below). In general, the organization of the course adopts a "flipped" approach where participants consume, discuss, and process instructional materials outside of "class" and we use synchronous meetings to answer questions, address challenges or concerns, work through solutions, and hold semi-structured discussions.
 
The course introduces ''both'' basic statistical concepts and applications of those concepts through statistical programming. As a result, we will usually dedicate part of each week to a particular set of concepts and part of each week to applied data analysis and/or interpretation. A brief description of how I expect it all to work follows below. We'll talk about it more during the first class session.
 
====Asynchronous elements of the course====
 
These include all readings, recorded lectures/slides, tutorials, textbook exercises, problem sets, and other assignments. I expect you to complete (or at least attempt to complete!) these outside of our class meeting times. I also strongly encourage you to identify, submit, and discuss questions about the material '''before each class meeting''' whenever possible.
 
We will use Discord for everyday discussions and chat related to the course. In general, the teaching team will try to keep an eye on the various server channels during "business hours." To the extent that we can respond to questions and concerns there, we'll do so. We'll also use the discussion channels to identify topics that might benefit from synchronous conversation during the course meetings. Hopefully, writing and talking about questions and concerns outside of the synchronous course meetings will help support accountability, learning, and more effective use of our meeting time.
 
For nearly all of the "instructional" material introducing particular statistical concepts and techniques, you are assigned materials from the OpenIntro textbook and lecture materials created by the textbook authors. Please note that this means I will not deliver lectures during our class meetings. Please also note that this means you are responsible for coordinating your working groups and any collaborative work with other members of the class outside of our class meeting times.
 
====Synchronous elements of the course====


The synchronous elements of the course will be the two weekly class meetings that will happen via video conference (Zoom). These are scheduled to run for a maximum of 110 minutes. Each session will include multiple short breaks.
 
We will use the class meetings to discuss and work through any questions or challenges you encounter in the materials assigned for that day. This means that I encourage you to identify, submit, and discuss questions about the material '''before each class meeting''' whenever possible. Doing so will give the teaching team time to sift, sort, and organize the questions into a hopefully-cohesive plan for each class session that is tailored to the specific concerns you encounter in the material. Obviously, we anticipate that questions will arise during the class sessions as well, and we'll do our best to adapt as we go.
 
A couple of other notes about the synchronous course meetings:
* Aaron plans to record the course meetings and have them available to class participants only via Zoom/Canvas. Please get in touch if you have concerns or requests about this.
* The teaching team will do our best to notice and respond to any questions or comments that come up via Discord or Zoom during the class. Please do what you can to support these efforts.
* You might want to create/acquire something like [https://www.mccormick.northwestern.edu/news/articles/2020/08/back-to-school-hack-shares-students-handwritten-work-and-teacher-response-in-real-time.html NU Mechanical Engineering Professor Michael Peshkin's homebrew document camera] to facilitate sharing hand-written notes/drawings during class.
 
In addition, because randomness is extremely important in statistics, I may occasionally '''randomly assign''' different working groups to share and discuss their solutions to selected textbook exercises or problem set questions during class. These random assignments will be announced ahead of time so that the group has an opportunity to prepare. The idea here is to structure some participation in the synchronous sessions to ensure an equitable distribution of the responsibility to discuss questions, answers, points of confusion, and alternatives.
 
==== Working groups ====
 
At the start of the course you will be assigned to a small working group. This will be a group of 2-3 students (exact numbers will depend on the final enrollment) with whom you may meet outside of class time to discuss, complete, and/or review your weekly assignments (as well as some of the research project assignments). The groups will rotate at least once during the quarter to ensure that you get to work with different members of the class. The main idea is to support collaborative learning, peer support, and accountability. While the specifics of exactly when and how you work with your working group will largely be up to you, the teaching team will provide [[Statistics_and_Statistical_Programming_(Fall_2020)/Working_groups_template|suggestions in the form of a template]] that you can use as a starting point.
 
As a general rule, we strongly encourage you to collaborate with members of your working group on any/all weekly (minor) assignments. You may, if you choose, also collaborate with others in your group or the class on your research project (major) assignments; however, collaborative research projects should be discussed with a member of the teaching team and all research project assignment submissions should include the names of all collaborators.


<!---
Although the day-to-day routine will vary, each class session will generally include the following:
* Quick updates about assignments, projects, and meta-discussion about the class.
* Discussion of '''programming challenges''' due that day (and related to the previous week's R lecture materials).
* Discussion of '''statistics questions''' related to new material in Diez, Barr, and Çetinkaya-Rundel.
* Discussion of any exemplary empirical paper we have read and the '''empirical paper questions'''.
--->


=== Textbook, readings, and resources ===


This class will use a freely-licensed textbook:


* Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. 2019. [https://www.openintro.org/book/os/ ''OpenIntro Statistics'']. 4th edition. OpenIntro, Inc.


The textbook (in any format) is required for the course. You can [https://www.openintro.org/go?id=os4&referrer=/book/os/index.php download it] at no cost and purchase hard copy versions in either [https://www.openintro.org/go?id=os4_color_pb&referrer=/book/os/index.php full color ($60)] or in [https://www.openintro.org/go?id=os4_bw_pb&referrer=/book/os/index.php black and white ($20)]. The B&W version is very affordable and I strongly recommend buying a hard copy for the purposes of the course and subsequent reference use. The book is excellent and has been adopted widely. It has also developed a large online community of students and teachers who have shared other resources. Lecture slides, videos, notes, and more are all freely licensed (many through the website and others elsewhere).


I will also assign several chapters from the following:

* Reinhart, Alex. 2015. ''Statistics Done Wrong: The Woefully Complete Guide''. San Francisco, CA: No Starch Press. ([https://search.library.northwestern.edu/primo-explore/fulldisplay?docid=01NWU_ALMA51732460650002441&context=L&vid=NULVNEW&search_scope=NWU&tab=default_tab&lang=en_US Safari online via NU libraries])


This book provides a readable conceptual introduction to some common failures in statistical analysis that you should learn to recognize and avoid. It was also written by a Ph.D. student. You have access to an electronic copy via the NU library (you'll need to sign in and/or use the NU VPN to access it), but you may find it helpful to purchase as well.


A few other books may be useful resources while you're learning to analyze, visualize, and interpret statistical data with R. I will share some advice about these during the first class meeting:

* Verzani, John. 2014. ''Using R for Introductory Statistics''. 2nd edition. Boca Raton: Chapman and Hall/CRC. ([https://en.wikipedia.org/wiki/Special:BookSources/978-1-4665-9073-1 Various Sources]; [https://www.amazon.com/Using-Introductory-Statistics-Second-Chapman/dp/1466590734/ref=mt_hardcover?_encoding=UTF8&me= Amazon])
* Wickham, Hadley. 2010. ''ggplot2: Elegant Graphics for Data Analysis''. 1st ed. 2009. Corr. 3rd printing 2010 edition. New York: Springer. ([https://link.springer.com/book/10.1007%2F978-3-319-24277-4 Springer/NU Libraries]; [https://en.wikipedia.org/wiki/Special:BookSources/978-0-596-80915-7 Various Sources])
* Wickham, Hadley, and Garrett Grolemund. 2017. ''R for Data Science''. Sebastopol, CA: O'Reilly. ([https://r4ds.had.co.nz/ Online version]).


There are also some invaluable non-textbook resources:
* [ftp://cran.r-project.org/pub/R/doc/contrib/Baggott-refcard-v2.pdf Baggott's R Reference Card v2] — Print this out. Take it with you everywhere and look at it dozens of times a day. You will learn the language faster!
* [https://stackoverflow.com/questions/tagged/r StackOverflow R Tag] — Somebody already had your question about how to do ''X'' in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time, but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise.
* [http://rseek.org/ Rseek] — Rseek is a modified version of Google that searches only R-related websites. R can be hard to search for because it is a single common letter. This has become much easier over time as R has become more popular, but it can still be an issue sometimes, and Rseek is a good solution.
* [https://ggplot2.tidyverse.org/ ggplot2 documentation] — ggplot2 is a powerful data visualization package for R that I recommend highly. The documentation is indispensable for learning how to use it.
* [https://depts.washington.edu/acelab/proj/Rstats/index.html Statistical Analysis and Reporting in R] — A set of resources created and distributed by Jacob Wobbrock (University of Washington, School of Information) in conjunction with a MOOC he teaches. Contains cheatsheets, code snippets, and data to help execute commonly encountered statistical procedures in R.
* [https://www.datacamp.com DataCamp] offers introductory R courses. Northwestern usually has some free accounts that get passed out via Research Data Services each quarter. Apparently, instructors teaching relevant coursework can also [https://www.datacamp.com/groups/education request] free access to DataCamp for their courses. If folks are interested in this, I can reach out.


* If you are planning to analyze large-scale data (i.e., data that won't fit in memory on your laptop) then you will want to sign up for a research allocation on Quest, which is Northwestern's high-performance computing cluster. Instructions on how to do that are [[Statistics_and_Statistical_Programming_(Spring_2019)/Quest_at_Northwestern|here]].


=== Weekly (minor) assignments ===


In order to support continuous progress towards the learning goals for the course, I have assigned some textbook exercises or a problem set ahead of every class. These assignments will provide the basis on which the teaching team will assess and provide feedback on your participation and engagement with the course material.


The first week or so of the course is textbook-focused to get us warmed up. Starting in week 2, we will do more statistical programming and apply the textbook concepts using R and RStudio. In general, we will cover the problem sets in the first session of the week and the textbook materials in the second session.


==== Textbook exercises ====
The focus here is self-assessment of your understanding of the textbook material; you do not need to hand anything in. I expect that you will work on the exercises, review and discuss solutions, and submit any questions ahead of or during class. Please note that solutions to odd-numbered problems appear in the back of the book. The teaching team will distribute solutions to even-numbered problems as well.

==== Problem sets ====
The course will include problem sets and these may incorporate several kinds of questions:

* '''Statistics questions''' about statistical concepts and principles.
* '''Programming challenges''' that you should solve using R.
* '''Empirical paper questions''' about other assigned readings.


For the problem sets, I ask that you submit your work [https://canvas.northwestern.edu/courses/122522/assignments via Canvas 24 hours before class] (i.e., Monday afternoon for our Tuesday class sessions). I will explain exactly how this will work during the first class. For the programming challenges, you should submit code and text for your solutions (again, more on how later). If you get completely stuck on a problem, that's okay, but please provide whatever you have.


Problem sets will be evaluated on a complete/incomplete basis. Although the problem sets will not be assigned a letter grade, they are a central focus of the course and completing them will support your mastery of the material in multiple ways. Working through them on schedule will also make it possible for you to participate in the synchronous course meetings and online discussions of course material effectively. Your ability to do so will figure prominently in your participation grade for the course (see the section on grading and assessment below).


=== Research project (major) assignments ===


==== Overview ====
As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all:
As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all:


Line 118: Line 159:
* '''Find a dataset''' — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or an analysis of a previously collected dataset.
* '''Find a dataset''' — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or an analysis of a previously collected dataset.
* '''Engage in descriptive data analysis''' — Use R to calculate descriptive statistics and visualizations to describe your data.
* '''Engage in descriptive data analysis''' — Use R to calculate descriptive statistics and visualizations to describe your data.
* '''Motivate and test at least one hypothesis about relationships between two or more variables'''
* '''Motivate and test at least one hypothesis about relationships between two or more variables''' — I'm happy to discuss alternatives to formal hypothesis testing procedures (even if some of them are beyond the scope of this course).
* '''Report and interpret your findings''' — You will do this in both a short paper and a short presentation.
* '''Report and interpret your findings''' — You will do this in both a short paper and a short (recorded) presentation.
* '''Ensure that your work is replicable''' — You will need to provide code and data for your analysis in a way that makes your work replicable by other researchers.
* '''Ensure that your work is replicable''' — You will need to provide code and data for your analysis in a way that makes your work replicable by other researchers.


''I strongly urge you'' to produce a project that will further your academic career outside of the class. There are many ways that this can happen. Some obvious options are to prepare a project that you can submit for publication, use as pilot analysis that you can report in a grant or thesis proposal, and/or that fulfills a degree requirement.
''I strongly urge you'' to produce a project that will further your academic career outside of the class. There are many ways that this can happen. Some obvious options are to prepare a project that you can submit for publication, use as pilot analysis that you can report in a grant or thesis proposal, and/or use to fulfill a degree requirement.


There are several intermediate milestones and deadlines to help you accomplish a successful research project. Unless otherwise noted, all deliverables should be submitted via Canvas.
There are several intermediate milestones, deliverables, and deadlines to help you accomplish a successful research project. Unless otherwise noted, all deliverables should be submitted via Canvas by 5pm CT on the day they are due.


==== Project plan and dataset identification ====


;Due date: Thursday, April 18, 2019
==== Research project plan and dataset identification ====
 
;Due date: October 9, 2020, 5pm CT
;Maximum length: 500 words (~1-2 pages)
;Maximum length: 500 words (~1-2 pages)


Line 134: Line 176:


* An abstract of the proposed study including the topic, research question, theoretical motivation, object(s) of study, and anticipated research contribution.
* An abstract of the proposed study including the topic, research question, theoretical motivation, object(s) of study, and anticipated research contribution.
* An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain why and when you will.
* An identification of the dataset you will use and a description of the rows and columns or type(s) of data it will include. If you do not currently have access to these data, explain why and when you will.
* A short (several sentences?) description of how the project will fit into your career trajectory.
* A short (several sentences?) description of how the project will fit into your career trajectory.


==== Project planning document ====
;Due date: Thursday, May 16, 2019
;Maximum length: ~5 pages
The project planning document is a basic shell/outline of an empirical quantitative research paper. Your planning document should have the following sections: (a) Rationale; (b) Objectives; (b.1) General objectives; (b.2) Specific objectives; (c) (Null) hypotheses; (d) Conceptual diagram and explanation of the relationship(s) you plan to test; (e) Measures; (f) Dummy tables/figures; (g) anticipated finding(s) and research contribution(s). Longer descriptions of each of these planning document sections (as well as a few others) can be found [[CommunityData:Planning document|on this wiki page]].
I have also provided three example planning documents via our Canvas site:
* [https://canvas.northwestern.edu/files/6908602/download?download_frd=1 One by public health researcher Mika Matsuzaki]. The first planning document I ever saw and still one of the best. It's missing a measures section. It's also focused on a research context that is probably very different from yours, but try not to get bogged down by that and imagine how you might map the structure of the document to your own work.
* [https://canvas.northwestern.edu/files/6919735/download?download_frd=1 One by Jim Maddock] created as part of a qualifying exam earlier in 2019. Jim doesn't provide dummy tables or anticipated findings/contributions, but he has an especially phenomenal explanation of the conceptual relationships and processes he wants to test.
* [https://canvas.northwestern.edu/files/6908606/download?download_frd=1 One provided as an appendix to Gerber and Green's excellent textbook, ''Field Experiments: Design, Analysis, and Interpretation'' (FEDAI)]. It's over-detailed and incredibly long for our purposes, but nevertheless an exemplary approach to planning empirical quantitative research in a careful, intentional way that is worthy of imitation.
==== Project presentation and paper ====
;Paper due date: Monday, June 10, 2019
;Maximum length: 6000 words (~20 pages)
;Presentation due date: Thursday, May 30 or Thursday, June 6, 2019
;Maximum length: 8 minutes
''The paper:'' Ideally, I expect you to produce a high quality short research paper that you might revise and submit for publication and/or a dissertation milestone. I do not expect the paper to be ready for publication, but it should contain polished drafts of all the necessary components of a scholarly quantitative empirical research study. In terms of the structure, please see the page on the [[structure of a quantitative empirical research paper]].
As noted above, you should also provide data, code, and any documentation sufficient to enable the replication of all analysis and visualizations. If that is not possible/appropriate for some reason, please talk to me so that we can find another solution.
Because the emphasis in this class is on statistics and methods and because I'm not an expert in each of your fields, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is important. As a result, you need not focus on these elements of the work in your written submission. Instead, feel free to start with a brief summary of the purpose and importance of this research followed by an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts and they will not figure prominently in my assessment of the work.
I have a strong preference for you to write the paper individually, but I'm open to the idea that you may want to work with others in the class. Please contact me ''before'' you attempt to pursue a collaborative final paper.
I do not have strong preferences about the style or formatting guidelines you follow for the paper and its bibliography. However, ''your paper must follow a standard format'' (e.g., [https://cscw.acm.org/2019/submit-papers.html ACM SIGCHI CSCW format] or [https://www.apastyle.org/index APA 6th edition] ([https://templates.office.com/en-us/APA-style-report-6th-edition-TM03982351 Word] and [https://www.overleaf.com/latex/templates/sample-apa-paper/fswjbwygndyq LaTeX] templates)) that is applicable for a peer-reviewed journal or conference proceedings in which you aim to publish the work (they all have formatting or submission guidelines published online and you should follow them). This includes the references. I also strongly recommend that you use reference management software to handle your bibliographic sources.
'' [[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|The presentation:]]'' The presentation will provide an opportunity to share a brief summary of your project and findings with the other members of the class. Since you will all give other research presentations throughout your career, I strongly encourage you to take the opportunity to refine your academic presentation skills. The document [https://canvas.northwestern.edu Creating a Successful Scholarly Presentation] (file will be posted to Canvas) may be useful.
: More details about the presentation goals, format suggestions, and more are available [[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|on this page]]
=== Grading ===
I will assign grades (usually a numeric value ranging from 0-10) for each of the following aspects of your performance. The percentage values in parentheses are weights that will be applied to calculate your overall grade for the course.
* Participation: 40%
* Proposal identification: 5%
* Final project planning document: 5%
* Final project presentation: 10%
* Final project paper: 40%


My assessment of your paper will reflect the clarity of the written work, the effective execution and presentation of quantitative empirical analysis, as well as the quality and originality of the analysis. Throughout the quarter, we will talk a lot about the qualities of exemplary quantitative research. I expect your final project to embody these exemplary qualities.
===== Notes on finding a dataset =====


== Note on finding a dataset ==
In order to complete your final project, you will each need a dataset. If you already have a dataset for the project you plan to conduct, great! If not, fear not! There are many datasets to draw from. Some ideas are below (please suggest others, provide updated links, or report problems). The teaching team will also be available to help you brainstorm/find resources if needed:
 
In order to complete your project, you will each need a dataset. If you already have a dataset for the project you plan to conduct, great! If not, there are many datasets to draw from. Some ideas are below. Jeremy and Aaron will also be available to help you brainstorm/find resources if needed:


* Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use? Other relationships you could analyze?
* Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use? Other relationships you could analyze?
Line 195: Line 191:
* Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
* Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
* The City of Chicago has one of the best [https://data.cityofchicago.org/ data portal sites] of any municipality in the U.S. (and better than many federal agencies). There are also numerous administrative datasets released by other public entities (try searching!) that you might find inspiring.
* The City of Chicago has one of the best [https://data.cityofchicago.org/ data portal sites] of any municipality in the U.S. (and better than many federal agencies). There are also numerous administrative datasets released by other public entities (try searching!) that you might find inspiring.
<!---
* [http://fivethirtyeight.com FiveThirtyEight.com] has published a [https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html GitHub repository and an R package] with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.
* <TODO fix/update accordingly> Set up a meeting with Jennifer Muilenburg — Data Curriculum and Communications Librarian who runs [https://www.lib.washington.edu/digitalscholarship/services/data research data services at the UW libraries]. Her email is: libdata@uw.edu. I have talked to her about this course and she is excited about meeting with you to help.
* If you are interested in studying online communities, there are some great resources for accessing data from Reddit, Wikipedia, and StackExchange. See [https://files.pushshift.io/reddit/ pushshift] for dumps of Reddit data, [https://meta.wikimedia.org/wiki/Research:Data here] for an overview of Wikipedia's data resources, and [https://data.stackexchange.com/ Stack Exchange's data portal].
-->
* The NY Times is publishing a [https://github.com/nytimes/covid-19-data COVID-19 data repository] that includes county-level metrics for deaths, mask usage, and other pandemic-related data. They release a lot of it as frequently updated .csv files and the repository includes documentation of the measurements, data collection details, and more.
* [http://fivethirtyeight.com FiveThirtyEight.com] has published a [https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html GitHub repository and an R package] with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.  
* The Community Data Science Collective and colleagues have created a [[COVID-19_Digital_Observatory| COVID-19 digital observatory]] (hosted in part right here on this wiki!) that publishes a bunch of pandemic-related data as csv and json files.
* The [https://openpolicing.stanford.edu Stanford Open Policing project] has published a huge archive of policing data related mostly to traffic stops in states and many cities of the U.S. We'll use at least one of these files for a problem set.


=== Human subjects research, IRB, and ethics ===
==== Research project planning document ====
In general, you are responsible for making sure that you're on the right side of the IRB requirements and that your work meets applicable ethical norms and standards.


Class projects generally do not need IRB approval, but research for publications, dissertations, and sometimes even pilot studies generally fall under IRB purview. You should ''not'' plan to seek IRB approval/determination retroactively. If your study may involve human subjects and you may ever publish it in any form, you will need IRB oversight of some sort.
;Due date: October 30, 2020, 5pm CT
;Suggested length: ~5 pages


Secondary analysis of anonymized data is generally not considered human subjects research, but I strongly suggest that you get a determination from [https://irb.northwestern.edu/ the Northwestern IRB] before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need to list a faculty sponsor or Principal Investigator, that should ideally be your advisor. If that doesn't make sense for some reason, please talk to me.
The project planning document is a shell/outline of an empirical quantitative research paper. Your planning document should have the following sections: (a) Rationale; (b) Objectives; (b.1) General objectives; (b.2) Specific objectives; (c) (Null) hypotheses; (d) Conceptual diagram and explanation of the relationship(s) you plan to test; (e) Measures; (f) Dummy tables/figures; (g) anticipated finding(s) and research contribution(s). Longer descriptions of each of these planning document sections (as well as a few others) can be found [[CommunityData:Planning document|on this wiki page]].


== Structure of Class ==
I will also provide three example planning documents via our Canvas site (links to-be-updated for 2020 edition of the course):
* [https://canvas.northwestern.edu/files/9439380/download?download_frd=1 One by public health researcher Mika Matsuzaki]. The first planning document I ever saw and still one of the best. It's missing a measures section. It's also focused on a research context that is probably very different from yours, but try not to get bogged down by that and imagine how you might map the structure of the document to your own work.
* [https://canvas.northwestern.edu/files/9421229/download?download_frd=1 One by Jim Maddock] created as part of a qualifying exam early in 2019. Jim doesn't provide dummy tables or anticipated findings/contributions, but he has an especially phenomenal explanation of the conceptual relationships and processes he wants to test.
* [https://canvas.northwestern.edu/files/9439379/download?download_frd=1 One provided as an appendix to Gerber and Green's excellent textbook, ''Field Experiments: Design, Analysis, and Interpretation'' (FEDAI)]. It's over-detailed and over-long for the purposes of this assignment, but nevertheless an exemplary approach to planning empirical quantitative research in a careful, intentional way that is worthy of imitation.


I expect everybody to come to class, every week, with a laptop and a power cord, ready to answer any question on the problem set and having uploaded code related to the programming questions. The class is listed as nearly 3 hours long and, with the exception of short breaks, I intend to use the entire period. Please be in class on time, plugged in, and ready to go.
==== Research project presentation ====


When it comes to the statistics material, this will mostly be a so-called "flipped" classroom. This means we will rely on the textbook and other resources to introduce the material and we will use the class sessions to discuss questions as they come up.
;Presentation due date: December 3, 2020, 5pm CT
;Maximum length: 10 minutes


The problem sets each week will  
<!-- TODO revisit old presentations page to update/adapt
[[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations]]
--->
You will also create and record a short presentation of your final project. The presentation will provide an opportunity to share a brief overview of your project and findings with the other members of the class. Since you will all give other research presentations throughout your career, I strongly encourage you to take the opportunity to refine your academic presentation skills. The document [https://canvas.northwestern.edu/files/9439377/download?download_frd=1 Creating a Successful Scholarly Presentation] (file posted to Canvas) may be useful.


Although the day-to-day routine will vary, each class session will generally include the following:
Additional details about the presentation goals, format suggestions, resources, and more will be provided later in the quarter.
* Quick updates about assignments, projects, and meta-discussion about the class.
* Discussion of '''programming challenges''' due that day (and related to the previous week's R lecture materials).
* Discussion of  '''statistics questions''' related to new material in Diez, Barr, and Çetinkaya-Rundel.
* Discussion of any exemplary empirical paper we have read and the '''empirical paper questions'''.


== Schedule ==
==== Research project paper ====


When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y of chapter n.
;Paper due date: December 10, 2020, 5pm CT
;Maximum length: 6000 words (~20 pages)


=== Week 1: Thursday April 4: Introduction, Setup, and Data and Variables ===
I expect you to produce a short, high quality research paper that you might revise, extend, and submit for publication and/or a dissertation milestone. I do not expect the paper to be ready for publication, but it should contain polished drafts of all the necessary components of a scholarly quantitative empirical research study. In terms of the structure, please see the page on the [[structure of a quantitative empirical research paper]].


* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 1]]
As noted above, you should also provide data, code, and any documentation sufficient to enable the replication of all analysis and visualizations. If that is not possible/appropriate for some reason, please talk to me so that we can find another solution.


Please complete the readings and assignment prior to class so that we can discuss them and start talking through some of the examples in R together.
Because the emphasis in this class is on statistics and methods and because I'm probably not an expert in the substance of your research domain, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is important. As a result, you need not focus on these elements of the work in your written submission. Instead, feel free to start with a brief summary of the purpose and importance of this research followed by an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts and they will not figure prominently in my assessment of the work.


'''Required Readings:'''
I have a strong preference for you to write the paper individually, but I'm open to the idea that you may want to work with others in the class. Please contact me ''before'' you attempt to pursue a collaborative final paper.


* Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data)
I do not have strong preferences about the style or formatting guidelines you follow for the paper and its bibliography. However, ''your paper must follow a standard format'' (e.g., [https://cscw.acm.org/2019/submit-papers.html ACM SIGCHI CSCW format] or [https://www.apastyle.org/index APA 6th edition] ([https://templates.office.com/en-us/APA-style-report-6th-edition-TM03982351 Word] and [https://www.overleaf.com/latex/templates/sample-apa-paper/fswjbwygndyq LaTeX] templates)) that is applicable for a peer-reviewed journal or conference proceedings in which you might aim to publish the work (they all have formatting or submission guidelines published online and you should follow them). This includes the references. I also strongly recommend that you use reference management software like Zotero to handle your bibliographic sources.
* Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks. ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Open Access]]


'''Recommended Readings:'''
==== Human subjects research, IRB, and ethics ====
In general, you are responsible for making sure that you're on the right side of the IRB requirements and that your work meets applicable ethical norms and standards.


* Verzani: §1 (Getting Started), §2 (Univariate data) [[https://canvas.northwestern.edu/verzani_ch1-ch2.pdf Available via Canvas]]
Class projects generally do not need IRB approval, but research for publications, dissertations, and sometimes even pilot studies do fall under IRB purview. You should ''not'' plan to seek IRB approval/determination retroactively. If your study may involve human subjects and you may ever publish it in any form, you will need IRB oversight of some sort.
* Verzani: §A (Programming)
* Healy: §2 (and skim the prefatory material as well as §1)
'''Assignment (Complete before class):'''


* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 1]]
Secondary analysis of anonymized data is generally not considered human subjects research, but I strongly suggest that you get a determination from [https://irb.northwestern.edu/ the Northwestern IRB] before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need to list a faculty sponsor or Principal Investigator, that should ideally be your advisor. If that doesn't make sense for some reason, please talk to me.


'''Lectures:'''
Research ethics is a broad and complex topic. We'll talk about issues related to ethics and quantitative empirical research a bit more during class, but will likely only scratch the surface. I strongly encourage you to pursue further reading, conversation, coursework, and reflection as you consider how to understand and apply ethical principles in the context of your own research and teaching.
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w01-R_lecture.zip Week 1 R lecture materials] (.zip file)
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w01-s01-intro.webm Week 1 screencast (part 1, 23 minutes)] (the video should load directly in browser window)
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w01-s02-intro.webm Week 1 screencast (part 2, 27 minutes)]


'''Resources:'''
=== Grading and assessment ===
* [https://www.openintro.org/download.php?file=os3_slides_01&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §1 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including some for §1


=== Week 2: Thursday April 11: Probability and Visualization ===
I will assign grades (usually a numeric value ranging from 0-10) for each of the following aspects of your performance. The percentage values in parentheses are weights that will be applied to calculate your overall grade for the course.
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 2]]
* Questions? Topics you'd like to discuss? Add them to the [https://canvas.northwestern.edu/courses/90927/discussion_topics/601700 Canvas discussion] for this week's material.


'''Required Readings:'''
* Weekly participation: 40%
* Proposal identification: 5%
* Final project planning document: 5%
* Final project presentation: 10%
* Final project paper: 40%
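
As a rough illustration of how the weights above might combine, here is a minimal sketch. It assumes that every component is scored on the 0-10 scale and that the overall grade is a simple weighted average; the component names and the example scores are hypothetical, and the actual calculation may differ in its details.

```python
# Weights taken from the list above, expressed as fractions of the overall grade.
WEIGHTS = {
    "participation": 0.40,
    "proposal": 0.05,
    "planning_document": 0.05,
    "presentation": 0.10,
    "paper": 0.40,
}


def weighted_grade(scores, weights=WEIGHTS):
    """Return the weighted average of component scores (assumed 0-10 scale)."""
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted components")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical scores, for illustration only.
example_scores = {
    "participation": 9,
    "proposal": 10,
    "planning_document": 8,
    "presentation": 7,
    "paper": 8,
}

print(round(weighted_grade(example_scores), 2))  # prints 8.4
```

Because participation and the paper each carry 40% of the weight, a one-point change in either of those scores moves the overall result by 0.4 points, while a one-point change in the proposal score moves it by only 0.05.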


* Diez, Barr, and Çetinkaya-Rundel: §2 (Probability)
The teaching team will jointly and holistically evaluate your participation along four dimensions: attendance, preparation, engagement, and contribution. These are quite similar to the dimensions described in the "Participation Rubric" section of [https://mako.cc/teaching/assessment.html Benjamin Mako Hill's assessment page] and [https://reagle.org/joseph/zwiki/Teaching/Assessment/Participation.html Joseph Reagle's participation assessment rubric]. Exceptional participation means excelling along all four dimensions. Please note that participation ≠ talking/typing more and I encourage all of us to seek [https://reagle.org/joseph/zwiki/Teaching/Best_Practices/Learning/Balance_in_Discussion.html balance in our discussions].
* Shaw, Aaron and Yochai Benkler. 2012. A tale of two blogospheres: Discursive practices on the left and right. ''American Behavioral Scientist''. 56(4): 459-487. [[https://doi.org/10.1177%2F0002764211433793 available via NU libraries]]


'''Recommended Readings:'''
The teaching team's assessment of your final project proposal, planning document, presentation, and paper will reflect the clarity of the work, the effective execution and presentation of quantitative empirical analysis, as well as the quality and originality of the analysis. A more detailed assessment rubric will be provided. Throughout the quarter, we will talk about the qualities of exemplary quantitative research. In general, I expect your final project to embody these exemplary qualities.
* Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics) <!---[[https://faculty.washington.edu/makohill/com521/verzani-usingr-ch3.1-2_ch4_ch5.pdf Available with UW NetID]]--->
* [https://seeing-theory.brown.edu/ Seeing Theory] §1 (Basic Probability) and §2 (Compound Probability). (Note: this site provides a beautiful visual introduction to core concepts in probability and statistics).
<!---
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]
--->
* Healy: §3.


'''Assignment (Complete Before Class):'''
=== Policies ===


* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 2]]
==== General course policies ====


'''Lectures:'''
[[User:Aaronshaw/Classroom_policies|General policies]] on a wide variety of topics including classroom equity, attendance, academic integrity, accommodations, late assignments, and more are provided [[User:Aaronshaw/Classroom_policies|on Aaron's class policies page]]. Below are some policy statements specific to this course and quarter.
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w02-R_lecture.Rmd Week 2 R lecture materials] (.Rmd file)
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w02.webm Week 2 screencast (17 minutes)]


'''Resources:'''
==== Teaching and learning in a pandemic ====


* [https://www.openintro.org/download.php?file=os3_slides_02&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §2 Lecture Notes]
The Covid-19 pandemic will impact this course in various ways, some of them obvious and tangible and others harder to pin down. On the obvious and tangible front, we have things like a mix of remote and (a)synchronous instruction, the fact that many of us will not be anywhere near campus or each other this year, and the unusual academic calendar. These will reshape our collective "classroom" experience in major ways.  
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 short videos for §2


=== Week 3: Thursday April 18: Distributions ===
On the "harder to pin down" side, many of us may experience elevated levels of exhaustion, stress, uncertainty and/or distraction. We may need to provide unexpected support to family, friends, or others in our communities. I have personally experienced all of these things at various times over the past six months and I expect that some of you have too. It is a difficult time.


* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 3]]
I believe it is important to acknowledge these realities of the situation and create the space to discuss and process them in the context of our class throughout the quarter. As your instructor and colleague, I commit to do my best to approach the course in an adaptive, generous, and empathetic way. I will try to be transparent and direct with you throughout—both with respect to the course material as well as the pandemic and the university's evolving response to it. I ask that you try to extend a similar attitude towards everyone in the course. When you have questions, feedback, or concerns, please try to share them in an appropriate way. If you require accommodations of any kind at any time (directly related to the pandemic or not), please contact the teaching team.


'''Required Readings:'''
==== Expectations for synchronous remote sessions ====


* Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4: You should read the rest of the chapter (§3.3 and §3.5). I won't assign problem set questions about it but it's still important to be familiar with.
The following are some baseline expectations for our synchronous remote class sessions. I expect that these can and will evolve. Please feel free to ask questions, suggest changes, or raise concerns during the quarter. I welcome all input.
* All members of the class are expected to create a supportive and welcoming environment that is respectful of the conditions under which we are participating in this class.
* All members of the class are expected to take reasonable steps to create an effective teaching/learning environment for themselves and others.


'''Recommended Readings:'''
And here are suggested protocols for any video/audio portions of our class:
* Verzani: §6 (Populations)
* Please mute your microphone whenever you're not speaking and learn to use [https://en.wikipedia.org/wiki/Push-to-talk "push-to-talk"] if/when possible.
* [https://seeing-theory.brown.edu/ Seeing Theory] §3 (Probability Distributions).
* Video is optional for all students at all times, although if you're willing/able to keep the instructor company in the video channel that would be nice.
* If you need to excuse yourself at any time and for any reason you may do so.
* Children, family, pets, roommates, and others with whom you may share your workspace are welcome to join our class as needed.


'''Assignment (Complete Before Class):'''
==== Syllabus revisions ====


* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 3]]
This syllabus will be a dynamic document that will evolve throughout the quarter. Although the core expectations are fixed, the details will shift. As a result, please keep in mind the following:


'''Lectures:'''
# '''Assignments and readings are ''frozen'' 1 week before they are due.''' I will not add readings or assignments less than one week before they are due. If I forget to add something or fill in a "To Be Determined" less than one week before it's due, it is dropped. If you plan to read or work more than one week ahead, contact me first.
# '''Substantial changes to the syllabus or course materials will be announced.''' Please closely monitor your email and/or [https://canvas.northwestern.edu the announcements section on the course website on Canvas]. When I make changes, these changes will be recorded in [https://wiki.communitydata.science/index.php?title=Statistics_and_Statistical_Programming_(Fall_2020)&action=history  the edit history of this page] so that you can track what has changed. I will also do my best to summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
# '''The course design may adapt throughout the quarter.''' As this is a new format for this course, I may iterate and prototype course design elements rapidly along the way. To this end, I will ask you for voluntary anonymous feedback — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback and I expect to do so again.


* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w03-R_lecture.Rmd Week 3 R lecture materials] (.Rmd file)
==== Statistics and power ====
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w03.webm Week 3 screencast (19 minutes)]


'''Resources:'''
The subject matter of this course—statistics and statistical programming—has historical and present-day affinities with a variety of oppressive ideologies and projects, including white supremacy, discrimination on the basis of gender and sexuality, state violence, genocide, and colonialism. It has also been used to challenge and undermine these projects in various ways. I will work throughout the quarter to acknowledge and represent these legacies accurately, at the same time as I also strive to advance equity, inclusion, and justice through my teaching practice, the selection of curricular materials, and the cultivation of an inclusive classroom environment. Please see my [[User:Aaronshaw/Classroom_policies|general classroom policies]] for more on some of these topics.


* [https://www.openintro.org/download.php?file=os3_slides_03&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §3 Lecture Notes]
== Schedule (with all the details) ==
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 videos for §3.1 and §3.2


=== Week 4: Thursday April 25: Statistical significance and hypothesis testing ===
When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y (inclusive) of chapter n.
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 4]]


'''Required Readings:'''
=== Week 1 (9/17) ===
==== September 17: Intro and setup ====


* Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference)
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w01_session_plan|Session plan]]


'''Recommended Readings:'''
<blockquote>''Note: Aaron doesn't actually expect you to complete these before class on September 17''</blockquote>
* Verzani: §7 (Statistical inference), §8 (Confidence intervals)
* [https://seeing-theory.brown.edu/ Seeing Theory] §4 (Frequentist Inference)


'''Assignment (Complete Before Class):'''
'''Required'''
* Read this syllabus, discuss any questions/concerns with the teaching team.
* Complete [https://apps3.cehd.umn.edu/artist/user/scale_select.html pre-course assessment of statistical concepts] (access code TBA via email). Estimated time to do this is 30-40 minutes. '''Submission deadline: September 18, 11:00pm Chicago time'''
* Confirm course registration and access to [https://www.openintro.org/book/os/ the textbook] (pdf download available for $0 and b&w paperbacks for $20) as well as any software and web-services you'll need for the course (Zoom, Discord, Canvas, this wiki, R, RStudio). Discord invites will be sent via email.
* Complete [https://wiki.communitydata.science/Statistics_and_Statistical_Programming_(Fall_2020)/pset0 problem set #0]


* [https://docs.google.com/forms/d/e/1FAIpQLScMkAPwWQUjB4C5wtbkemkNZYjNl3ipO4Dg5wsORFmdfduEtA/viewform?usp=sf_link Mid-quarter course evaluation survey] (by Monday please!)
'''Recommended'''
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 4]]
* Work through one (or more) introduction(s) to R and RStudio so that you can complete problem set 0. Here are several suggestions:
** '''From Aaron:''' The [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w01-R_tutorial.html Week 01 R tutorial] (you should also download the [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w01-R_tutorial.rmd .rmd version of the tutorial] that you can open and read/edit in RStudio). These are accompanied by the R and RStudio intro screencasts ([https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w01-s01-intro.webm Part 1] and [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w01-s02-intro.webm Part 2]) Aaron created for the 2019 version of the course.
** Modern Dive [https://moderndive.netlify.app/index.html Statistical inference via data science] Chapter 1: [https://moderndive.netlify.app/1-getting-started.html Getting started with R].
** [https://rladiessydney.org/courses/ryouwithme/ RYouWithMe] course [https://rladiessydney.org/courses/ryouwithme/01-basicbasics-0/ "Basic basics" 1 & 2] (and maybe 3 if you're feeling ambitious).
** Verzani §1 (Getting started).
** Healy §2 (Get started).


'''Lectures:'''
=== Week 2 (9/22, 9/24) ===
*[https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w04-R_lecture.Rmd Week 4 R lecture materials] (.Rmd file)
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w02_session_plan|Session plans]]
*(No screencast for this week)
==== September 22: Data and variables ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §1.1-1.3 (Introduction to data).
* Watch [https://www.youtube.com/playlist?list=PLkIselvEzpM6pZ76FD3NoCvvgkj_p-dE8 Lecture materials for §1.1-3 (Videos 1-4 in the playlist)].
* Submit, review, and respond to questions or requests for discussion via Discord or some other means.


'''Resources:'''
==== September 24: Numerical and categorical data ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §2.1-2 (Numerical and categorical data).
* Review [https://www.youtube.com/playlist?list=PLkIselvEzpM6pZ76FD3NoCvvgkj_p-dE8 Lecture materials for §2.1 and §2.2 (Videos 6-7 in the playlist)].
* Complete '''exercises from OpenIntro §2:''' 2.12, 2.13, 2.16, 2.20, 2.23, 2.30 (and remember that solutions to odd-numbered problems are in the book!)
* Submit, review, and respond to questions or requests for discussion via Discord or some other means.


* [https://www.openintro.org/download.php?file=os3_slides_04&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §4 Lecture Notes]
=== Week 3 (9/29, 10/1) ===
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 7 videos for nearly all of §4


=== Week 5: Thursday May 2: Continuous Numeric Data & ANOVA ===
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w03_session_plan|Session plans]]


* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 5|Session plan]]
==== September 29: R fundamentals: Import, transform, tidy, and describe data ====
'''Required'''
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset1|problem set #1]] (due Monday, September 28 at 1pm Central)


'''Required Readings:'''
'''Recommended'''
 
* [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w03-R_tutorial.html Week 3 R tutorial] (note that you can access .rmd or .pdf versions by replacing the suffix of the URL accordingly).
* Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data)
* Additional material from any of the recommended R learning resources suggested last week or elsewhere in the syllabus. In particular, you may find the ModernDive, RYouWithMe, Healy, and/or Wickham and Grolemund resources valuable.
<!---* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF from Hill's website]]--->
* Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. ''Public Relations Review'', 33(3), 340–342. [[https://doi.org/10.1016/j.pubrev.2007.05.016 Available through NU Libraries]]
* Reinhart, §1
 
'''Recommended Readings:'''
* Verzani: §9 (significance tests), §12 (Analysis of variance)
* Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” ''The American Statistician'' 60(4):328–31. [[http://dx.doi.org/10.1198/000313006X152649 Available through NU Libraries]]
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 5]]
 
'''Lectures:'''
* No new R material for this week.
<!---
<!---
* [[Statistics and Statistical Programming (Spring 2019)/R lecture outline: Week 5]]
'''Resources'''
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-ttests_and_anova.ogv Week 5 R lecture screencast: t-tests] (~22 minutes)
* [https://science.sciencemag.org/content/187/4175/398 UCB admissions paper]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-for_if.ogv Week 5 R lecture screencast: for loops and if statements] (~12 minutes)
* [https://openpolicing.stanford.edu Stanford OpenPolicing Project]
--->
--->


'''Resources:'''
==== October 1: Probability ====
 
'''Required'''
* [https://www.openintro.org/download.php?file=os3_slides_05&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §5 Lecture Notes]
* Read Diez, Çetinkaya-Rundel, and Barr: §3 (Probability).  
 
* Watch [https://www.youtube.com/watch?list=PLkIselvEzpM5EgoOajhw83Ax_FktnlD6n&v=rG-SLQ2uF8U Probability introduction] and [https://www.youtube.com/watch?v=HxEz4ZHUY5Y&list=PLkIselvEzpM5EgoOajhw83Ax_FktnlD6n&index=2 Probability trees] OpenIntro lectures (just videos 1 and 2 in the playlist).
=== Week 6: Thursday May 9: Categorical data ===
* Complete '''exercises from OpenIntro §3:''' 3.12, 3.15, 3.22, 3.28, 3.34, 3.38
 
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 6|Session plan]]
'''Required Readings:'''
 
* Diez, Barr, and Çetinkaya-Rundel: §6.1-6.4 (Inference for categorical data).
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on Hill's personal website]]
* Reinhart, §4 and §5.
 
'''Recommended Readings:'''
* Diez, Barr, and Çetinkaya-Rundel: §6.5-6.6 (Small samples and randomization inference)
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
* Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” ''American Scientist'' 102(6):460. [[https://www.americanscientist.org/issues/pub/2014/6/the-statistical-crisis-in-science/1 Available through NU Libraries]] (This is a reworked version of [http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf this unpublished manuscript] which provides more detailed examples.)
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 6]]
 
'''Lectures:'''
*[https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w06-R_lecture.Rmd Week 6 R lecture materials] (.Rmd file)
*(No screencast for this week)
 
'''Resources:'''
* [https://www.openintro.org/download.php?file=os3_slides_06&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §6 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7
 
=== Week 7: Thursday May 16: Linear Regression ===
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 7|Session plan]]
'''Required Readings:'''
 
* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression)
* OpenIntro eschews a mathematical approach to correlation. Look over [https://en.wikipedia.org/wiki/Correlation_and_dependence the Wikipedia article on correlation and dependence] and pay attention to the formulas. It's tedious to compute, but you should be aware of what goes into it.
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available via NU libraries]]
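
As a minimal sketch of what goes into the correlation computation mentioned in the readings above, the sample Pearson correlation coefficient for paired observations can be written as follows. The notation (<code>x_i</code>, <code>y_i</code>, their means, and <code>n</code>) and the three-point dataset are assumptions chosen for illustration; they are not taken from OpenIntro or from the assigned paper.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Worked example with made-up data: x = (1, 2, 3), y = (1, 3, 2), so \bar{x} = \bar{y} = 2
% Deviations from the means: x - \bar{x} = (-1, 0, 1) and y - \bar{y} = (-1, 1, 0)
r = \frac{(-1)(-1) + (0)(1) + (1)(0)}{\sqrt{(1 + 0 + 1)(1 + 1 + 0)}}
  = \frac{1}{\sqrt{4}} = 0.5
```

In R, <code>cor(x, y)</code> computes this same quantity by default (<code>method = "pearson"</code>), so the hand calculation is mainly useful for seeing which terms drive the result.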
 
'''Recommended Readings:'''
* Verzani: §11.1-2 (Linear regression).
* [https://seeing-theory.brown.edu/ Seeing Theory] §5 (Regression Analysis)
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 7]]
* Final project planning document (see details above!)
 
'''Lectures:'''
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w07-R_lecture.Rmd Week 7 R lecture materials]
 
'''Resources:'''
* [https://www.openintro.org/download.php?file=os3_slides_07&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §7 Lecture Notes]
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7 and 3 videos on the sections §8.1-8.3
 
=== Week 8: Thursday May 23: Polynomial Terms, Interactions, and Logistic Regression ===
* [[Statistics_and_Statistical_Programming_(Spring_2019)/Session plan: Week 8|Session plan]]
 
'''Required Readings:'''
* Diez, Barr, and Çetinkaya-Rundel: §8 (Multiple and logistic regression)
* [https://onlinecourses.science.psu.edu/stat501/node/301 Lesson 8: Categorical Predictors] and [https://onlinecourses.science.psu.edu/stat501/node/318 Lesson 9: Data Transformations] from the Penn State Eberly College of Science STAT 501 Regression Methods Course. There are several subparts (many quite short); please read them all carefully.
* (Revisit) Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available via NU libraries]]
* Reinhart, §8 and §9.


'''Recommended Readings:'''
'''Resources'''
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* [https://seeing-theory.brown.edu/index.html#secondPage Seeing Theory §1-2 (Basic Probability and Compound Probability)]
* Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” ''PLoS Medicine'' 2(8):e124. [[http://dx.doi.org/10.1371%2Fjournal.pmed.0020124 Open Access]]
* Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” ''PLOS Biology'' 13(3):e1002106. [[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 Open Access]]


'''Assignment (Complete Before Class):'''
=== Week 4 (10/6, 10/8) ===
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w04_session_plan|Session plans]]


* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 8]]
==== October 6: Emotional contagion and more advanced R fundamentals: import, tidy, transform, and simulate data; write functions ====
'''Required'''
* Read the paper below as well as the attendant [https://www.pnas.org/content/111/29/10779.1 "Expression of editorial concern"] and [https://www.pnas.org/content/111/29/10779.2 "Correction"] that were subsequently appended to it.
:Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Open access]]
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset2|problem set #2]] (due Monday, October 5 at 1pm CT)


'''Lectures:'''
'''Recommended'''
*[https://communitydata.science/~ads/teaching/2019/stats/r_lectures/w08-R_lecture.Rmd Week 8 R lecture materials]
* [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w04-R_tutorial.html Week 4 R tutorial] (as usual, also available as .rmd or .pdf)


'''Resources:'''
==== October 8: Distributions ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §4.1-3 (Normal and binomial distributions).
* Watch [https://www.youtube.com/watch?list=PLkIselvEzpM6V9h55s0l9Kzivih9BUWeW&v=S_p5D-YXLS4 normal and binomial distributions] OpenIntro lectures (videos 1-3 in the playlist).
* Complete '''exercises from OpenIntro §4:''' 4.4, 4.6, 4.15, 4.22


* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
'''Resources'''
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including a video on §8.4
* [https://seeing-theory.brown.edu/index.html#secondPage/chapter3 Seeing Theory §3 (Probability distributions)]
* Mako Hill wrote this document which will likely be useful for many of you: [https://communitydata.cc/~mako/2017-COM521/logistic_regression_interpretation.html Interpreting Logistic Regression Coefficients with Examples in R]


=== Week 9: Thursday May 30: Loose ends and Final Presentations (part 1)  ===
==== October 9: [[#Research project plan and dataset identification|Research project plan and dataset identification]] due by 5pm CT ====
*'''Submit via [https://canvas.northwestern.edu/courses/122522/assignments Canvas]''' (due by 5pm CT)


* [[Statistics_and_Statistical_Programming_(Spring_2019)/Session plan: Week 9|Session plan]]
=== Week 5 (10/13, 10/15) ===
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w05_session_plan|Session plans]]
==== October 13: Descriptive analysis and visualization of data ====
'''Required'''
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset3|problem set #3]] (due Monday, October 12 at 1pm CT)


'''Required readings:'''
'''Recommended'''
* Reinhart, §10 and §11.
* [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w05-R_tutorial.html Week 5 R tutorial] and [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w05a-R_tutorial.html Week 5 R tutorial supplement] (both, as usual, also available as .rmd or .pdf).


'''[[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|Final presentations]]: (part 1)'''
==== October 15: Foundations for (frequentist) inference ====
* First batch today. The rest next week.
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §5 (Foundations for inference).
* Watch [https://www.youtube.com/watch?v=oLW_uzkPZGA&list=PLkIselvEzpM4SHQojH116fYAQJLaN_4Xo foundations for inference] (videos 1-3 in the playlist) OpenIntro lectures.
* Complete [https://www.openintro.org/book/stat/why05/ Why .05?] OpenIntro video/exercise.
* Complete '''exercises from OpenIntro §5:''' 5.4, 5.8, 5.10, 5.17, 5.30, 5.35, 5.36


'''Resources:'''
'''Resources'''
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w09-R_lecture.html Week 9 R-lecture] (we will use this in class)
* Kelly M., [https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2013.00693.x Emily Dickinson and monkeys on the stair Or: What is the significance of the 5% significance level?] ''Significance'' 10:5. 2013.
* [https://seeing-theory.brown.edu/index.html#secondPage/chapter4 Seeing Theory §4 (Frequentist Inference)]


=== Week 10: Thursday June 6: Fully reproducible research example, Replications, Final Presentations (part 2), and wrap-up ===
=== Week 6 (10/20, 10/22) ===
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w06_session_plan|Session plans]]
==== October 20: Reinforced foundations for inference ====
'''Required'''
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset4|problem set #4]] 
* Read Reinhart, §1.
* Revisit the Kramer et al. (2014) paper we read a few weeks ago:
:Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Open access]] 


* Fully [https://www.overleaf.com/read/tkdpdcspwtkp reproducible research example].
==== October 22: Inference for categorical data ====
* [https://canvas.northwestern.edu/courses/90927/files/folder/resources/Straub-Cook%20Replication Research replication study] by Polly Straub-Cook (UW Comm. Ph.D. student)
'''Required'''
:: (n.b.: cluster & heteroscedasticity robust standard errors!)
* Read Diez, Çetinkaya-Rundel, and Barr: §6 (Inference for categorical data).
* Watch [https://www.youtube.com/watch?list=PLkIselvEzpM5Gn-sHTw1NF0e8IvMxwHDW&v=_iFAZgpWsx0 inference for categorical data] (videos 1-3 in the playlist) OpenIntro lectures.
* Complete '''exercises from OpenIntro §6:''' 6.10, 6.16, 6.22, 6.30, 6.40 (just parts a and b; part c gets tedious)


* '''[[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|Final presentations]]: (part 2)'''
'''Resources'''
:: Second batch of presenters today.
* [https://gallery.shinyapps.io/CLT_prop/ OpenIntro Central limit theorem for proportions demo].
* Closing thoughts
:: What next? Beyond your final projects...
:: Class social gathering


Followed by much rejoicing!
=== Week 7 (10/27, 10/29) ===
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w07_session_plan|Session plans]]
==== October 27: Applied inference for categorical data ====
'''Required'''
* Read Reinhart, §4 and §5 (both are quite short).
* Skim the following (all are referenced in the problem set)
**  Aronow PM, Karlan D, Pinson LE. (2018). The effect of images of Michelle Obama’s face on trick-or-treaters’ dietary choices: A randomized control trial. PLoS ONE 13(1): e0189693. [https://doi.org/10.1371/journal.pone.0189693 https://doi.org/10.1371/journal.pone.0189693]
** Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on Hill's personal website]]
** Shaw, Aaron and Yochai Benkler. 2012. A tale of two blogospheres: Discursive practices on the left and right. ''American Behavioral Scientist''. 56(4): 459-487. [[https://doi.org/10.1177%2F0002764211433793 available via NU libraries]]
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset5|problem set #5]]
'''Resources'''
* [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w06-R_tutorial.html Week 06 R tutorial] (it's very short!)


== Policies ==
==== October 29: Inference for numerical data (part 1) ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §7.1-3 (Inference for numerical data: differences of means).
* Watch [https://www.youtube.com/watch?list=PLkIselvEzpM5G3IO1tzQ-DUThsJKQzQCD&v=uVEj2uBJfq0 inference for numerical data] (videos 1-4 in the playlist) OpenIntro lectures (and featuring one of the textbook authors!).
* Complete '''exercises from OpenIntro §7:''' 7.12, 7.24, 7.26


=== Attendance ===
'''Resources'''
* [https://gallery.shinyapps.io/CLT_mean/ OpenIntro Central limit theorem for means demo].


Attendance in class is expected of all participants. If you need to miss class for any reason, please contact me ahead of time (email is best). Multiple unexplained absences will likely result in a lower grade or (in extreme circumstances) a failing grade. In the event of an absence, you are responsible for obtaining class notes, handouts, assignments, etc. You are also still responsible for turning in any assignments on time unless you make prior arrangements with me.
==== October 30: [[#Research project planning document|Research project planning document]] due 5pm CT ====
* Submit via [https://canvas.northwestern.edu/courses/122522/assignments/787297 Canvas] (due by 5pm CT)


=== In-class device usage ===
=== Week 8 (11/3, 11/5) ===
==== November 3: U.S. election day (no class meeting) ====


Please refrain from any uses of digitally networked devices or other distraction machines that do not directly contribute to your engagement with the course material. If you struggle to comply with this policy, I may recommend you temporarily put away your device(s) or leave the classroom.
==== November 4: Interactive self-assessment due ====
* Please submit results [https://canvas.northwestern.edu/courses/122522/assignments/799630 (via Canvas)] from the [https://communitydata.science/~ads/teaching/2020/stats/assessment/interactive_assessment.rmd interactive self-assessment] by 5pm CT.


=== Peers’ Work and In-Class Discussions ===
==== November 5: Inference for numerical data (part 2) ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §7.4-5 (Inference for numerical data: power calculations, ANOVA, and multiple comparisons).
* Watch [https://www.youtube.com/watch?list=PLkIselvEzpM5G3IO1tzQ-DUThsJKQzQCD&v=uVEj2uBJfq0 inference for numerical data] (videos 4-8 in the playlist) OpenIntro lectures (and featuring one of the textbook authors!).
* Complete '''exercises from OpenIntro §7:''' 7.42, 7.44, 7.46


Throughout the course, you may receive, read, collaborate, and/or comment on classmates’ work. These assignments are for class use only. You may not share them with anybody outside of class without explicit written permission from the document’s author and pertaining to the specific piece.
'''Resources'''
* [https://www.openintro.org/go/?id=stat_better_understand_anova&referrer=/book/os/index.php OpenIntro supplement on ANOVA calculations] (useful if you think you'll be doing more ANOVAs).


It is essential to the success of this class that all participants feel comfortable discussing questions, thoughts, ideas, fears, reservations, apprehensions and confusion about works-in-progress, statistical concepts, independent research, and more. Therefore, you may not create any audio or video recordings during class time, share verbatim comments with those not in class, or share by other means (e.g., social media) any comments linked to people’s identities unless you get clear and explicit permission. If you want to share general impressions or specifics of in-class discussions with those not in class, please do so without disclosing personal identities or details.
=== Week 9 (11/10, 11/12) ===
==== November 10: Applied inference for numerical data (t-tests, power analysis, ANOVA) ====
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w09_session_plan|Session plans]]


=== Academic Integrity ===
'''Required'''
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset6|problem set #6]]


You are responsible for reading and abiding by the Northwestern University [https://www.northwestern.edu/provost/policies/academic-integrity/principles.html Principles Regarding Academic Integrity]. Personally, I expect you to exceed the minimal standards elaborated in those principles and to strive for admirable, extraordinary conduct in every aspect of your academic career. Feel free to ask me (the instructor) for clarification about this or related matters.
'''Resources'''
* [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w09-R_tutorial.html Week 09 R tutorial]


=== Deadlines ===  
==== November 12: Linear regression ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §8 (Linear regression).
* Watch [https://www.youtube.com/playlist?list=PLkIselvEzpM63ikRfN41DNIhSgzboELOM linear regression] (videos 1-4 in the playlist) OpenIntro lectures.
* Read [https://www.openintro.org/go/?id=stat_more_inference_for_linear_regression&referrer=/book/os/index.php More inference for linear regression] (OpenIntro supplement).
* Complete '''exercises from OpenIntro §8:''' 8.6, 8.36, 8.40, 8.44
* Complete '''exercises from OpenIntro supplement:''' 4 and 5 (answers provided in the supplement).
'''Resources'''
* [https://seeing-theory.brown.edu/index.html#secondPage/chapter6 Seeing Theory §6 (Regression analysis)]


Emergencies happen. Unanticipated obstacles arise. If you cannot make a deadline, please contact me to figure out a schedule that will work. The more proactive and responsible you are, the more receptive I am likely to be.
=== Week 10 (11/17, 11/19) ===
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w10_session_plan|Session plans]]
==== November 17: Applied linear regression ====
'''Required'''
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset7|Problem set #7]]


A word about extensions and incompletes: I strongly discourage them. In principle, I have no problem with extensions or incompletes. In practice, they tend to be a pain for everybody involved. If you absolutely must submit an assignment late, assume that I may require up to 1 month (4 weeks) to grade it. Please take this into account if you will need me to submit a grade in order to receive your fellowship/diploma/visa/etc. by a particular date.
'''Resources'''
* [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/w10-R_tutorial.html Week 10 R tutorial]
==== November 19: Multiple and logistic regression ====
'''Required'''
* Read Diez, Çetinkaya-Rundel, and Barr: §9 (Multiple and logistic regression). (Skim §9.2-9.4)
** '''Disclaimer:''' Aaron doesn't like §9.2-9.3, but it should be useful to understand and discuss them, so we'll do that.
* Watch [https://www.youtube.com/playlist?list=PLkIselvEzpM5f1HYzIjFt52SD4izsJ2_I multiple and logistic regression] (videos 1-4 in the playlist) OpenIntro lectures.
* Read [https://www.openintro.org/go/?id=stat_interaction_terms&referrer=/book/os/index.php Interaction terms] (OpenIntro supplement).
* Read [https://www.openintro.org/go/?id=stat_nonlinear_relationships&referrer=/book/os/index.php Fitting models for non-linear trends] (OpenIntro supplement).
* Complete '''exercises from OpenIntro §9:''' 9.4, 9.13, 9.16, 9.18


=== Accommodations ===
'''Resources'''


I am totally happy to provide accommodations for religious observance, physical needs, or other circumstances as needed. Any student requesting accommodations related to a disability or other condition is required to register with AccessibleNU (847-467-5530) and provide professors with an accommodation notification from AccessibleNU, preferably within the first two weeks of class. All information will remain confidential. For more information, visit [https://www.northwestern.edu/accessiblenu/ AccessibleNU].
=== Week 11 (11/24) ===
==== November 24: Applied multiple and logistic regression ====
;[[Statistics_and_Statistical_Programming_(Fall_2020)/w11_session_plan|Session plans]]
'''Required'''
* Complete [[Statistics_and_Statistical_Programming_(Fall_2020)/pset8|Problem set #8]]
'''Resources'''
* Mako Hill created (and Aaron updated) a brief tutorial on [https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/logistic_regression_interpretation.html interpreting logistic regression coefficients with examples in R]


=== Sexual Misconduct ===  
=== Week 12+ ===


All participants in this class are bound by the [https://www.northwestern.edu/sexual-misconduct/title-IX/university-policies/policy-on-sexual-misconduct.html Northwestern University sexual misconduct policy]. Please note that the core of the policy states, "Northwestern is committed to fostering an environment in which all members of our community are safe, secure, and free from sexual misconduct of any form, including, but not limited to, sexual assault, sexual exploitation, stalking, and dating and domestic violence." I take this very seriously. Please review the policy and speak to me if you have any questions or concerns.
==== December 3: [[#Research project presentation|Research project presentation]] due by 5pm CT ====
'''[https://canvas.northwestern.edu/courses/122522/discussion_topics/856868 Post your video via this "Discussion" on Canvas]'''. Please view and provide constructive feedback on others' videos!


=== Email protocol ===
* '''Post videos directly to the "Discussion."''' The Canvas text editor has an option to upload/record a video. That's what you want.
* '''Please remember not to over-work/think this.''' I mentioned this in class, but just to reiterate, the focus of this assignment should not be your video editing skills. Please do what you can to record and convey your ideas clearly without devoting insane hours to creating the perfect video.
* '''Some resources for recording presentations:''' There are a bunch of ways you might record/share your video. Some ideas include using the embedded media recorder in Canvas (!) that can record with your webcam (maybe attach a few visuals to accompany this?); recording a "meeting" with yourself in Zoom; and "Panopto," a piece of high-end video recording, sharing, and editing software that NU licenses for campus use. Here are some pointers:
** NU has a "digital learning resource hub" which provides some [https://digitallearning.northwestern.edu/resource-hub#for-students resources for students]. The first item in that list has pointers for recording yourself and posting to Canvas and includes info about the Canvas media recorder and Panopto.
** You should be able to use your NU zoom account to create a zoom meeting, record your meeting (in which you deliver your presentation and share your screen with any visuals), and then share a link to the recording via the "Recordings" item in the left-hand menu of your [https://northwestern.zoom.us/ https://northwestern.zoom.us/] account page.
** If nothing works, please get in touch.


I receive too much email and I sometimes fail to keep up. If, for some reason, I do not respond to a message related to this course within 48 hours, please do not take it personally and feel free to re-send the message with a polite reminder. This will help me and I will not resent you for it.
==== December 4: Post-course assessment of statistical concepts due by 11pm CT ====
Complete [https://apps3.cehd.umn.edu/artist/user/scale_select.html post-course assessment] (access code TBA via email). Submission deadline: December 4, 11:00pm Chicago time.


==== December 10: [[#Research project paper|Research project paper]] due by 5pm CT ====
'''[https://canvas.northwestern.edu/courses/122522/assignments/812317 Submit your paper, data, and code via Canvas].'''


=== Credit and Notes ===
== Credit and Notes ==


This syllabus has, in ways that should be obvious, borrowed and built on the [https://www.openintro.org/stat/index.php OpenIntro Statistics curriculum]. I also based nearly every aspect of the course design on Benjamin Mako Hill's [[Statistics_and_Statistical_Programming_(Winter_2017)|COM 521 class]].
This syllabus has, in ways that should be obvious, borrowed and built on the [https://www.openintro.org/stat/index.php OpenIntro Statistics curriculum]. Most aspects of this course design extend Benjamin Mako Hill's [[Statistics_and_Statistical_Programming_(Winter_2017)|COM 521 class]] from the University of Washington as well as a [[Statistics_and_Statistical_Programming_(Spring_2019)|prior iteration of the same course]] offered at Northwestern in Spring 2019.

Latest revision as of 03:08, 3 January 2021

Statistics and Statistical Programming
Media, Technology & Society (MTS) 525 and Communication Studies 395
Tuesdays & Thursdays 1-2:50pm CT
Fall 2020
Northwestern University
Course websites
Canvas for announcements, assignments, and some files.
Zoom for synchronous course meetings.
Discord for discussions and chat.
This wiki page for nearly everything else.
Instructor: Aaron Shaw (aaronshaw@northwestern.edu)
Office Hours: Thursday 10am-12pm and by appointment
Please use office hours signups (with location information)
Also usually available via chat during "business hours."
Teaching Assistant: Nick Vincent (nickvincent@u.northwestern.edu)
Office Hours: Monday 10am-12pm and by appointment. I'll try to respond to any asynchronous questions in a timely fashion during "business hours" (9a-5p Central Time), and will also have OH by appointment. I'll respond best to email (above), but am also happy to use Discord for quicker back-and-forth.
I am happy to try out alternative communication software for OH!



Course information

Overview and learning objectives

This course provides a get-your-hands-dirty introduction to inferential statistics and statistical programming, mostly for applications in the social sciences and social computing. My main objectives are for all participants to acquire the conceptual, technical, and practical skills to conduct your own statistical analyses and become more sophisticated consumers of quantitative research in communication, human-computer interaction (HCI), and adjacent disciplines.

I will consider the course a complete success if every student is able to do all of the following things at the end of the quarter:

  • Design and execute a quantitative research project that involves statistical inference, start to finish.
  • Read, modify, and create short programs in the R statistical programming language.
  • Feel comfortable reading and interpreting papers that use basic statistical techniques.
  • Feel prepared to enroll in more specialized and advanced statistics courses.

The course will cover a number of techniques, likely including the following: t-tests; chi-squared tests; ANOVA; linear regression; and logistic regression. We will also consider salient issues in quantitative research such as reproducibility and "the statistical crisis in science." We may cover other topics as time and interest allow.

The course materials will consist of readings, problem sets, assessment exercises, and recorded lectures and screencasts (some created by me, some created by other people). The course requirements will emphasize active participation and self-evaluation, and will include a final project focused on the design and execution of an original piece of quantitative research. We will use the R programming language for all examples and assignments.

You are not required to know much about statistics or statistical programming to take this class. I will assume some (very little!) knowledge of the basics of empirical research methods and design, basic algebra and arithmetic, and a willingness to work to learn the rest. In general we are not going to cover most of the math behind the techniques we'll be learning. Although we may do some math, this is not a math class. This course will also not require knowledge of calculus or matrix algebra. I will *not* do proofs on the board. Instead, the class is unapologetically focused on the application of statistical methods. Likewise, while some exposure to R, other programming languages, or other statistical computing resources will be helpful, it is not assumed.

Why this course? Why statistical programming? Why R?

Many comparable courses in statistics and quantitative methods do not emphasize statistical programming. So why bother? By learning statistical programming you will gain a deeper understanding of both the principles behind your analysis techniques and the tools you use to apply those techniques. In addition, a solid grasp of statistical programming will prepare you to create reproducible research, avoid common errors, and enable both greater durability and validity of your work.

Other programming languages are also well suited to statistics, including Stata and Python. I do most of my work with R, so that guides my choice for the course. That said, I opt to use and teach with R for a few reasons:

  • R is freely available and open source.
  • R is the most widely used package in statistics and several social scientific fields.
  • R (along with Stata) will be used in most of the advanced stats classes I hope you will take after this course.
  • R is a better general-purpose programming language than Stata, which means that R programming skills will let you solve non-statistical problems and may make it easier to learn other programming languages like Python.

Format and structure

This course will proceed in a remote format that includes asynchronous and synchronous elements (more on those below). In general, the organization of the course adopts a "flipped" approach where participants consume, discuss, and process instructional materials outside of "class" and we use synchronous meetings to answer questions, address challenges or concerns, work through solutions, and hold semi-structured discussions.

The course introduces both basic statistical concepts as well as applications of those concepts through statistical programming. As a result, we will usually dedicate part of each week to a particular set of concepts and part of each week to applied data analysis and/or interpretation. A brief description of how I expect it all to work follows below. We'll talk about it more during the first class session.

Asynchronous elements of the course

These include all readings, recorded lectures/slides, tutorials, textbook exercises, problem sets, and other assignments. I expect you to complete (or at least attempt to complete!) these outside of our class meeting times. I also strongly encourage you to identify, submit, and discuss questions about the material before each class meeting whenever possible.

We will use Discord for everyday discussions and chat related to the course. In general, the teaching team will try to keep an eye on the various server channels during "business hours." To the extent that we can respond to questions and concerns there, we'll do so. We'll also use the discussion channels to identify topics that might benefit from synchronous conversation during the course meetings. Hopefully, writing and talking about questions and concerns outside of the synchronous course meetings will help support accountability, learning, and more effective use of our meeting time.

For nearly all of the "instructional" material introducing particular statistical concepts and techniques, you are assigned materials from the OpenIntro textbook and lecture materials created by the textbook authors. Please note that this means I will not deliver lectures during our class meetings. Please also note that this means you are responsible for coordinating your working groups and any collaborative work with other members of the class outside of our class meeting times.

Synchronous elements of the course

The synchronous elements of the course will be the two weekly class meetings that will happen via video conference (Zoom). These are scheduled to run for a maximum of 110 minutes. Each session will include multiple short breaks.

We will use the class meetings to discuss and work through any questions or challenges you encounter in the materials assigned for that day. This means that I encourage you to identify, submit, and discuss questions about the material before each class meeting whenever possible. Doing so will give the teaching team time to sift, sort, and organize the questions into a hopefully-cohesive plan for each class session that is tailored to the specific concerns you encounter in the material. Obviously, we anticipate that questions will arise during the class sessions too, and we'll do our best to adapt as we go.

A couple of other notes about the synchronous course meetings:

  • Aaron plans to record the course meetings and have them available to class participants only via Zoom/Canvas. Please get in touch if you have concerns or requests about this.
  • The teaching team will do our best to notice and respond to any questions or comments that come up via Discord or Zoom during the class. Please do what you can to support these efforts.
  • You might want to create/acquire something like NU Mechanical Engineering Professor Michael Peshkin's homebrew document camera to facilitate sharing hand-written notes/drawings during class.

In addition, because randomness is extremely important in statistics, I may occasionally randomly assign different working groups to share and discuss their solutions to selected textbook exercises or problem set questions during class. These random assignments will be announced ahead of time so that the group has an opportunity to prepare. The idea here is to structure some participation in the synchronous sessions to ensure an equitable distribution of the responsibility to discuss questions, answers, points of confusion, and alternatives.
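
A minimal sketch of how such a random assignment might be drawn in R. The group names and the seed are hypothetical and chosen only for illustration; the exercise numbers are borrowed from the OpenIntro §8 assignment. Setting a seed means anyone can reproduce the same draw.

```r
# Hypothetical working groups (names made up for this example)
groups <- c("Group A", "Group B", "Group C", "Group D")

# Exercises to discuss (taken from the OpenIntro section 8 assignment)
exercises <- c("8.6", "8.36", "8.40", "8.44")

# Fix the random seed (arbitrary value) so the assignment is reproducible
set.seed(525)

# sample() with no size argument returns a random permutation of its input,
# so each group is paired with exactly one exercise
assignment <- data.frame(
  exercise = exercises,
  group = sample(groups),
  stringsAsFactors = FALSE
)

print(assignment)
```

Re-running the script with the same seed yields the same pairing, while changing the seed produces a fresh draw.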

Working groups

At the start of the course you will be assigned to a small working group. This will be a group of 2-3 students (exact numbers will depend on the final enrollment) with whom you may meet outside of class time to discuss, complete, and/or review your weekly assignments (as well as some of the research project assignments). The groups will rotate at least once during the quarter to ensure that you get to work with different members of the class. The main idea is to support collaborative learning, peer support, and accountability. While the specifics of exactly when and how you work with your working group will largely be up to you, the teaching team will provide suggestions in the form of a template that you can use as a starting point.

As a general rule, we strongly encourage you to collaborate with members of your working group on any/all weekly (minor) assignments. You may, if you choose, also collaborate with others in your group or the class on your research project (major) assignments; however, collaborative research projects should be discussed with a member of the teaching team and all research project assignment submissions should include the names of all collaborators.


Textbook, readings, and resources

This class will use a freely-licensed textbook:

  • Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. 2019. OpenIntro Statistics. 4th edition. OpenIntro, Inc.

The textbook (in any format) is required for the course. You can download it at no cost and purchase hard copy versions in either full color ($60) or in black and white ($20). The B&W version is very affordable and I strongly recommend buying a hard copy for the purposes of the course and subsequent reference use. The book is excellent and has been adopted widely. It has also developed a large online community of students and teachers who have shared other resources. Lecture slides, videos, notes, and more are all freely licensed (many through the website and others elsewhere).

I will also assign several chapters from the following:

This book provides a readable conceptual introduction to some common failures in statistical analysis that you should learn to recognize and avoid. It was also written by a Ph.D. student. You have access to an electronic copy via the NU library (you'll need to sign-in and/or use the NU VPN to access it), but you may find it helpful to purchase as well.

A few other books may be useful resources while you're learning to analyze, visualize, and interpret statistical data with R. I will share some advice about these during the first class meeting:

  • Healy, Kieran. 2019. Data Visualization: A Practical Introduction. Princeton, NJ: Princeton UP. (via Healy's website)
  • Teetor, Paul. 2011. R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics. 1st edition. Sebastopol, CA: O’Reilly Media. (Safari Proquest/NU Libraries; Various Sources; Amazon)
  • Verzani, John. 2014. Using R for Introductory Statistics. 2nd edition. Boca Raton: Chapman and Hall/CRC. (Various Sources; Amazon)
  • Wickham, Hadley. 2010. ggplot2: Elegant Graphics for Data Analysis. 1st edition, corrected 3rd printing. New York: Springer. (Springer/NU Libraries; Various Sources)
  • Wickham, Hadley and Grolemund, Garrett. 2017. R for Data Science. Sebastopol, CA: O'Reilly. (Online version).

There are also some invaluable non-textbook resources:

  • Baggott's R Reference Card v2 — Print this out. Take it with you everywhere and look at it dozens of times a day. You will learn the language faster!
  • StackOverflow R Tag — Somebody already had your question about how to do X in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time, but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise.
  • Rseek — Rseek is a modified version of Google that just searches R websites online. Sometimes, R is hard to search because R is a common letter. This has become much easier over time as R has become more popular, but it can still be an issue sometimes and Rseek is a good solution.
  • ggplot2 documentation — ggplot is a powerful data visualization package for R that I recommend highly. The documentation is indispensable for learning how to use it.
  • Statistical Analysis and Reporting in R — A set of resources created and distributed by Jacob Wobbrock (University of Washington, School of Information) in conjunction with a MOOC he teaches. Contains cheatsheets, code snippets, and data to help execute commonly encountered statistical procedures in R.
  • DataCamp offers introductory R courses. Northwestern usually has some free accounts that get passed out via Research Data Services each quarter. Apparently, if you are taking or teaching relevant coursework, instructors can request free access to DataCamp for their courses from DataCamp. If folks are interested in this, I can reach out.

Computing resources:

  • If you are planning to analyze large-scale data (i.e., data that won't fit in memory on your laptop) then you will want to sign up for a research allocation on Quest, which is Northwestern's high-performance computing cluster. Instructions on how to do that are here.

Weekly (minor) assignments

In order to support continuous progress towards the learning goals for the course, I have assigned some textbook exercises or a problem set ahead of every class. These assignments will provide the basis on which the teaching team will assess and provide feedback on your participation and engagement with the course material.

The first week or so of the course is textbook-focused to get us warmed up. Starting in week 2, we will do more statistical programming and apply the textbook concepts using R and RStudio. In general, we will cover the problem sets in the first session of the week and the textbook materials in the second session.

Textbook exercises

The focus is on self-assessment of your understanding of the textbook material and you do not need to hand in anything. I expect that you will work on the exercises, review and discuss solutions, and submit any questions ahead of or during class. Please note that solutions to odd-numbered problems appear in the back of the book. The teaching team will distribute solutions to even-numbered problems as well.

Problem sets

The course will include problem sets and these may incorporate several kinds of questions:

  • Statistics questions about statistical concepts and principles.
  • Programming challenges that you should solve using R.
  • Empirical paper questions about other assigned readings.

For the problem sets, I ask that you submit your work via Canvas 24 hours before class (i.e., Monday afternoon for our Tuesday class sessions). Details of exactly how this will work will be elaborated during the first class. For the programming challenges, you should submit code and text for your solutions (again, more on how later). If you get completely stuck on a problem, that's okay, but please provide whatever you have.

Problem sets will be evaluated on a complete/incomplete basis. Although the problem sets will not be assigned a letter grade, they are a central focus of the course and completing them will support your mastery of the material in multiple ways. Working through them on schedule will also make it possible for you to participate in the synchronous course meetings and online discussions of course material effectively. Your ability to do so will figure prominently in your participation grade for the course (see the section on grading and assessment below).

Research project (major) assignments

Overview

As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all:

  • Design and describe a plan for a study — The study you design should involve quantitative analysis and should be something you can complete at least a first pass on during this quarter.
  • Find a dataset — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or an analysis of a previously collected dataset.
  • Engage in descriptive data analysis — Use R to calculate descriptive statistics and visualizations to describe your data.
  • Motivate and test at least one hypothesis about relationships between two or more variables — I'm happy to discuss alternatives to formal hypothesis testing procedures (even if some of them are beyond the scope of this course).
  • Report and interpret your findings — You will do this in both a short paper and a short (recorded) presentation.
  • Ensure that your work is replicable — You will need to provide code and data for your analysis in a way that makes your work replicable by other researchers.
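
To make the middle steps above concrete, here is a minimal sketch in base R of descriptive analysis followed by a simple hypothesis test. It is only an illustration: `mtcars` is a small dataset bundled with R that stands in for your own data, and comparing fuel efficiency by transmission type is an arbitrary example, not a suggested project.

```r
# Illustrative stand-in for your own dataset (mtcars ships with base R)
data(mtcars)

# Descriptive statistics for fuel efficiency (miles per gallon)
print(summary(mtcars$mpg))
print(sd(mtcars$mpg))

# Mean mpg by transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)

# A simple visualization of the two distributions
boxplot(mpg ~ am, data = mtcars,
        xlab = "Transmission (0 = automatic, 1 = manual)",
        ylab = "Miles per gallon")

# Welch two-sample t-test of the null hypothesis that mean mpg
# is equal across the two transmission types
test_result <- t.test(mpg ~ am, data = mtcars)
print(test_result)

# Record R and package versions alongside your results to support replication
print(sessionInfo())
```

Keeping the whole analysis in a single script like this, and sharing it with the data, is one straightforward way to let others re-run your work from start to finish.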

I strongly urge you to produce a project that will further your academic career outside of the class. There are many ways that this can happen. Some obvious options are to prepare a project that you can submit for publication, use as pilot analysis that you can report in a grant or thesis proposal, and/or use to fulfill a degree requirement.

There are several intermediate milestones, deliverables, and deadlines to help you accomplish a successful research project. Unless otherwise noted, all deliverables should be submitted via Canvas by 5pm CT on the day they are due.


Research project plan and dataset identification

Due date
October 9, 2020, 5pm CT
Maximum length
500 words (~1-2 pages)

Early on, I want you to identify and describe your final project. Your description should be short and can be either paragraphs or bullets. It should include the following:

  • An abstract of the proposed study including the topic, research question, theoretical motivation, object(s) of study, and anticipated research contribution.
  • An identification of the dataset you will use and a description of the rows and columns or type(s) of data it will include. If you do not currently have access to these data, explain why and when you will.
  • A short (several sentences?) description of how the project will fit into your career trajectory.


Notes on finding a dataset

In order to complete your final project, you will each need a dataset. If you already have a dataset for the project you plan to conduct, great! If not, fear not! There are many datasets to draw from. Some ideas are below (please suggest others, provide updated links, or report problems). The teaching team will also be available to help you brainstorm/find resources if needed:

  • Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use? Other relationships you could analyze?
  • If there's an important study you loved, you can send a polite email to the author(s) asking if they are willing and able to share an archival or replication version of the dataset used in their paper. Be very polite and make it clear that this is starting as a class project, but that it might turn into a paper for publication. Make your timeline clear. In Communication and HCI, replication datasets are still very rare, so be prepared for a negative answer and/or questions about your motives in conducting the analysis.
  • Do some Google Scholar and normal internet searching for datasets in your research area. You'll probably be surprised at what's available.
  • Take a look at datasets available in the Harvard Dataverse (a very large collection of social science research data) or one of the other members of the Dataverse network.
  • Look at the collection of social scientific datasets at ICPSR at the University of Michigan (NU is a member). There are an enormous number of very rich datasets.
  • Use the ISA Explorer to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
  • The City of Chicago has one of the best data portal sites of any municipality in the U.S. (and better than many federal agencies). There are also numerous administrative datasets released by other public entities (try searching!) that you might find inspiring.
  • FiveThirtyEight.com has published a GitHub repository and an R package with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.
  • If you are interested in studying online communities, there are some great resources for accessing data from Reddit, Wikipedia, and StackExchange. See pushshift for dumps of Reddit data, here for an overview of Wikipedia's data resources, and Stack Exchange's data portal.
  • The NY Times is publishing a COVID-19 data repository that includes county-level metrics for deaths, mask usage, and other pandemic-related data. They release a lot of it as frequently updated .csv files and the repository includes documentation of the measurements, data collection details, and more.
  • The Community Data Science Collective and colleagues have created a COVID-19 digital observatory (hosted in part right here on this wiki!) that publishes a bunch of pandemic-related data as csv and json files.
  • The Stanford Open Policing project has published a huge archive of policing data related mostly to traffic stops in states and many cities of the U.S. We'll use at least one of these files for a problem set.

Research project planning document

Due date
October 30, 2020, 5pm CT
Suggested length
~5 pages

The project planning document is a shell/outline of an empirical quantitative research paper. Your planning document should have the following sections: (a) Rationale; (b) Objectives; (b.1) General objectives; (b.2) Specific objectives; (c) (Null) hypotheses; (d) Conceptual diagram and explanation of the relationship(s) you plan to test; (e) Measures; (f) Dummy tables/figures; (g) Anticipated finding(s) and research contribution(s). Longer descriptions of each of these planning document sections (as well as a few others) can be found on this wiki page.

I will also provide three example planning documents via our Canvas site (links to-be-updated for 2020 edition of the course):

Research project presentation

Presentation due date
December 3, 2020, 5pm CT
Maximum length
10 minutes

You will also create and record a short presentation of your final project. The presentation will provide an opportunity to share a brief overview of your project and findings with the other members of the class. Since you will all give other research presentations throughout your career, I strongly encourage you to take the opportunity to refine your academic presentation skills. The document Creating a Successful Scholarly Presentation (file posted to Canvas) may be useful.

Additional details about the presentation goals, format suggestions, resources, and more will be provided later in the quarter.

Research project paper

Paper due date
December 10, 2020, 5pm CT
Maximum length
6000 words (~20 pages)

I expect you to produce a short, high quality research paper that you might revise, extend, and submit for publication and/or a dissertation milestone. I do not expect the paper to be ready for publication, but it should contain polished drafts of all the necessary components of a scholarly quantitative empirical research study. In terms of the structure, please see the page on the structure of a quantitative empirical research paper.

As noted above, you should also provide data, code, and any documentation sufficient to enable the replication of all analysis and visualizations. If that is not possible/appropriate for some reason, please talk to me so that we can find another solution.

Because the emphasis in this class is on statistics and methods and because I'm probably not an expert in the substance of your research domain, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is important. As a result, you need not focus on these elements of the work in your written submission. Instead, feel free to start with a brief summary of the purpose and importance of this research followed by an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts and they will not figure prominently in my assessment of the work.

I have a strong preference for you to write the paper individually, but I'm open to the idea that you may want to work with others in the class. Please contact me before you attempt to pursue a collaborative final paper.

I do not have strong preferences about the style or formatting guidelines you follow for the paper and its bibliography. However, your paper must follow a standard format (e.g., ACM SIGCHI CSCW format or APA 6th edition (Word and LaTeX templates)) that is applicable for a peer-reviewed journal or conference proceedings in which you might aim to publish the work (they all have formatting or submission guidelines published online and you should follow them). This includes the references. I also strongly recommend that you use reference management software like Zotero to handle your bibliographic sources.

Human subjects research, IRB, and ethics

In general, you are responsible for making sure that you're on the right side of the IRB requirements and that your work meets applicable ethical norms and standards.

Class projects generally do not need IRB approval, but research for publications, dissertations, and sometimes even pilot studies do fall under IRB purview. You should not plan to seek IRB approval/determination retroactively. If your study may involve human subjects and you may ever publish it in any form, you will need IRB oversight of some sort.

Secondary analysis of anonymized data is generally not considered human subjects research, but I strongly suggest that you get a determination from the Northwestern IRB before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need to list a faculty sponsor or Principal Investigator, that should ideally be your advisor. If that doesn't make sense for some reason, please talk to me.

Research ethics is a broad and complex topic. We'll talk about issues related to ethics and quantitative empirical research a bit more during class, but will likely only scratch the surface. I strongly encourage you to pursue further reading, conversation, coursework, and reflection as you consider how to understand and apply ethical principles in the context of your own research and teaching.

Grading and assessment

I will assign grades (usually a numeric value ranging from 0-10) for each of the following aspects of your performance. The percentage values in parentheses are weights that will be applied to calculate your overall grade for the course.

  • Weekly participation: 40%
  • Proposal identification: 5%
  • Final project planning document: 5%
  • Final project presentation: 10%
  • Final project paper: 40%
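
To make the arithmetic concrete, here is a minimal sketch of how the weights above could combine component grades into an overall score. It assumes a plain weighted sum of grades on the 0-10 scale; the component labels, the function name, and the example scores are hypothetical, and the syllabus does not say how the overall number maps onto a final letter grade.

```python
# Weights copied from the list above. The dictionary keys are hypothetical
# labels chosen for this illustration, not official names.
WEIGHTS = {
    "participation": 0.40,
    "proposal": 0.05,
    "planning_document": 0.05,
    "presentation": 0.10,
    "paper": 0.40,
}


def overall_grade(scores, weights=WEIGHTS):
    """Return the weighted sum of component scores (each on a 0-10 scale).

    Assumption: the overall grade is a simple weighted sum of the
    component grades; the actual calculation may differ.
    """
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted components")
    return sum(weights[name] * scores[name] for name in weights)


# Made-up scores for illustration only.
example_scores = {
    "participation": 9,
    "proposal": 10,
    "planning_document": 8,
    "presentation": 9,
    "paper": 8.5,
}

print(round(overall_grade(example_scores), 2))
```

Because weekly participation and the final paper each carry 40% of the weight, a one-point change in either moves the overall score by 0.4 points, while a one-point change in the proposal or planning document moves it by only 0.05.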

The teaching team will jointly and holistically evaluate your participation along four dimensions: attendance, preparation, engagement, and contribution. These are quite similar to the dimensions described in the "Participation Rubric" section of Benjamin Mako Hill's assessment page and Joseph Reagle's participation assessment rubric. Exceptional participation means excelling along all four dimensions. Please note that participation ≠ talking/typing more and I encourage all of us to seek balance in our discussions.

The teaching team's assessment of your final project proposal, planning document, presentation, and paper will reflect the clarity of the work, the effective execution and presentation of quantitative empirical analysis, as well as the quality and originality of the analysis. A more detailed assessment rubric will be provided. Throughout the quarter, we will talk about the qualities of exemplary quantitative research. In general, I expect your final project to embody these exemplary qualities.

Policies

General course policies

General policies on a wide variety of topics including classroom equity, attendance, academic integrity, accommodations, late assignments, and more are provided on Aaron's class policies page. Below are some policy statements specific to this course and quarter.

Teaching and learning in a pandemic

The Covid-19 pandemic will impact this course in various ways, some of them obvious and tangible and others harder to pin down. On the obvious and tangible front, we have things like a mix of remote and (a)synchronous instruction, the fact that many of us will not be anywhere near campus or each other this year, and the unusual academic calendar. These will reshape our collective "classroom" experience in major ways.

On the "harder to pin down" side, many of us may experience elevated levels of exhaustion, stress, uncertainty and/or distraction. We may need to provide unexpected support to family, friends, or others in our communities. I have personally experienced all of these things at various times over the past six months and I expect that some of you have too. It is a difficult time.

I believe it is important to acknowledge these realities of the situation and create the space to discuss and process them in the context of our class throughout the quarter. As your instructor and colleague, I commit to do my best to approach the course in an adaptive, generous, and empathetic way. I will try to be transparent and direct with you throughout—both with respect to the course material as well as the pandemic and the university's evolving response to it. I ask that you try to extend a similar attitude towards everyone in the course. When you have questions, feedback, or concerns, please try to share them in an appropriate way. If you require accommodations of any kind at any time (directly related to the pandemic or not), please contact the teaching team.

Expectations for synchronous remote sessions

The following are some baseline expectations for our synchronous remote class sessions. I expect that these can and will evolve. Please feel free to ask questions, suggest changes, or raise concerns during the quarter. I welcome all input.

  • All members of the class are expected to create a supportive and welcoming environment that is respectful of the conditions under which we are participating in this class.
  • All members of the class are expected to take reasonable steps to create an effective teaching/learning environment for themselves and others.

And here are suggested protocols for any video/audio portions of our class:

  • Please mute your microphone whenever you're not speaking and learn to use "push-to-talk" if/when possible.
  • Video is optional for all students at all times, although if you're willing/able to keep the instructor company in the video channel that would be nice.
  • If you need to excuse yourself at any time and for any reason you may do so.
  • Children, family, pets, roommates, and others with whom you may share your workspace are welcome to join our class as needed.

Syllabus revisions

This syllabus will be a dynamic document that will evolve throughout the quarter. Although the core expectations are fixed, the details will shift. As a result, please keep in mind the following:

  1. Assignments and readings are frozen 1 week before they are due. I will not add readings or assignments less than one week before they are due. If I forget to add something or fill in a "To Be Determined" less than one week before it's due, it is dropped. If you plan to read or work more than one week ahead, contact me first.
  2. Substantial changes to the syllabus or course materials will be announced. Please closely monitor your email and/or the announcements section on the course website on Canvas. When I make changes, these changes will be recorded in the edit history of this page so that you can track what has changed. I will also do my best to summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
  3. The course design may adapt throughout the quarter. As this is a new format for this course, I may iterate and prototype course design elements rapidly along the way. To this end, I will ask you for voluntary anonymous feedback — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback and I expect to do so again.

Statistics and power

The subject matter of this course—statistics and statistical programming—has historical and present-day affinities with a variety of oppressive ideologies and projects, including white supremacy, discrimination on the basis of gender and sexuality, state violence, genocide, and colonialism. It has also been used to challenge and undermine these projects in various ways. I will work throughout the quarter to acknowledge and represent these legacies accurately, at the same time as I also strive to advance equity, inclusion, and justice through my teaching practice, the selection of curricular materials, and the cultivation of an inclusive classroom environment. Please see my general classroom policies for more on some of these topics.

Schedule (with all the details)

When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y (inclusive) of chapter n.

Week 1 (9/17)

September 17: Intro and setup

Session plan

Note: Aaron doesn't actually expect you to complete these before class on September 17

Required

  • Read this syllabus, discuss any questions/concerns with the teaching team.
  • Complete pre-course assessment of statistical concepts (access code TBA via email). Estimated time to do this is 30-40 minutes. Submission deadline: September 18, 11:00pm Chicago time.
  • Confirm course registration and access to the textbook (pdf download available for $0 and b&w paperbacks for $20) as well as any software and web services you'll need for the course (Zoom, Discord, Canvas, this wiki, R, RStudio). Discord invites will be sent via email.
  • Complete problem set #0

Recommended

Week 2 (9/22, 9/24)

Session plans

September 22: Data and variables

Required

September 24: Numerical and categorical data

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §2.1-2 (Numerical and categorical data).
  • Review Lecture materials for §2.1 and §2.2 (Videos 6-7 in the playlist).
  • Complete exercises from OpenIntro §2: 2.12, 2.13, 2.16, 2.20, 2.23, 2.30 (and remember that solutions to odd-numbered problems are in the book!)
  • Submit, review, and respond to questions or requests for discussion via Discord or some other means.

Week 3 (9/29, 10/1)

Session plans

September 29: R fundamentals: Import, transform, tidy, and describe data

Required

Recommended

  • Week 3 R tutorial (note that you can access .rmd or .pdf versions by replacing the suffix of the URL accordingly).
  • Additional material from any of the recommended R learning resources suggested last week or elsewhere in the syllabus. In particular, you may find the ModernDive, RYouWithMe, Healy, and/or Wickham and Grolemund resources valuable.

October 1: Probability

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §3 (Probability).
  • Watch Probability introduction and Probability trees OpenIntro lectures (just videos 1 and 2 in the playlist).
  • Complete exercises from OpenIntro §3: 3.12, 3.15, 3.22, 3.28, 3.34, 3.38

Resources

Week 4 (10/6, 10/8)

Session plans

October 6: Emotional contagion and more advanced R fundamentals: import, tidy, transform, and simulate data; write functions

Required

Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111(24):8788–90. [Open access]

Recommended

October 8: Distributions

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §4.1-3 (Normal and binomial distributions).
  • Watch normal and binomial distributions OpenIntro lectures (videos 1-3 in the playlist).
  • Complete exercises from OpenIntro §4: 4.4, 4.6, 4.15, 4.22

Resources

October 9: Research project plan and dataset identification due by 5pm CT

  • Submit via Canvas (due by 5pm CT)

Week 5 (10/13, 10/15)

Session plans

October 13: Descriptive analysis and visualization of data

Required

Recommended

October 15: Foundations for (frequentist) inference

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §5 (Foundations for inference).
  • Watch foundations for inference (videos 1-3 in the playlist) OpenIntro lectures.
  • Complete Why .05? OpenIntro video/exercise.
  • Complete exercises from OpenIntro §5: 5.4, 5.8, 5.10, 5.17, 5.30, 5.35, 5.36

Resources

Week 6 (10/20, 10/22)

Session plans

October 20: Reinforced foundations for inference

Required

  • Complete problem set #4
  • Read Reinhart, §1.
  • Revisit the Kramer et al. (2014) paper we read a few weeks ago:
Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences 111(24):8788–90. [Open access]

October 22: Inference for categorical data

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §6 (Inference for categorical data).
  • Watch inference for categorical data (videos 1-3 in the playlist) OpenIntro lectures.
  • Complete exercises from OpenIntro §6: 6.10, 6.16, 6.22, 6.30, 6.40 (just parts a and b; part c gets tedious)

Resources

Week 7 (10/27, 10/29)

Session plans

October 27: Applied inference for categorical data

Required

  • Read Reinhart, §4 and §5 (both are quite short).
  • Skim the following (all are referenced in the problem set)
    • Aronow PM, Karlan D, Pinson LE. (2018). The effect of images of Michelle Obama’s face on trick-or-treaters’ dietary choices: A randomized control trial. PLoS ONE 13(1): e0189693. https://doi.org/10.1371/journal.pone.0189693
    • Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in Proceedings of the 8th ACM Conference on Designing Interactive Systems. Aarhus, Denmark: ACM. [PDF available on Hill's personal website]
    • Shaw, Aaron and Yochai Benkler. 2012. A tale of two blogospheres: Discursive practices on the left and right. American Behavioral Scientist. 56(4): 459-487. [available via NU libraries]
  • Complete problem set #5

Resources

October 29: Inference for numerical data (part 1)

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §7.1-3 (Inference for numerical data: differences of means).
  • Watch inference for numerical data (videos 1-4 in the playlist) OpenIntro lectures (featuring one of the textbook authors!).
  • Complete exercises from OpenIntro §7: 7.12, 7.24, 7.26

Resources

October 30: Research project planning document due 5pm CT

  • Submit via Canvas (due by 5pm CT)

Week 8 (11/3, 11/5)

November 3: U.S. election day (no class meeting)

November 4: Interactive self-assessment due

November 5: Inference for numerical data (part 2)

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §7.4-5 (Inference for numerical data: power calculations, ANOVA, and multiple comparisons).
  • Watch inference for numerical data (videos 4-8 in the playlist) OpenIntro lectures (featuring one of the textbook authors!).
  • Complete exercises from OpenIntro §7: 7.42, 7.44, 7.46

Resources

Week 9 (11/10, 11/12)

November 10: Applied inference for numerical data (t-tests, power analysis, ANOVA)

Session plans

Required

Resources

November 12: Linear regression

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §8 (Linear regression).
  • Watch linear regression (videos 1-4 in the playlist) OpenIntro lectures.
  • Read More inference for linear regression (OpenIntro supplement).
  • Complete exercises from OpenIntro §8: 8.6, 8.36, 8.40, 8.44
  • Complete exercises from OpenIntro supplement: 4 and 5 (answers provided in the supplement).

Resources

Week 10 (11/17, 11/19)

Session plans

November 17: Applied linear regression

Required

Resources

November 19: Multiple and logistic regression

Required

  • Read Diez, Çetinkaya-Rundel, and Barr: §9 (Multiple and logistic regression). (Skim §9.2-9.4)
    • Disclaimer: Aaron doesn't like §9.2-9.3, but it should be useful to understand and discuss them, so we'll do that.
  • Watch multiple and logistic regression (videos 1-4 in the playlist) OpenIntro lectures.
  • Read Interaction terms (OpenIntro supplement).
  • Read Fitting models for non-linear trends (OpenIntro supplement).
  • Complete exercises from OpenIntro §9: 9.4, 9.13, 9.16, 9.18

Resources

Week 11 (11/24)

November 24: Applied multiple and logistic regression

Session plans

Required

Resources

Week 12+

December 3: Research project presentation due by 5pm CT

Post your video via this "Discussion" on Canvas. Please view and provide constructive feedback on others' videos!

  • Post videos directly to the "Discussion." The Canvas text editor has an option to upload/record a video. That's what you want.
  • Please remember not to over-work/think this. I mentioned this in class, but just to reiterate, the focus of this assignment should not be your video editing skills. Please do what you can to record and convey your ideas clearly without devoting insane hours to creating the perfect video.
  • Some resources for recording presentations: There are a bunch of ways you might record/share your video. Some ideas include using the embedded media recorder in Canvas (!) that can record with your webcam (maybe attach a few visuals to accompany this?); recording a "meeting" with yourself in Zoom; and using "Panopto," a piece of high-end video recording, sharing, and editing software that NU licenses for campus use. Here are some pointers:
    • NU has a "digital learning resource hub" which provides some resources for students. The first item in that list has pointers for recording yourself and posting to Canvas and includes info about the Canvas media recorder and Panopto.
    • You should be able to use your NU zoom account to create a zoom meeting, record your meeting (in which you deliver your presentation and share your screen with any visuals), and then share a link to the recording via the "Recordings" item in the left-hand menu of your https://northwestern.zoom.us/ account page.
    • If nothing works, please get in touch.

December 4: Post-course assessment of statistical concepts due by 11pm CT

Complete post-course assessment (access code TBA via email). Submission deadline: December 4, 11:00pm Chicago time.

December 10: Research project paper due by 5pm CT

Submit your paper, data, and code via Canvas.

Credit and Notes

This syllabus has, in ways that should be obvious, borrowed and built on the OpenIntro Statistics curriculum. Most aspects of this course design extend Benjamin Mako Hill's COM 521 class from the University of Washington as well as a prior iteration of the same course offered at Northwestern in Spring 2019.