Editing Statistics and Statistical Programming (Fall 2020)
Latest revision | Your text | ||
Line 1: | Line 1: | ||
<div style="float:right;" | <div style="float:right;" class="toclimit-2">__TOC__</div> | ||
:'''Statistics and Statistical Programming''' | |||
:Media, Technology & Society (MTS) 525 | ::Media, Technology & Society (MTS) 525 | ||
:Tuesdays & Thursdays | ::Tuesdays & Thursdays 10-11:50am (via Zoom) | ||
: | ::Spring, 2019 | ||
:Northwestern University | ::Northwestern University | ||
:'''Instructor:''' [http://aaronshaw.org Aaron Shaw] ([mailto:aaronshaw@northwestern.edu aaronshaw@northwestern.edu]) | |||
: | ::Office Hours: <TBA> or by appointment | ||
: | ::<location tba> | ||
: | |||
:'''Teaching Assistant:''' <TBA> | |||
:Office Hours: | ::Office Hours: <tba> | ||
: | ::<location tba> | ||
:'''Course Websites''': | |||
: | :* We will use [https://canvas.northwestern.edu/courses/90927 Canvas] for [https://canvas.northwestern.edu/courses/90927/announcements announcements], [https://canvas.northwestern.edu/courses/90927/assignments turning in most assignments], and maybe [https://canvas.northwestern.edu/courses/90927/discussion_topics discussions]; the other possibility is [https://discord.com Discord]. | ||
: | :* Everything else will be linked on this page. | ||
== Overview and learning objectives == | |||
This course provides a get-your-hands-dirty introduction to inferential statistics and statistical programming mostly for applications in the social sciences and social computing. My main objectives are for all participants to acquire the conceptual, technical, and practical skills to conduct your own statistical analyses and become more sophisticated consumers of quantitative research in communication, human computer interaction (HCI), and adjacent disciplines. | This course provides a get-your-hands-dirty introduction to inferential statistics and statistical programming mostly for applications in the social sciences and social computing. My main objectives are for all participants to acquire the conceptual, technical, and practical skills to conduct your own statistical analyses and become more sophisticated consumers of quantitative research in communication, human computer interaction (HCI), and adjacent disciplines. | ||
Line 43: | Line 36: | ||
You are not required to know much about statistics or statistical programming to take this class. I will assume some (very little!) knowledge of the basics of empirical research methods and design, basic algebra and arithmetic, and a willingness to work to learn the rest. In general we are not going to cover most of the math behind the techniques we'll be learning. Although we may do some math, this is not a math class. This course will also not require knowledge of calculus or matrix algebra. I will *not* do proofs on the board. Instead, the class is unapologetically focused on the application of statistical methods. Likewise, while some exposure to R, other programming languages, or other statistical computing resources will be helpful, it is not assumed. | You are not required to know much about statistics or statistical programming to take this class. I will assume some (very little!) knowledge of the basics of empirical research methods and design, basic algebra and arithmetic, and a willingness to work to learn the rest. In general we are not going to cover most of the math behind the techniques we'll be learning. Although we may do some math, this is not a math class. This course will also not require knowledge of calculus or matrix algebra. I will *not* do proofs on the board. Instead, the class is unapologetically focused on the application of statistical methods. Likewise, while some exposure to R, other programming languages, or other statistical computing resources will be helpful, it is not assumed. | ||
== Why this course? Why statistical programming? Why R? == | |||
Many comparable courses in statistics and quantitative methods do not | Many comparable courses in statistics and quantitative methods do not focus on statistical programming and use easier-to-learn statistical software than R. So why bother? By learning statistical programming you will gain a deeper understanding of both the principles behind your analysis techniques as well as the tools you use to apply those techniques. In addition, a solid grasp of statistical programming will prepare you to create reproducible research, avoid common errors, and enable both greater durability and validity of your work. | ||
Other programming languages are also well suited to statistics, including Stata and Python. I | Other programming languages are also well suited to statistics, including Stata and Python. I am most comfortable and capable with R, so that guides my choice for the course. However, I like to use and teach with R for a few reasons: | ||
* R is freely available and open source. | * R is freely available and open source. | ||
* R is the most widely used package in statistics and | * R is becoming the most widely used package in statistics and many social science fields. | ||
* R (along with Stata) will be used in most of the advanced stats classes I hope you will take after this course. | * R (along with Stata) will be used in most of the advanced stats classes I hope you will take after this course. | ||
* R is a better general-purpose programming language than Stata, which means that R programming skills will let you solve non-statistical problems and may make it easier to learn other programming languages like Python. | * R is a better general-purpose programming language than software like Stata, which means that R programming skills will let you solve non-statistical problems and may make it easier to learn other programming languages like Python. | ||
=== | == A note about this syllabus == | ||
This | This syllabus will be a dynamic document that will evolve throughout the quarter. Although the core expectations are fixed, the details will shift. As a result, please keep in mind the following: | ||
# '''Assignments and readings are ''frozen'' 1 week before they are due.''' I will not add readings or assignments less than one week before they are due. If I forget to add something or fill in a "To Be Determined" less than one week before it's due, it is dropped. If you plan to read or work more than one week ahead, contact me first. | |||
# '''Substantial changes to the syllabus or course materials will be announced.''' Please closely monitor your email and/or [https://canvas.northwestern.edu the announcements section on the course website on Canvas]. When I make changes, these changes will be recorded in [https://wiki.communitydata.science/index.php?title=Statistics_and_Statistical_Programming_(Fall_2020)&action=history the edit history of this page] so that you can track what has changed. I will also do my best to summarize these changes in an announcement on Canvas that will be emailed to everybody in the class. | |||
# '''The course design may adapt throughout the quarter.''' As this is a new format for this course, I may iterate and prototype course design elements rapidly along the way. To this end, I will ask you for voluntary anonymous feedback — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback and I expect to do so again. | |||
== | == Books and resources == | ||
This class will use a freely-licensed textbook: | This class will use a freely-licensed textbook: | ||
Line 117: | Line 74: | ||
* Verzani, John. 2014. ''Using R for Introductory Statistics, Second Edition''. 2 edition. Boca Raton: Chapman and Hall/CRC. ([https://en.wikipedia.org/wiki/Special:BookSources/978-1-4665-9073-1 Various Sources]; [https://www.amazon.com/Using-Introductory-Statistics-Second-Chapman/dp/1466590734/ref=mt_hardcover?_encoding=UTF8&me= Amazon]) | * Verzani, John. 2014. ''Using R for Introductory Statistics, Second Edition''. 2 edition. Boca Raton: Chapman and Hall/CRC. ([https://en.wikipedia.org/wiki/Special:BookSources/978-1-4665-9073-1 Various Sources]; [https://www.amazon.com/Using-Introductory-Statistics-Second-Chapman/dp/1466590734/ref=mt_hardcover?_encoding=UTF8&me= Amazon]) | ||
* Wickham, Hadley. 2010. ''ggplot2: Elegant Graphics for Data Analysis''. 1st ed. 2009. Corr. 3rd printing 2010 edition. New York: Springer. ([https://link.springer.com/book/10.1007%2F978-3-319-24277-4 Springer/NU Libraries]; [https://en.wikipedia.org/wiki/Special:BookSources/978-0-596-80915-7 Various Sources]) | * Wickham, Hadley. 2010. ''ggplot2: Elegant Graphics for Data Analysis''. 1st ed. 2009. Corr. 3rd printing 2010 edition. New York: Springer. ([https://link.springer.com/book/10.1007%2F978-3-319-24277-4 Springer/NU Libraries]; [https://en.wikipedia.org/wiki/Special:BookSources/978-0-596-80915-7 Various Sources]) | ||
There are also some invaluable non-textbook resources: | There are also some invaluable non-textbook resources: | ||
Line 123: | Line 79: | ||
* [ftp://cran.r-project.org/pub/R/doc/contrib/Baggott-refcard-v2.pdf Baggott's R Reference Card v2] — Print this out. Take it with you everywhere and look at it dozens of times a day. You will learn the language faster! | * [ftp://cran.r-project.org/pub/R/doc/contrib/Baggott-refcard-v2.pdf Baggott's R Reference Card v2] — Print this out. Take it with you everywhere and look at it dozens of times a day. You will learn the language faster! | ||
* [https://stackoverflow.com/questions/tagged/r StackOverflow R Tag] — Somebody already had your question about how to do ''X'' in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise. | * [https://stackoverflow.com/questions/tagged/r StackOverflow R Tag] — Somebody already had your question about how to do ''X'' in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise. | ||
* [http://rseek.org/ Rseek] — Rseek is a modified version of Google that just | * [http://rseek.org/ Rseek] — Rseek is a modified version of Google that just searches R websites. Sometimes, R is hard to search for because R is a common letter. This has become much easier over time as R has become more popular, but it can still be a problem, and Rseek is a good solution. | ||
* [https://ggplot2.tidyverse.org/ ggplot2 documentation] — | * [https://ggplot2.tidyverse.org/ ggplot2 documentation] — ggplot2 is a powerful data visualization package for R that I recommend highly. The documentation is indispensable for learning how to use it. | ||
* [https://depts.washington.edu/ | * [https://depts.washington.edu/madlab/proj/Rstats/ Statistical Analysis and Reporting in R] — A set of resources created and distributed by Jacob Wobbrock (University of Washington, School of Information) in conjunction with a MOOC he teaches. Contains cheatsheets, code snippets, and data to help execute commonly encountered statistical procedures in R. | ||
* [https://www.datacamp.com DataCamp] offers introductory R courses. Northwestern usually has some free accounts that get passed out via Research Data Services each quarter. Apparently, if you are taking or teaching relevant coursework, instructors can [https://www.datacamp.com/groups/education request] free access to DataCamp for their courses from DataCamp. If folks are interested in this, I can reach out. | * [https://www.datacamp.com DataCamp] offers introductory R courses. Northwestern usually has some free accounts that get passed out via Research Data Services each quarter. Apparently, if you are taking or teaching relevant coursework, instructors can [https://www.datacamp.com/groups/education request] free access to DataCamp for their courses from DataCamp. If folks are interested in this, I can reach out. | ||
Line 131: | Line 87: | ||
* If you are planning to analyze large-scale data (i.e., data that won't fit in memory on your laptop) then you will want to sign up for a research allocation on Quest, which is Northwestern's high-performance computing cluster. Instructions on how to do that are [[Statistics_and_Statistical_Programming_(Spring_2019)/Quest_at_Northwestern|here]]. | * If you are planning to analyze large-scale data (i.e., data that won't fit in memory on your laptop) then you will want to sign up for a research allocation on Quest, which is Northwestern's high-performance computing cluster. Instructions on how to do that are [[Statistics_and_Statistical_Programming_(Spring_2019)/Quest_at_Northwestern|here]]. | ||
=== | == Assignments == | ||
The assignments in this class focus on applied statistical research design, analysis, and interpretation. There will be no graded exams or quizzes. Unless otherwise noted, all assignments are due at the end of the day (i.e., 11:59pm on the day they are due). | |||
=== Weekly problem sets and participation === | |||
Each week I will post a problem set. Some of these will be taken from the textbooks and some will not. They will include: | |||
* '''Statistics questions''' about statistical concepts, principles, and interpretation. | |||
* '''Programming challenges''' that you must solve using R. | |||
* '''Empirical paper questions''' about other assigned readings. | |||
You should submit your solutions to the programming challenges (feel free to submit the others if you like, but they're not required!) ahead of each class session. While I will not grade them, we will spend a good chunk of class going through the answers to the assignment due on that day. | |||
Because randomness is extremely important in statistics, I will use a small R program to '''randomly call on''' students to walk through your answer to statistics questions and empirical paper questions in class. We'll then discuss the answers, address points of confusion, and consider alternative approaches as a group. | |||
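The random cold-calling described above could be done along the following lines. This is only a minimal sketch, not the program actually used in class: the roster names and the <code>call_on()</code> helper are hypothetical.

```r
# Hypothetical class roster; a real one would come from the enrollment list.
roster <- c("Ada", "Grace", "Hadley", "Mine", "Florence")

# Draw n students uniformly at random, without replacement,
# so that nobody is called twice in the same round.
call_on <- function(students, n = 1) {
  stopifnot(n <= length(students))
  sample(students, size = n, replace = FALSE)
}

set.seed(525)            # fixed seed only so that this example is reproducible
call_on(roster, n = 2)   # two distinct names drawn from the roster
```

Sampling without replacement means that, within a single round, every student is equally likely to be picked and no one is picked twice.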
For the programming challenges, you should submit code for your solutions before class (more on how in a moment) so we can walk through the material together. If you get completely stuck on a problem, that's okay, but please share whatever code you have so that you can tell us what you did and what you were thinking. | |||
Coming to class will be profoundly important to learning the material and to your final grade. Although the problem sets will not be graded, it is critical that you be present and able to discuss your answers to each of the questions. Your ability to do so will figure prominently in your participation grade for the course (40% of your final grade). | |||
I strongly encourage you to form groups to work on the problem sets if you find that helpful; however, you must still submit your work individually and respond to my cold-call prompts in class individually to help ensure that you learn and understand the material. | |||
I evaluate participation along four dimensions: attendance, preparation, engagement, and contribution. These are quite similar to the dimensions described in the "Participation Rubric" section of [https://mako.cc/teaching/assessment.html Benjamin Mako Hill's assessment page] and [https://reagle.org/joseph/zwiki/Teaching/Assessment/Participation.html Joseph Reagle's participation assessment rubric]. Exceptional participation means excelling along all four dimensions. Please note that participation ≠ talking more and I encourage all of us to seek [https://reagle.org/joseph/zwiki/Teaching/Best_Practices/Learning/Balance_in_Discussion.html balance in our classroom discussions]. | |||
=== Research project | === Research project === | ||
As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all: | As a demonstration of your learning in this course, you will design and carry out a quantitative research project, start to finish. This means you will all: | ||
Line 159: | Line 118: | ||
* '''Find a dataset''' — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or an analysis of a previously collected dataset. | * '''Find a dataset''' — Very quickly, you should identify a dataset you will use to complete this project. For most of you, I suspect you will be engaging in secondary data analysis or an analysis of a previously collected dataset. | ||
* '''Engage in descriptive data analysis''' — Use R to calculate descriptive statistics and visualizations to describe your data. | * '''Engage in descriptive data analysis''' — Use R to calculate descriptive statistics and visualizations to describe your data. | ||
* '''Motivate and test at least one hypothesis about relationships between two or more variables''' | * '''Motivate and test at least one hypothesis about relationships between two or more variables''' | ||
* '''Report and interpret your findings''' — You will do this in both a short paper and a short | * '''Report and interpret your findings''' — You will do this in both a short paper and a short presentation. | ||
* '''Ensure that your work is replicable''' — You will need to provide code and data for your analysis in a way that makes your work replicable by other researchers. | * '''Ensure that your work is replicable''' — You will need to provide code and data for your analysis in a way that makes your work replicable by other researchers. | ||
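As a rough illustration of the descriptive step in the list above, the sketch below summarizes and plots a dataset using only base R. It uses <code>mtcars</code>, a dataset that ships with R, purely as a stand-in for whatever data you choose; the variables picked here are arbitrary.

```r
# mtcars ships with R; it stands in for your own project data here.
data(mtcars)

# Descriptive statistics for one continuous variable (miles per gallon).
mpg_summary <- c(
  n      = length(mtcars$mpg),
  mean   = mean(mtcars$mpg),
  median = median(mtcars$mpg),
  sd     = sd(mtcars$mpg)
)
print(round(mpg_summary, 2))

# Frequency table for a categorical variable (number of cylinders).
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Simple visualizations: a histogram and a bivariate scatterplot.
hist(mtcars$mpg, main = "Fuel efficiency", xlab = "Miles per gallon")
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Recording the R version and loaded packages helps others replicate the work.
sessionInfo()
```

Keeping every step in a script like this, rather than in point-and-click software, is one way to make the analysis replicable by other researchers.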
''I strongly urge you'' to produce a project that will further your academic career outside of the class. There are many ways that this can happen. Some obvious options are to prepare a project that you can submit for publication, use as pilot analysis that you can report in a grant or thesis proposal, and/or | ''I strongly urge you'' to produce a project that will further your academic career outside of the class. There are many ways that this can happen. Some obvious options are to prepare a project that you can submit for publication, use as pilot analysis that you can report in a grant or thesis proposal, and/or that fulfills a degree requirement. | ||
There are several intermediate milestones | There are several intermediate milestones and deadlines to help you accomplish a successful research project. Unless otherwise noted, all deliverables should be submitted via Canvas. | ||
==== Project plan and dataset identification ==== | |||
;Due date: Thursday, April 18, 2019 | |||
;Due date: | |||
;Maximum length: 500 words (~1-2 pages) | ;Maximum length: 500 words (~1-2 pages) | ||
Line 176: | Line 134: | ||
* An abstract of the proposed study including the topic, research question, theoretical motivation, object(s) of study, and anticipated research contribution. | * An abstract of the proposed study including the topic, research question, theoretical motivation, object(s) of study, and anticipated research contribution. | ||
* An identification of the dataset you will use and a description of the | * An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain why and when you will. | ||
* A short (several sentences?) description of how the project will fit into your career trajectory. | * A short (several sentences?) description of how the project will fit into your career trajectory. | ||
==== Project planning document ==== | |||
;Due date: Thursday, May 16, 2019 | |||
;Maximum length: ~5 pages | |||
The project planning document is a basic shell/outline of an empirical quantitative research paper. Your planning document should have the following sections: (a) Rationale; (b) Objectives; (b.1) General objectives; (b.2) Specific objectives; (c) (Null) hypotheses; (d) Conceptual diagram and explanation of the relationship(s) you plan to test; (e) Measures; (f) Dummy tables/figures; (g) Anticipated finding(s) and research contribution(s). Longer descriptions of each of these planning document sections (as well as a few others) can be found [[CommunityData:Planning document|on this wiki page]]. | |||
I have also provided three example planning documents via our Canvas site: | |||
* [https://canvas.northwestern.edu/files/6908602/download?download_frd=1 One by public health researcher Mika Matsuzaki]. The first planning document I ever saw and still one of the best. It's missing a measures section. It's also focused on a research context that is probably very different from yours, but try not to get bogged down by that and imagine how you might map the structure of the document to your own work. | |||
* [https://canvas.northwestern.edu/files/6919735/download?download_frd=1 One by Jim Maddock] created as part of a qualifying exam earlier in 2019. Jim doesn't provide dummy tables or anticipated findings/contributions, but he has an especially phenomenal explanation of the conceptual relationships and processes he wants to test. | |||
* [https://canvas.northwestern.edu/files/6908606/download?download_frd=1 One provided as an appendix to Gerber and Green's excellent textbook, ''Field Experiments: Design, Analysis, and Interpretation'' (FEDAI)]. It's over-detailed and incredibly long for our purposes, but nevertheless an exemplary approach to planning empirical quantitative research in a careful, intentional way that is worthy of imitation. | |||
==== Project presentation and paper ==== | |||
;Paper due date: Monday, June 10, 2019 | |||
;Maximum length: 6000 words (~20 pages) | |||
;Presentation due date: Thursday, May 30 or Thursday, June 6, 2019 | |||
;Maximum length: 8 minutes | |||
''The paper:'' Ideally, I expect you to produce a high quality short research paper that you might revise and submit for publication and/or a dissertation milestone. I do not expect the paper to be ready for publication, but it should contain polished drafts of all the necessary components of a scholarly quantitative empirical research study. In terms of the structure, please see the page on the [[structure of a quantitative empirical research paper]]. | |||
As noted above, you should also provide data, code, and any documentation sufficient to enable the replication of all analysis and visualizations. If that is not possible/appropriate for some reason, please talk to me so that we can find another solution. | |||
Because the emphasis in this class is on statistics and methods and because I'm not an expert in each of your fields, I'm happy to assume that your paper, proposal, or thesis chapter has already established the relevance and significance of your study and has a comprehensive literature review, well-grounded conceptual approach, and compelling reason why this research is important. As a result, you need not focus on these elements of the work in your written submission. Instead, feel free to start with a brief summary of the purpose and importance of this research followed by an introduction of your research questions or hypotheses. If you provide more detail, that's fine, but I won't give you detailed feedback on these parts and they will not figure prominently in my assessment of the work. | |||
I have a strong preference for you to write the paper individually, but I'm open to the idea that you may want to work with others in the class. Please contact me ''before'' you attempt to pursue a collaborative final paper. | |||
I do not have strong preferences about the style or formatting guidelines you follow for the paper and its bibliography. However, ''your paper must follow a standard format'' (e.g., [https://cscw.acm.org/2019/submit-papers.html ACM SIGCHI CSCW format] or [https://www.apastyle.org/index APA 6th edition] ([https://templates.office.com/en-us/APA-style-report-6th-edition-TM03982351 Word] and [https://www.overleaf.com/latex/templates/sample-apa-paper/fswjbwygndyq LaTeX] templates)) that is applicable for a peer-reviewed journal or conference proceedings in which you aim to publish the work (they all have formatting or submission guidelines published online and you should follow them). This includes the references. I also strongly recommend that you use reference management software to handle your bibliographic sources. | |||
'' [[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|The presentation:]]'' The presentation will provide an opportunity to share a brief summary of your project and findings with the other members of the class. Since you will all give other research presentations throughout your career, I strongly encourage you to take the opportunity to refine your academic presentation skills. The document [https://canvas.northwestern.edu Creating a Successful Scholarly Presentation] (file will be posted to Canvas) may be useful. | |||
: More details about the presentation goals, format suggestions, and more are available [[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|on this page]] | |||
=== Grading === | |||
I will assign grades (usually a numeric value ranging from 0-10) for each of the following aspects of your performance. The percentage values in parentheses are weights that will be applied to calculate your overall grade for the course. | |||
* Participation: 40% | |||
* Proposal identification: 5% | |||
* Final project planning document: 5% | |||
* Final project presentation: 10% | |||
* Final project paper: 40% | |||
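To see how the weights above combine, here is a small sketch that assumes the overall grade is a plain weighted mean of the component scores. The scores are made up for illustration, and the variable names are hypothetical.

```r
# Weights from the grading scheme above (they sum to 1).
weights <- c(
  participation = 0.40,
  proposal      = 0.05,
  planning_doc  = 0.05,
  presentation  = 0.10,
  paper         = 0.40
)

# Hypothetical component scores on the 0-10 scale, for illustration only.
scores <- c(
  participation = 9,
  proposal      = 8,
  planning_doc  = 10,
  presentation  = 7,
  paper         = 8.5
)

# Assumed rule: the overall grade is the weighted mean of the component scores.
# Indexing by names(weights) keeps scores and weights aligned.
overall <- weighted.mean(scores[names(weights)], w = weights)
overall
```

Because participation and the final paper each carry 40% of the weight, those two components dominate the result.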
My assessment of your paper will reflect the clarity of the written work, the effective execution and presentation of quantitative empirical analysis, as well as the quality and originality of the analysis. Throughout the quarter, we will talk a lot about the qualities of exemplary quantitative research. I expect your final project to embody these exemplary qualities. | |||
In order to complete your | == Note on finding a dataset == | ||
In order to complete your project, you will each need a dataset. If you already have a dataset for the project you plan to conduct, great! If not, there are many datasets to draw from. Some ideas are below. Jeremy and Aaron will also be available to help you brainstorm/find resources if needed: | |||
* Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use? Other relationships you could analyze? | * Ask your advisor for a dataset they have collected and used in previous papers. Are there other variables you could use? Other relationships you could analyze? | ||
Line 191: | Line 195: | ||
* Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences. | * Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences. | ||
* The City of Chicago has one of the best [https://data.cityofchicago.org/ data portal sites] of any municipality in the U.S. (and better than many federal agencies). There are also numerous administrative datasets released by other public entities (try searching!) that you might find inspiring. | * The City of Chicago has one of the best [https://data.cityofchicago.org/ data portal sites] of any municipality in the U.S. (and better than many federal agencies). There are also numerous administrative datasets released by other public entities (try searching!) that you might find inspiring. | ||
* [http://fivethirtyeight.com FiveThirtyEight.com] has published a [https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html GitHub repository and an R package] with pre-processed and cleaned versions of many of the datasets they use for articles published on their website. | <!--- | ||
* <TODO fix/update accordingly> Set up a meeting with Jennifer Muilenburg — Data Curriculum and Communications Librarian who runs [https://www.lib.washington.edu/digitalscholarship/services/data research data services at the UW libraries]. Her email is: libdata@uw.edu I've have talked to her about this course and she is excited about meeting with you to help. | |||
--> | |||
* [http://fivethirtyeight.com FiveThirtyEight.com] has published a [https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html GitHub repository and an R package] with pre-processed and cleaned versions of many of the datasets they use for articles published on their website. | |||
=== Human subjects research, IRB, and ethics === | |||
In general, you are responsible for making sure that you're on the right side of the IRB requirements and that your work meets applicable ethical norms and standards. | |||
Class projects generally do not need IRB approval, but research for publications, dissertations, and sometimes even pilot studies generally fall under IRB purview. You should ''not'' plan to seek IRB approval/determination retroactively. If your study may involve human subjects and you may ever publish it in any form, you will need IRB oversight of some sort. | |||
Secondary analysis of anonymized data is generally not considered human subjects research, but I strongly suggest that you get a determination from [https://irb.northwestern.edu/ the Northwestern IRB] before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need to list a faculty sponsor or Principal Investigator, that should ideally be your advisor. If that doesn't make sense for some reason, please talk to me. | |||
== Structure of Class == | |||
I | I expect everybody to come to class, every week, with a laptop and a power cord, ready to answer any question on the problem set and having uploaded code related to the programming questions. The class is listed as nearly 3 hours long and, with the exception of short breaks, I intend to use the entire period. Please be in class on time, plugged in, and ready to go. | ||
When it comes to the statistics material, this will mostly be a so-called "flipped" classroom. This means we will rely on the textbook and other resources to introduce the material and we will use the class sessions to discuss questions as they come up. | |||
The problem sets each week will | |||
Although the day-to-day routine will vary, each class session will generally include the following: | |||
* Quick updates about assignments, projects, and meta-discussion about the class. | |||
- | * Discussion of '''programming challenges''' due that day (and related to the previous week's R lecture materials). | ||
* Discussion of '''statistics questions''' related to new material in Diez, Barr, and Çetinkaya-Rundel. | |||
* Discussion of any exemplary empirical paper we have read and the '''empirical paper questions'''. | |||
== Schedule == | |||
When reading the schedule below, the following key might help resolve ambiguity: §n denotes chapter n; §n.x denotes section x of chapter n; §n.x-y denotes sections x through y of chapter n. | |||
=== Week 1: Thursday April 4: Introduction, Setup, and Data and Variables === | |||
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 1]] | |||
Please complete the readings and assignment prior to class so that we can discuss them and start talking through some of the examples in R together. | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data) | |||
* Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks. ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Open Access]] | |||
'''Recommended Readings:''' | |||
* Verzani: §1 (Getting Started), §2 (Univariate data) [[https://canvas.northwestern.edu/verzani_ch1-ch2.pdf Available via Canvas]] | |||
* Verzani: §A (Programming) | |||
* Healy: §2 (and skim the prefatory material as well as §1)
'''Assignment (Complete before class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 1]] | |||
'''Lectures:''' | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w01-R_lecture.zip Week 1 R lecture materials] (.zip file) | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w01-s01-intro.webm Week 1 screencast (part 1, 23 minutes)] (the video should load directly in browser window) | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w01-s02-intro.webm Week 1 screencast (part 2, 27 minutes)] | |||
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_01&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §1 Lecture Notes] | |||
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including some for §1 | |||
=== Week 2: Thursday April 11: Probability and Visualization ===
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 2]] | |||
* Questions? Topics you'd like to discuss? Add them to the [https://canvas.northwestern.edu/courses/90927/discussion_topics/601700 Canvas discussion] for this week's material. | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §2 (Probability) | |||
* Shaw, Aaron and Yochai Benkler. 2012. A tale of two blogospheres: Discursive practices on the left and right. ''American Behavioral Scientist''. 56(4): 459-487. [[https://doi.org/10.1177%2F0002764211433793 available via NU libraries]] | |||
'''Recommended Readings:''' | |||
* Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics) <!---[[https://faculty.washington.edu/makohill/com521/verzani-usingr-ch3.1-2_ch4_ch5.pdf Available with UW NetID]]---> | |||
* [https://seeing-theory.brown.edu/ Seeing Theory] §1 (Basic Probability) and §2 (Compound Probability). (Note: this site provides a beautiful visual introduction to core concepts in probability and statistics). | |||
<!--- | |||
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]] | |||
---> | |||
* Healy: §3. | |||
'''Assignment (Complete Before Class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 2]] | |||
'''Lectures:''' | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w02-R_lecture.Rmd Week 2 R lecture materials] (.Rmd file) | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w02.webm Week 2 screencast (17 minutes)] | |||
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_02&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §2 Lecture Notes] | |||
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 short videos for §2
=== Week 3: Thursday April 18: Distributions ===
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 3]] | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4. You should also read the rest of the chapter (§3.3 and §3.5). I won't assign problem set questions about it, but it is still important to be familiar with.
'''Recommended Readings:''' | |||
* Verzani: §6 (Populations) | |||
* [https://seeing-theory.brown.edu/ Seeing Theory] §3 (Probability Distributions). | |||
'''Assignment (Complete Before Class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 3]] | |||
'''Lectures:''' | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w03-R_lecture.Rmd Week 3 R lecture materials] (.Rmd file) | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/screencasts/w03.webm Week 3 screencast (19 minutes)]
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_03&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §3 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 videos for §3.1 and §3.2 | |||
=== Week 4: Thursday April 25: Statistical significance and hypothesis testing === | |||
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 4]] | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference) | |||
'''Recommended Readings:''' | |||
* Verzani: §7 (Statistical inference), §8 (Confidence intervals) | |||
* [https://seeing-theory.brown.edu/ Seeing Theory] §4 (Frequentist Inference) | |||
'''Assignment (Complete Before Class):''' | |||
* [https://docs.google.com/forms/d/e/1FAIpQLScMkAPwWQUjB4C5wtbkemkNZYjNl3ipO4Dg5wsORFmdfduEtA/viewform?usp=sf_link Mid-quarter course evaluation survey] (by Monday please!) | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 4]] | |||
'''Lectures:''' | |||
*[https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w04-R_lecture.Rmd Week 4 R lecture materials] (.Rmd file) | |||
*(No screencast for this week) | |||
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_04&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §4 Lecture Notes] | |||
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 7 videos for nearly all of §4 | |||
=== Week 5: Thursday May 2: Continuous Numeric Data & ANOVA === | |||
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 5|Session plan]] | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data) | |||
<!---* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF from Hill's website]]---> | |||
* Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. ''Public Relations Review'', 33(3), 340–342. [[https://doi.org/10.1016/j.pubrev.2007.05.016 Available through NU Libraries]]
* Reinhart, §1
'''Recommended Readings:''' | |||
* Verzani: §9 (significance tests), §12 (Analysis of variance) | |||
* Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” ''The American Statistician'' 60(4):328–31. [[http://dx.doi.org/10.1198/000313006X152649 Available through NU Libraries]] | |||
'''Assignment (Complete Before Class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 5]] | |||
'''Lectures:'''
* No new R material for this week.
<!---
* [[Statistics and Statistical Programming (Spring 2019)/R lecture outline: Week 5]] | |||
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-ttests_and_anova.ogv Week 5 R lecture screencast: t-tests] (~22 minutes)
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-for_if.ogv Week 5 R lecture screencast: for loops and if statements] (~12 minutes)
--->
'''Resources:'''
* [https://www.openintro.org/download.php?file=os3_slides_05&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §5 Lecture Notes]
=== Week 6: Thursday May 9: Categorical data ===
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 6|Session plan]] | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §6.1-6.4 (Inference for categorical data). | |||
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on Hill's personal website]] | |||
* Reinhart, §4 and §5. | |||
'''Recommended Readings:'''
* Diez, Barr, and Çetinkaya-Rundel: §6.5-6.6 (Small samples and randomization inference) | |||
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit) | |||
* Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” ''American Scientist'' 102(6):460. [[https://www.americanscientist.org/issues/pub/2014/6/the-statistical-crisis-in-science/1 Available through NU Libraries]] (This is a reworked version of [http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf this unpublished manuscript], which provides more detailed examples.)
'''Assignment (Complete Before Class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 6]] | |||
'''Lectures:''' | |||
*[https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w06-R_lecture.Rmd Week 6 R lecture materials] (.Rmd file) | |||
*(No screencast for this week) | |||
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_06&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §6 Lecture Notes] | |||
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7 | |||
=== Week 7: Thursday May 16: Linear Regression === | |||
* [[Statistics and Statistical Programming (Spring 2019)/Session plan: Week 7|Session plan]] | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression) | |||
* OpenIntro eschews a mathematical approach to correlation. Look over [https://en.wikipedia.org/wiki/Correlation_and_dependence the Wikipedia article on correlation and dependence] and pay attention to the formulas. It's tedious to compute, but you should be aware of what goes into it. | |||
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available via NU libraries]] | |||
'''Recommended Readings:''' | |||
* Verzani: §11.1-2 (Linear regression). | |||
* [https://seeing-theory.brown.edu/ Seeing Theory] §5 (Regression Analysis) | |||
'''Assignment (Complete Before Class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 7]] | |||
* Final project planning document (see details above!) | |||
'''Lectures:''' | |||
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w07-R_lecture.Rmd Week 7 R lecture materials] | |||
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_07&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §7 Lecture Notes] | |||
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes] | |||
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7 and 3 videos on the sections §8.1-8.3 | |||
=== Week 8: Thursday May 23: Polynomial Terms, Interactions, and Logistic Regression === | |||
* [[Statistics_and_Statistical_Programming_(Spring_2019)/Session plan: Week 8|Session plan]] | |||
'''Required Readings:''' | |||
* Diez, Barr, and Çetinkaya-Rundel: §8 (Multiple and logistic regression) | |||
* [https://onlinecourses.science.psu.edu/stat501/node/301 Lesson 8: Categorical Predictors] and [https://onlinecourses.science.psu.edu/stat501/node/318 Lesson 9: Data Transformations] from the PennState Eberly College of Science STAT 501 Regression Methods Course. There are several subparts (many quite short), please read them all carefully. | |||
* (Revisit) Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available via NU libraries]] | |||
* Reinhart, §8 and §9. | |||
'''Recommended Readings:'''
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” ''PLoS Medicine'' 2(8):e124. [[http://dx.doi.org/10.1371%2Fjournal.pmed.0020124 Open Access]] | |||
* Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” ''PLOS Biology'' 13(3):e1002106. [[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 Open Access]] | |||
'''Assignment (Complete Before Class):''' | |||
* [[Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 8]] | |||
'''Lectures:'''
*[https://communitydata.science/~ads/teaching/2019/stats/r_lectures/w08-R_lecture.Rmd Week 8 R lecture materials]
'''Resources:''' | |||
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including a video on §8.4
* Mako Hill wrote this document which will likely be useful for many of you: [https://communitydata.cc/~mako/2017-COM521/logistic_regression_interpretation.html Interpreting Logistic Regression Coefficients with Examples in R] | |||
=== Week 9: Thursday May 30: Loose ends and Final Presentations (part 1) ===
* [[Statistics_and_Statistical_Programming_(Spring_2019)/Session plan: Week 9|Session plan]] | |||
'''Required readings:'''
* Reinhart, §10 and §11.
'''[[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|Final presentations]]: (part 1)''' | |||
* First batch today. The rest next week.
'''Resources:'''
* [https://communitydata.cc/~ads/teaching/2019/stats/r_lectures/w09-R_lecture.html Week 9 R-lecture] (we will use this in class)
=== Week 10: Thursday June 6: Fully reproducible research example, Replications, Final Presentations (part 2), and wrap-up ===
* Fully [https://www.overleaf.com/read/tkdpdcspwtkp reproducible research example]. | |||
* [https://canvas.northwestern.edu/courses/90927/files/folder/resources/Straub-Cook%20Replication Research replication study] by Polly Straub-Cook (UW Comm. Ph.D. student) | |||
:: (n.b.: cluster & heteroscedasticity robust standard errors!) | |||
* '''[[Statistics_and_Statistical_Programming_(Spring_2019)/Final_project_presentations|Final presentations]]: (part 2)'''
:: Second batch of presenters today.
* Closing thoughts | |||
:: What next? Beyond your final projects... | |||
:: Class social gathering | |||
Followed by much rejoicing! | |||
== Policies ==
=== Attendance === | |||
Attendance in class is expected of all participants. If you need to miss class for any reason, please contact me ahead of time (email is best). Multiple unexplained absences will likely result in a lower grade or (in extreme circumstances) a failing grade. In the event of an absence, you are responsible for obtaining class notes, handouts, assignments, etc. You are also still responsible for turning in any assignments on time unless you make prior arrangements with me. | |||
=== In-class device usage ===
Please refrain from any uses of digitally networked devices or other distraction machines that do not directly contribute to your engagement with the course material. If you struggle to comply with this policy, I may recommend you temporarily put away your device(s) or leave the classroom. | |||
=== Peers’ Work and In-Class Discussions ===
Throughout the course, you may receive, read, collaborate, and/or comment on classmates’ work. These assignments are for class use only. You may not share them with anybody outside of class without explicit written permission from the document’s author and pertaining to the specific piece. | |||
It is essential to the success of this class that all participants feel comfortable discussing questions, thoughts, ideas, fears, reservations, apprehensions and confusion about works-in-progress, statistical concepts, independent research, and more. Therefore, you may not create any audio or video recordings during class time nor share verbatim comments with those not in class nor are you allowed to share using other methods -- e.g., social media -- any comments linked to people’s identities unless you get clear and explicit permission. If you want to share general impressions or specifics of in-class discussions with those not in class, please do so without disclosing personal identities or details. | |||
=== Academic Integrity === | |||
You are responsible for reading and abiding by the Northwestern University [https://www.northwestern.edu/provost/policies/academic-integrity/principles.html Principles Regarding Academic Integrity]. Personally, I expect you to exceed the minimal standards elaborated in those principles and to strive for admirable, extraordinary conduct in every aspect of your academic career. Feel free to ask me (the instructor) for clarification about this or related matters. | |||
=== Deadlines ===
Emergencies happen. Unanticipated obstacles arise. If you cannot make a deadline, please contact me to figure out a schedule that will work. The more proactive and responsible you are, the more receptive I am likely to be.
A word about extensions and incompletes: I strongly discourage them. In principle, I have no problem with extensions or incompletes. In practice, they tend to be a pain for everybody involved. If you absolutely must submit an assignment late, assume that I may require up to 1 month (4 weeks) to grade it. Please take this into account if you will need me to submit a grade in order to receive your fellowship/diploma/visa/etc. by a particular date.
=== Accommodations === | |||
I am totally happy to provide accommodations for religious observance, physical needs, or other circumstances as needed. Any student requesting accommodations related to a disability or other condition is required to register with AccessibleNU (847-467-5530) and provide professors with an accommodation notification from AccessibleNU, preferably within the first two weeks of class. All information will remain confidential. For more information, visit [https://www.northwestern.edu/accessiblenu/ AccessibleNU]. | |||
=== Sexual Misconduct ===
All participants in this class are bound by the [https://www.northwestern.edu/sexual-misconduct/title-IX/university-policies/policy-on-sexual-misconduct.html Northwestern University sexual misconduct policy]. Please note that the core of the policy states, "Northwestern is committed to fostering an environment in which all members of our community are safe, secure, and free from sexual misconduct of any form, including, but not limited to, sexual assault, sexual exploitation, stalking, and dating and domestic violence." I take this very seriously. Please review the policy and speak to me if you have any questions or concerns.
=== Email protocol === | |||
I receive too much email and I sometimes fail to keep up. If, for some reason, I do not respond to a message related to this course within 48 hours, please do not take it personally and feel free to re-send the message with a polite reminder. This will help me and I will not resent you for it. | |||
=== Credit and Notes ===
This syllabus has, in ways that should be obvious, borrowed and built on the [https://www.openintro.org/stat/index.php OpenIntro Statistics curriculum]. I also based nearly every aspect of the course design on Benjamin Mako Hill's [[Statistics_and_Statistical_Programming_(Winter_2017)|COM 521 class]].