Editing Statistics and Statistical Programming (Winter 2017)

From CommunityData

Latest revision Your text
Line 5: Line 5:
:'''Instructor:''' [http://mako.cc/academic/ Benjamin Mako Hill] ([http://www.com.washington.edu/hill/ University of Washington])
:'''Instructor:''' [http://mako.cc/academic/ Benjamin Mako Hill] ([http://www.com.washington.edu/hill/ University of Washington])
:'''Course Websites''':
:'''Course Websites''':
:* We will use Canvas for [https://canvas.uw.edu/courses/1098035/announcements announcements], [https://canvas.uw.edu/courses/1098035/assignments turning in assignments], and [https://canvas.uw.edu/courses/1098035/discussion_topics discussion] (if you choose to use them)
:* We will use Canvas for [https://canvas.uw.edu/courses/1124086/announcements announcements], [https://canvas.uw.edu/courses/1124086/assignments turning in assignments], and [https://canvas.uw.edu/courses/1124086/discussion_topics discussion] (if you choose to use them)
:* Everything else will be linked on this page.
:* Everything else will be linked on this page.
:* [[Statistics and Statistical Programming (Winter 2017)/List of student git repositories]]
:'''Course Catalog Description:[https://www.washington.edu/students/crscat/com.html#com521]'''
:'''Course Catalog Description:[https://www.washington.edu/students/crscat/com.html#com521]'''


Line 14: Line 13:
== Overview and Learning Objectives ==
== Overview and Learning Objectives ==


This course is the second course in a two-quarter quantitative methods sequence in the University of Washington's Department of Communication MA/PhD program. The first course (COM 520) is an introduction to quantitative social science in communication and focuses primarily on what you might think of as the "soft skills" associated with doing social science: the conceptualization, operationalization of quantifiable variables, and the design of quantitative analyses. That course introduces some univariate and bivariate statistics at the end and briefly touches on linear regression. That said, all of the statistical work in that course is done using the tools that students already know (e.g. with spreadsheet software like LibreOffice, Google Sheets or Microsoft Excel). This class assumes that students have taken COM 520 and that they understand what is involved in describing and testing social scientific theories with data and that basic terminology of quantitative social science is going to be familiar.
This course is the second course in a two-quarter quantitative methods sequence in the University of Washington's Department of Communication MA/PhD program. The first course (COM 520) is an introduction to quantitative social science in communication and focuses primarily on what you might think of as the "soft skills" associated with doing social science: the conceptualization, operationalization of quantifiable variables and the design of quantitative analyses. That course introduces some univariate and bivariate statistics at the end and briefly touches on linear regression. That said, all of the statistical work in that course is done using the tools that students already know (e.g. with spreadsheet software like LibreOffice, Google Sheets or Microsoft Excel). This class assumes that students have taken COM 520 and that they understand what is involved in describing and testing social scientific theories with data and that basic terminology of quantitative social science is going to be familiar.


This course (COM 521) is focused on technical skill-building and aims to be a get-your-hands-dirty introduction to statistics and statistical programming. The point of the course is to give you the mathematical and technical tools to carry out your own statistical analyses. Through the process, we're going to try to help you become more sophisticated consumers of quantitative research.
This course (COM 521) is focused on technical skill-building and aims to be a get-your-hands-dirty introduction to statistics and statistical programming. The point of the course is to give you the mathematical and technical tools to carry out your own statistical analyses. Through the process, we're going to try to help you become more sophisticated consumers of quantitative research.
Line 20: Line 19:
Although we'll be doing some math in the course, this is not a math class. I am going to assume you're familiar with basic algebra and arithmetic. This course will not require knowledge of calculus. In general we're not going to cover the math behind the techniques we'll be covering. Unlike many statistics classes, I'm definitely not going to be doing proofs on the board.  Instead, the class is unapologetically focused on ''the application of statistical methodology''. In that sense, the goal of this course is to create ''informed consumers'' of quantitative methodology, not producers of new types of methods. My goal is to train producers of social scientific research that use statistics as a means toward an end.
Although we'll be doing some math in the course, this is not a math class. I am going to assume you're familiar with basic algebra and arithmetic. This course will not require knowledge of calculus. In general we're not going to cover the math behind the techniques we'll be covering. Unlike many statistics classes, I'm definitely not going to be doing proofs on the board.  Instead, the class is unapologetically focused on ''the application of statistical methodology''. In that sense, the goal of this course is to create ''informed consumers'' of quantitative methodology, not producers of new types of methods. My goal is to train producers of social scientific research that use statistics as a means toward an end.


This course does not seek to be the last stats class you take. I started grad school having not taken a math class since high school (basically) and took 12 different statistics and math courses over the course of my time in graduate school. Honestly, I wish I had done more. What this class seeks to do is give you a solid basis on which to build statistical knowledge. Anyone who finishes this class should feel comfortable moving on to take advanced classes in CSSS (classes above 510 on [https://www.csss.washington.edu/academics/courses this list]) and to start building toward a [https://www.csss.washington.edu/academics/phd-tracks/communication Statistics Concentration in the Department of Communication MA/PhD Program] or a [https://www.csss.washington.edu/academics/phd-tracks similar CSSS certificate/track] in another department.
This course does not seek to be the last stats class you take. I started grad school having not taken a math class since high school (basically) and took 12 different statistics and math courses over the course of my time in graduate school. Honestly, I wish I had done more. What this class seeks to do is give you a solid basis on which to build statistical knowledge. Anyone who finishes this class should feel comfortable moving on to take advanced classes in CSSS and to start building toward a Concentration in Statistics in Communication certificate.


We'll cover these basic statistical techniques: t-tests; chi-squared tests; ANOVA, MANOVA, and related methods; linear regression; and end with logistic regression.
We'll cover these basic statistical techniques: t-tests; chi-squared tests; ANOVA, MANOVA, and related methods; linear regression; and end with logistic regression.
Line 30: Line 29:
* Feel comfortable reading papers that use basic statistical techniques.
* Feel comfortable reading papers that use basic statistical techniques.
* Feel comfortable and prepared enrolling in future statistics courses in CSSS.
* Feel comfortable and prepared enrolling in future statistics courses in CSSS.
== Why Statistical Programming? ==
This class will focus much more on statistical programming in R than most similar classes. Most similar classes in communication will focus on using an easier to use statistical package like SPSS.
We're focusing on programming instead of a package like SPSS for several reasons:
* Students who understand a programming language won't be limited to the "canned" functions in the off-the-shelf packages.
* Pedagogically, programming supports students in building a deeper understanding of the mathematics and assumptions behind the canned functions by both allowing them to read the code "behind" the canned functions and by allowing the students to implement the functions themselves in assignments.
* Analyses composed of code instead of clicks support reproducible analyses that can document every step of the process of an analysis, including data cleaning and conversion, where errors are common and very difficult to detect.
* Programming is a skill that is in demand in our department and in our discipline more generally, and I strongly believe it is generally useful.
Of course, there are other programming languages well suited to statistics including Stata and Python.  Ultimately, I'm teaching R because there was consensus among the faculty in the department who are likely to teach in this sequence going forward that R made the most sense.
Our reasoning was that:
* R is freely available and open source
* R is becoming the most widely used package in statistical fields and is (by our estimate) used by most academics in my cohort or later in statistics, political science, and economics already.
* R is the system (along with Stata) that will be in other CSSS advanced stats classes we hope students will continue to take after COM521.
* R is a better general-purpose programming language than software like Stata, which means that R programming skills will let students solve non-statistical problems like collecting data from the web and will make it easier to learn other programming languages.
For students with a strong psychometric focus or whose research will be limited to linear and logistic regression or ANOVA on small pre-collected datasets and similar, SPSS will likely be fine. R has a higher barrier to entry than SPSS but its ceiling is ''much'' higher.


== Note About This Syllabus ==
== Note About This Syllabus ==
Line 58: Line 35:


# Although details on this syllabus will change, I will not change readings or assignments less than one week before they are due. If I don't fill in a "To Be Determined" one week before it's due, it is dropped. If you plan to read more than one week ahead, contact me first.
# Although details on this syllabus will change, I will not change readings or assignments less than one week before they are due. If I don't fill in a "To Be Determined" one week before it's due, it is dropped. If you plan to read more than one week ahead, contact me first.
# Closely monitor your email or [https://canvas.uw.edu/courses/1098035/announcements the announcements section on the course website on Canvas]. When I make changes, these changes will be recorded in [http://wiki.communitydata.cc/index.php?title=Statistics_and_Statistical_Programming_(Winter_2017)&action=history the history of this page] so that you can track what has changed and I will summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
# Closely monitor your email or [https://canvas.uw.edu/courses/1124086/announcements the announcements section on the course website on Canvas]. When I make changes, these changes will be recorded in [http://wiki.communitydata.cc/index.php?title=Statistics_and_Statistical_Programming_(Winter_2017)&action=history the history of this page] so that you can track what has changed and I will summarize these changes in an announcement on Canvas that will be emailed to everybody in the class.
# I will ask the class for voluntary anonymous feedback frequently — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback.
# I will ask the class for voluntary anonymous feedback frequently — especially toward the beginning of the quarter. Please let me know what is working and what can be improved. In the past, I have made many adjustments based on this feedback.


Line 70: Line 47:
Diez, Barr, and Çetinkaya-Rundel's book is a free, and freely-licensed, online statistics textbook. Over the last seven years, the book has also developed a large online community of students and teachers who have shared other resources. The book, lecture notes, and more are all freely licensed which has allowed the text to be adapted in a series of different fields. The book is excellent and it has been adopted extraordinarily widely. You can buy versions from Amazon in either [https://www.openintro.org/redirect.php?go=amazon_os3_hardcover&referrer=/stat/textbook.php full color hardcover] ($19.99) or in [https://www.openintro.org/redirect.php?go=createspace_os3&referrer=/stat/textbook.php black and white paperback] ($7.60). I haven't purchased a paper copy so I can't speak to the quality of either.
Diez, Barr, and Çetinkaya-Rundel's book is a free, and freely-licensed, online statistics textbook. Over the last seven years, the book has also developed a large online community of students and teachers who have shared other resources. The book, lecture notes, and more are all freely licensed which has allowed the text to be adapted in a series of different fields. The book is excellent and it has been adopted extraordinarily widely. You can buy versions from Amazon in either [https://www.openintro.org/redirect.php?go=amazon_os3_hardcover&referrer=/stat/textbook.php full color hardcover] ($19.99) or in [https://www.openintro.org/redirect.php?go=createspace_os3&referrer=/stat/textbook.php black and white paperback] ($7.60). I haven't purchased a paper copy so I can't speak to the quality of either.


Verzani's book is an introduction to the R programming language. It's designed to be used as a companion to a basic introductory statistics textbook (like OpenIntro). It's a poor stand-alone text but it will provide good resources for the material we're covering in the course and it should act as a good reference going forward. The book is available online for about $50.
Verzani's book is an introduction to the R programming language. It's designed to be used as a companion to a basic introductory statistics textbook (like OpenIntro). It's a poor stand-alone text but it will provide good resources for the material we're covering in the course and it should act as a good reference going forward. The book is available online for about $50. ''I'd recommend holding off on purchasing the book until after the first class.''


Although it's not required for the course, I want to point you to these two books. When I was learning R, these both were very useful references:
Although it's not required for the course, I want to point you to these two books. When I was learning R, these both were very useful references:
Line 80: Line 57:


* [ftp://cran.r-project.org/pub/R/doc/contrib/Baggott-refcard-v2.pdf Baggott's R Reference Card v2] — When I was learning R, I ''literally'' took a similar reference card with me everywhere and looked at it dozens of times a day.
* [ftp://cran.r-project.org/pub/R/doc/contrib/Baggott-refcard-v2.pdf Baggott's R Reference Card v2] — When I was learning R, I ''literally'' took a similar reference card with me everywhere and looked at it dozens of times a day.
* [https://stackoverflow.com/questions/tagged/r StackOverflow R Tag] — Somebody already had your question about how to do ''X'' in R. They asked it, and several people have answered it, on StackOverflow. Learning to read this effectively will take time but as you build up some basic familiarity with R and with StackOverflow, it will get easier. I promise.
* [https://stackoverflow.com/questions/tagged/r StackOverflow R Tag] — Somebody already had your question about how to do ''X'' in R. They asked it, and several people have answered it, on StackOverflow.
* [http://rseek.org/ Rseek] — Rseek is a modified version of Google that searches only R websites. Sometimes, R is hard to search for because R is a common letter. This has become much easier over time as R has become more popular but it might still be the case sometimes and Rseek is a good solution.


== Assignments ==
== Assignments ==
Line 92: Line 68:


* '''Statistics questions''' — These will be questions about statistics from the OpenIntro sections as well as any empirical papers that are listed as required for that day.
* '''Statistics questions''' — These will be questions about statistics from the OpenIntro sections as well as any empirical papers that are listed as required for that day.
* '''Programming challenges''' These will be R programming problems that cover material from the Verzani text that was listed as required from the previous session.
* '''Programming challenges''' -- These will be R programming problems that cover material from the Verzani text that was listed as required from the previous session.


I won't be grading these assignments and I won't be asking you to turn in anything for the ''statistics questions'' portion of the weekly assignment. That said, we will spend a good chunk of class each day going through the answers to the questions due on that day.
I won't be grading these assignments and I won't be asking you to turn in anything for the ''statistics questions'' portion of the weekly assignment. That said, we will spend a good chunk of class each day going through the answers to the questions due on that day.


Because randomness is an extremely important concept in statistics, I will use a small R program to '''randomly cold call''' on students in the class to walk through your "answer" to each question and explain your reasoning to the class. We'll then have an opportunity to discuss the different approaches as a group. I don't promise to ask all of these questions in class (especially if it's clear that folks get the point). Although I might ask them, I won't cold call for questions that are not on the list.
Because randomness is an extremely important concept in statistics, I will use a small R program to '''randomly cold call''' on students in the class to walk through your "answer" to each question and explain your reasoning to the class. We'll then have an opportunity to discuss the different approaches as a group. I don't promise to ask all of these questions in class (especially if clear that folks get the point). Although I might ask them, I won't cold call for questions that are not on the list.


For the programming challenges, I will ask that everybody shares code for any solutions to programming problems before class so we can walk through in class. If you get completely stuck on a problem and cannot "solve" it, that's OK, but share the code that you do have so that you can walk us through what you did and what you were thinking.
For the programming challenges, I will ask that everybody shares code for any solutions to programming problems before class so we can walk through in class. If you get completely stuck on a problem and cannot "solve" it, that's OK, but share the code that you do have so that you can walk us through what you did and what you were thinking.
Line 117: Line 93:
* '''Ensure replicability''' — I'll expect you all to provide code and data for your analysis in a way that makes your work replicable by other researchers.
* '''Ensure replicability''' — I'll expect you all to provide code and data for your analysis in a way that makes your work replicable by other researchers.


Although it's not required, I ''strongly urge each of you'' to take this opportunity to produce a document that will further your academic career outside of the class. There are many ways that this can happen but the obvious ones are that the paper is something you can submit for publication to a journal or conference, that provides primarily analysis for or acts as a pilot analysis that you can report in a grant proposal or thesis proposal, and/or that serves as part of your masters thesis or dissertation.
Although it's not required, I ''strongly urge each of you'' to take this opportunity to produce a document that will further your to academic career outside of the class. There are many ways that this can happen but the obvious ones are that the paper is something you can submit for publication to a journal or conference, that provides primarily analysis for or acts as a pilot analysis that you can report in a grant proposal or thesis proposal, and/or that serves as part of your masters thesis or dissertation.


==== Project and Dataset Identification ====
==== Project and Dataset Identification ====
Line 131: Line 107:
* An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain when you will have access to the data.
* An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain when you will have access to the data.


==== Final Project Outline ====
==== Final Project ====


;Outline Due Date: February 21
;Outline Due Date: February 21
;Maximum outline length: 5 pages
;Maximum outline length: 5 pages
;Deliverables: Turn in in Canvas
The outline should have the following sections: (a) Rationale; (b) Objectives, including (b.1) General Objectives and (b.2) Specific Objectives; (c) Null hypotheses; (d) Conceptual Diagram; (e) Measures; (f) Dummy Tables.
An excellent example from my partner Mika Matsuzaki is [https://canvas.uw.edu/courses/1098035/files/40388318/download?wrap=1 online in Canvas]. Your diagram will likely be much less complicated than Matsuzaki's. Also, please don't be distracted by the fact that Mika does public health. It's the basic form I want you all to emulate, not the content. You can read [http://ajcn.nutrition.org/content/99/6/1450.full the published paper] to compare.
The example includes everything except a "Measures" section. Your Measures section only needs to include a two-column table where column 1 is the name of each variable in your analysis and column 2 is the specific operationalization of that measure and a description of how you will create it.
==== Final Project ====
;Paper Due Date: March 19
;Paper Due Date: March 19
;Maximum length: 6000 words (~20 pages)
;Maximum outline length: 6000 words (~20 pages)
;Presentation Date: March 14
;Presentation Date: March 7
;All Deliverables: Turn in in Canvas
;All Deliverables: Turn in in Canvas


Line 156: Line 122:
I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.
I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.


In terms of content:
'''''Details Forthcoming:''''' ''Although this material is still somewhat thin, I'll be posting many additional details about the expectations for the final paper as we move forward through the quarter.''
 
* In terms of the structure of the paper, please see the page that I've written on the [[structure of a quantitative empirical research paper]].
* In terms of the structure of your presentation, you've got some latitude but this document on [https://canvas.uw.edu/files/40848246/download?download_frd=1 Creating a Successful Scholarly Presentation] (link is in Canvas) will likely be useful.


=== Grading ===
=== Grading ===
Line 180: Line 143:
* Take a look at datasets available in the [https://dataverse.harvard.edu/ Harvard Dataverse] (the largest collection of social science research data) or one of the other members of the [http://dataverse.org/ Dataverse network].
* Take a look at datasets available in the [https://dataverse.harvard.edu/ Harvard Dataverse] (the largest collection of social science research data) or one of the other members of the [http://dataverse.org/ Dataverse network].
* Look at the collection of social scientific datasets at [https://www.icpsr.umich.edu/icpsrweb/ICPSR/ ICPSR] (UW is a member). There are an enormous number of very rich datasets.
* Look at the collection of social scientific datasets at [https://www.icpsr.umich.edu/icpsrweb/ICPSR/ ICPSR] (UW is a member). There are an enormous number of very rich datasets.
* Use the [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
* The [http://scientificdata.isa-explorer.org/index.html ISA Explorer] to find datasets. Keep in mind the large majority of datasets it will search are drawn from the natural sciences.
* Set up a meeting with Jennifer Muilenburg — Data Curriculum and Communications Librarian who runs [https://www.lib.washington.edu/digitalscholarship/services/data research data services at the UW libraries]. Her email is: libdata@uw.edu. I have talked to her about this course and she is excited about meeting with you to help.
* Set up a meeting with Jennifer Muilenburg — Data Curriculum and Communications Librarian who runs [https://www.lib.washington.edu/digitalscholarship/services/data research data services at the UW libraries]. Her email is: libdata@uw.edu
* [http://fivethirtyeight.com FiveThirtyEight.com] has published a [https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html GitHub repository and an R package] with pre-processed and cleaned versions of many of the datasets they use for articles published on their website.


In general, you're responsible for making sure that you're on the right side of the human subject rules and that work is ethical. Class projects generally do not need IRB approval but I hope that each of your projects will turn into something more. If your study involves human subjects research, ''that'' work will need IRB oversight of some sort. In general, you can't do a class project without IRB approval and then retroactively get it later. Secondary analysis of anonymized data is generally not considered human subjects research but I strongly suggest that you get a determination from [https://www.washington.edu/research/hsd UW's Human Subject Division] before you start. For work that is not considered human subjects research, this can often happen in a few hours or days. If you need a faculty sponsor, that should ideally be your advisor. If that doesn't make sense for any of you, I'm happy to talk about serving as the faculty supervisor for the work.
In general, you're responsible for making sure that you're on the right side of the human subject rules and that work is ethical. Class projects generally do not need IRB approval but I hope that each of your projects will turn into something more. If your study involves human subjects research, ''that'' work will need IRB oversight of some sort.


== Structure of Class ==
== Structure of Class ==


I expect everybody to come to class, every week, with their laptop and a power cord, being ready to answer any question on the problem set and having uploaded and shared code to the code related questions. The class is listed as nearly 4 hours long and, with the exception of a few short breaks, I intend to use the entire period. Be in class on time and be plugged in and ready to go.
I expect everybody to come to class, every week, with their laptop and a power cord, being ready to answer any question on the problem set and having uploaded and shared code to the code related questions. The class is listed as nearly 4 hours long and, with the exception of a few short breaks, I intend to use the entire period most days.
 
When it comes to the statistics part of this material, this will be a primarily "flipped" classroom. What this means is that we'll be relying on the textbook and other resources to introduce the material and we'll be using the class to discuss it and answer questions that come up.


Although the structure of class will vary, it will generally include the following parts.
Although the structure of class will vary, it will generally include the following parts.


# Quick updates about assignments, projects, and a meta-discussion about the class.
# Quick updates about assignments.
# Discussion of '''programming challenges''' due that day.
# Discussion of '''programming challenges''' due that day.
# [''Possibly/Sometimes''] Short lecture and/or Q&A about new material in Diez, Barr, and Çetinkaya-Rundel
# [''Possibly/Sometimes''] Short lecture and/or Q&A about new material in Diez, Barr, and Çetinkaya-Rundel
Line 208: Line 168:


Hopefully, the material in OpenIntro feels very familiar from COM520. The programming material will be new but I want you to read it before you come to class so we can work through the examples as a group.
Hopefully, the material in OpenIntro feels very familiar from COM520. The programming material will be new but I want you to read it before you come to class so we can work through the examples as a group.
'''Assignment (Complete Before Class):'''
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 1]]


'''Required Readings:'''
'''Required Readings:'''


* Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data)
* Diez, Barr, and Çetinkaya-Rundel: §1 (Introduction to data)
* Verzani: §1 (Getting Started), §2 (Univariate data) [[https://faculty.washington.edu/makohill/com521/verzani-usingr-ch1_ch2.pdf Available with UWNetID]]
* Verzani: §1 (Getting Started), §2 (Univariate data)
* Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Available through UW libraries]]
* Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” ''Proceedings of the National Academy of Sciences'' 111(24):8788–90. [[http://www.pnas.org/content/111/24/8788.full Available through UW libraries]]


'''Optional Readings:'''
'''Optional Readings/Resources:'''


* Verzani: §A (Programming)
* Verzani: §A (Programming)
'''Assignment (Complete Before Class):'''
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 1]]
'''Lectures:'''
* [https://communitydata.cc/~mako/2017-COM521/com521-week_01-r_programming_intro-20170103.ogv Week 1 R lecture screencast (Part I): Introduction to R and univariate statistics] (~1 hour 47 minutes)
* [https://communitydata.cc/~mako/2017-COM521/com521-week_01-github_rscripts-20170104.ogv Week 1 R lecture screencast (Part II): Setting up git/GitHub and saving files in RStudio] (~40 minutes)
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 1]]
'''Resources:'''
* [https://www.openintro.org/download.php?file=os3_slides_01&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §1 Lecture Notes]
* [https://www.openintro.org/download.php?file=os3_slides_01&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §1 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including some for §1
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including some for §1
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 1]]


=== Week 2: Tuesday January 10: Probability and Visualization ===
=== Week 2: Tuesday January 10: Probability and Visualization ===
Line 239: Line 190:


* Diez, Barr, and Çetinkaya-Rundel: §2 (Probability)
* Diez, Barr, and Çetinkaya-Rundel: §2 (Probability)
* Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics) [[https://faculty.washington.edu/makohill/com521/verzani-usingr-ch3.1-2_ch4_ch5.pdf Available with UW NetID]]
* Verzani: §3.1-2 (Bivariate data), §4 (Multivariate data), §5 (Multivariate graphics)
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]
* ''Empirical Paper TBD''


'''Assignment (Complete Before Class):'''
=== Week 3: Tuesday January 17: Distributions ===


* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 2]]
'''Lectures:'''
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 2]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_02-lists_dataframes_graphing-20170111.ogv Week 2 R lecture screencast: lists, matrices, data frames, and beginning graphing] (~1 hour 8 minutes)
'''Resources:'''
* [https://www.openintro.org/download.php?file=os3_slides_02&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §2 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 short videos for §2
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 2]]
=== Week 3: Tuesday January 17: Distributions ===


'''Required Readings:'''
'''Required Readings:'''


* Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4. You should also read the rest of the chapter (§3.3 and §3.5). I won't assign problem set questions about it, but it's still important to be familiar with it.
* Diez, Barr, and Çetinkaya-Rundel: §3.1-3.2, §3.4
* Verzani: §6 (Populations)
* Verzani: §6 (Populations)
 
* ''Empirical Paper TBD''
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 3]]
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 3]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-loading_data_functions_apply_misc.ogv Week 3 R lecture screencast: Loading data, functions; apply(), lapply(), sapply(); several miscellaneous functions] (~34 minutes) — This is the same material I covered in class. If you followed it, there's no reason you need to go back to this.
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-dates_tapply_merge.ogv Week 3 R lecture screencast: Dates; tapply(); and merge()] (~38 minutes) [The audio seems to be broken for the last 10 minutes. Sorry about that! I've rerecorded that below.]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-merge.ogv Week 3 R lecture screencast: merge()] (~13 minutes) [Rerecording of the last few minutes of the previous video.]
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_03&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §3 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 2 videos for §3.1 and §3.2
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 3]]


=== Week 4: Tuesday January 24: Statistical significance and hypothesis testing ===
=== Week 4: Tuesday January 24: Statistical significance and hypothesis testing ===
Line 287: Line 208:
* Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference)
* Diez, Barr, and Çetinkaya-Rundel: §4 (Foundations for inference)
* Verzani: §7 (Statistical inference), §8 (Confidence intervals)
* Verzani: §7 (Statistical inference), §8 (Confidence intervals)
 
* ''Empirical Paper TBD''
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 4]]
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 4]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_04-misc_confint_simulation-20170125.ogv Week 4 R lecture screencast: order(); confidence intervals; simulations drawn from repeated random samples] (~27 minutes)
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_04&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §4 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 7 videos for nearly all of §4
* [[Statistics and Statistical Programming (Winter 2017)/Session plan: Week 4]]


=== Week 5: Tuesday January 31: Continuous Numeric Data & ANOVA ===
=== Week 5: Tuesday January 31: Continuous Numeric Data & ANOVA ===
Line 309: Line 216:
* Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data)
* Diez, Barr, and Çetinkaya-Rundel: §5 (Inference for numerical data)
* Verzani: §9 (Significance tests), §12 (Analysis of variance)
* Verzani: §9 (Significance tests), §12 (Analysis of variance)
* Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” ''The American Statistician'' 60(4):328–31. [[http://dx.doi.org/10.1198/000313006X152649 Available through UW Libraries]]
* ''Empirical Paper TBD''
* Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. ''Public Relations Review'', 33(3), 340–342. https://doi.org/10.1016/j.pubrev.2007.05.016 [Available through UW Libraries]
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 5]]
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 5]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-ttests_and_anova.ogv Week 5 R lecture screencast: t-tests] (~22 minutes)
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-for_if.ogv Week 5 R lecture screencast: for loops and if statements] (~12 minutes)
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_05&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §5 Lecture Notes]


=== Week 6: Tuesday February 7: Categorical data ===
=== Week 6: Tuesday February 7: Categorical data ===
Line 333: Line 224:
* Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
* Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
* Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” ''American Scientist'' 102(6):460. [[https://www.americanscientist.org/issues/pub/2014/6/the-statistical-crisis-in-science/1 Available through UW Libraries]] (This is a reworked version of [http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf this unpublished manuscript], which provides more detailed examples.)
* ''Empirical Paper TBD''
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 6]]
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 6]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_06-tables_chisq_debugging.ogv Week 6 R lecture screencast: Tables, <math>\chi^2</math>-tests, and debugging.] (~40 minutes)
 
'''Resources:'''


* [https://www.openintro.org/download.php?file=os3_slides_06&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §6 Lecture Notes]
=== Week 7: Tuesday February 14: Simple Linear Regression ===
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7
 
=== Week 7: Tuesday February 14: Linear Regression ===


'''Required Readings:'''
'''Required Readings:'''


* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression); §8.1-8.3 (Multiple regression)
* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression)
* OpenIntro eschews a mathematical introduction to correlation. Please look over [https://en.wikipedia.org/wiki/Correlation_and_dependence the Wikipedia article on correlation and dependence] and pay attention to the formulas. Correlation is tedious to compute by hand, but I'd like you to at least see what goes into it.
* Verzani: §11.1-2 (Linear regression)
* Verzani: §11.1-2 (Linear regression)
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available in UW libraries]]
* ''Empirical Paper TBD''
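
As a quick reference for the correlation reading above, here is the standard textbook definition of the sample Pearson correlation coefficient. It is given as a sketch of "what goes into it" and is not taken from OpenIntro or Verzani. The notation is an assumption made for this illustration: <math>(x_1, y_1), \ldots, (x_n, y_n)</math> are the paired observations and <math>\bar{x}</math> and <math>\bar{y}</math> are their sample means.

:<math>r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}</math>

The numerator sums the products of each pair's deviations from the means, and the denominator rescales that sum so that <math>r</math> always falls between -1 and 1.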
 
'''Assignment (Complete Before Class):'''


* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 7]]
=== Week 8: Tuesday February 21: Multiple and Logistic Regression ===
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 7]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_07-linear_regression.ogv Week 7 R lecture screencast: linear regression] (~42 minutes)
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_07&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §7 Lecture Notes]
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7 and 3 videos on the sections §8.1-8.3
 
=== Week 8: Tuesday February 21: Polynomial Terms, Interactions, and Logistic Regression ===


'''Required Readings:'''
'''Required Readings:'''


* [https://onlinecourses.science.psu.edu/stat501/node/301 Lesson 8: Categorical Predictors] and [https://onlinecourses.science.psu.edu/stat501/node/318 Lesson 9: Data Transformations] from the Penn State Eberly College of Science STAT 501 Regression Methods course. There are several subparts (many quite short); please read them all carefully.
* Diez, Barr, and Çetinkaya-Rundel: §8 (Multiple and logistic regression)
* Diez, Barr, and Çetinkaya-Rundel: §8.4 (Multiple and logistic regression)
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” ''PLoS Medicine'' 2(8):e124. [[http://dx.doi.org/10.1371%2Fjournal.pmed.0020124 Open Access]]
* ''Empirical Paper TBD''
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available in UW libraries]]
 
'''Optional Readings:'''
 
* Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” ''PLOS Biology'' 13(3):e1002106. [[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 Open Access]]
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 8]]
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 8]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_08-more_regression_anova_redux.ogv Week 8 R lecture screencast: more on linear regression, including interactions, polynomials, log transformations; anova] (~28 minutes)
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including a video on §8.4
* I've written this document which will likely be useful for many of you: [https://communitydata.cc/~mako/2017-COM521/logistic_regression_interpretation.html Interpreting Logistic Regression Coefficients with Examples in R]


=== Week 9: Tuesday February 28: Consulting Meetings ===
=== Week 9: Tuesday February 28: Consulting Meetings ===
Line 407: Line 246:
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.


=== Week 10: Tuesday March 7: Consulting Meetings ===
=== Week 10: Tuesday March 7: Final Presentations ===
 
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.
 
=== Week 11: March 14: Final Presentations ===


== Administrative Notes ==
== Administrative Notes ==