Editing Statistics and Statistical Programming (Winter 2017)

From CommunityData
Latest revision Your text
Line 30: Line 30:
* Feel comfortable reading papers that use basic statistical techniques.
* Feel comfortable reading papers that use basic statistical techniques.
* Feel comfortable and prepared enrolling in future statistics courses in CSSS.
* Feel comfortable and prepared enrolling in future statistics courses in CSSS.
== Why Statistical Programming? ==
This class will focus much more on statistical programming in R than most similar classes do. Most similar classes in communication focus on an easier-to-use statistical package like SPSS.
We're focusing on programming instead of a package like SPSS for several reasons:
* Students who understand a programming language won't be limited to the "canned" functions in off-the-shelf packages.
* Pedagogically, programming helps students build a deeper understanding of the mathematics and assumptions behind the canned functions, both by allowing them to read the code "behind" those functions and by allowing them to implement the functions themselves in assignments.
* Analyses composed of code instead of clicks support reproducible research: code can document every step of an analysis, including data cleaning and conversion, where errors are common and very difficult to detect.
* Programming is a skill that is in demand in our department and in our discipline more generally, and I strongly believe it is generally useful.
Of course, there are other programming languages well suited to statistics, including Stata and Python. Ultimately, I'm teaching R because there was consensus among the faculty in the department who are likely to teach statistics classes in this sequence going forward that R made the most sense.
Our reasoning was that:
* R is freely available and open source
* R is becoming the most widely used package in statistical fields and is (by our estimate) already used by most academics in my cohort or later in statistics, political science, and economics.
* R is the system (along with Stata) that will be used in the other CSSS advanced stats classes we hope students will continue to take after COM521.
* R is a better general-purpose programming language than software like Stata, which means that R programming skills will let students solve non-statistical problems like collecting data from the web and will make it easier to learn other programming languages.
For students with a strong psychometric focus, or whose research will be limited to linear and logistic regression or ANOVA on small pre-collected datasets and the like, SPSS will likely be fine. R has a higher barrier to entry than SPSS, but its ceiling is ''much'' higher.


== Note About This Syllabus ==
== Note About This Syllabus ==
Line 131: Line 109:
* An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain when you will have access to the data.
* An identification of the dataset you will use and a description of the columns or type of data it will include. If you do not currently have access to these data, explain when you will have access to the data.


==== Final Project Outline ====
==== Final Project ====


;Outline Due Date: February 21
;Outline Due Date: February 21
;Maximum outline length: 5 pages
;Maximum outline length: 5 pages
;Deliverables: Turn in in Canvas
The outline should have the following sections: (a) Rationale; (b) Objectives, including (b.1) General Objectives and (b.2) Specific Objectives; (c) Null hypotheses; (d) Conceptual Diagram; (e) Measures; (f) Dummy Tables.
An excellent example from my partner Mika Matsuzaki is [https://canvas.uw.edu/courses/1098035/files/40388318/download?wrap=1 online in Canvas]. Your diagram will likely be much less complicated than Matsuzaki's. Also, please don't be distracted by the fact that Mika does public health. It's the basic form I want you all to emulate, not the content. You can read [http://ajcn.nutrition.org/content/99/6/1450.full the published paper] to compare.
The example includes everything except a "Measures" section. Your Measures section only needs to include a two-column table where column 1 is the name of each variable in your analysis and column 2 is the specific operationalization of that measure and a description of how you will create it.
==== Final Project ====
;Paper Due Date: March 19
;Paper Due Date: March 19
;Maximum length: 6000 words (~20 pages)
;Maximum outline length: 6000 words (~20 pages)
;Presentation Date: March 14
;Presentation Date: March 7
;All Deliverables: Turn in in Canvas
;All Deliverables: Turn in in Canvas


Line 156: Line 124:
I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.
I have a strong preference for you to write this paper individually but I'm open to the idea that you may want to work with others in the class.


In terms of content:
'''''Details Forthcoming:''''' ''Although this material is still somewhat thin, I'll be posting many additional details about the expectations for the final paper as we move forward through the quarter.''
 
* In terms of the structure of the paper, please see the page that I've written on the [[structure of a quantitative empirical research paper]].
* In terms of the structure of your presentation, you've got some latitude but this document on [https://canvas.uw.edu/files/40848246/download?download_frd=1 Creating a Successful Scholarly Presentation] (link is in Canvas) will likely be useful.


=== Grading ===
=== Grading ===
Line 225: Line 190:
'''Lectures:'''
'''Lectures:'''


* [https://communitydata.cc/~mako/2017-COM521/com521-week_01-r_programming_intro-20170103.ogv Week 1 R lecture screencast (Part I): Introduction to R and univariate statistics] (~1 hour 47 minutes)
* [https://communitydata.cc/~mako/com521-week_01-r_programming_intro-20170103.ogv Week 1 R Lecture (Part I): Introduction to R and Univariate statistics] (~1 hour 47 minutes)
* [https://communitydata.cc/~mako/2017-COM521/com521-week_01-github_rscripts-20170104.ogv Week 1 R lecture screencast (Part II): Setting up git/GitHub and saving files in RStudio] (~40 minutes)
* [https://communitydata.cc/~mako/com521-week_01-github_rscripts-20170104.ogv Week 1 R Lecture (Part II): Setting up Git/GitHub and saving files in RStudio] (~40 minutes)
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 1]]
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 1]]


Line 249: Line 214:


* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 2]]
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 2]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_02-lists_dataframes_graphing-20170111.ogv Week 2 R lecture screencast: lists, matrixes, data frames, and beginning graphing] (~1 hour 8 minutes)
* [https://communitydata.cc/~mako/com521-week_02-lists_dataframes_graphing-20170111.ogv Week 2 R Lecture: Lists, Matrixes, Data Frames, and Beginning Graphing] (~1 hour 8 minutes)


'''Resources:'''
'''Resources:'''
Line 271: Line 236:


* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 3]]
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 3]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-loading_data_functions_apply_misc.ogv Week 3 R lecture screencast: Loading data, functions; apply(), lapply(), sapply(); several miscellaneous functions] (~34 minutes) — This is the same material I covered in class. If you followed it, there's no reason you need to go back to this.
* [https://communitydata.cc/~mako/com521-week_03-loading_data_functions_apply_misc.ogv Week 3 Lecture: Loading data, functions; apply, lapply, sapply; several miscellaneous functions] (~34 minutes) — This is the same material I covered in class. If you followed it, there's no reason you need to go back to this.
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-dates_tapply_merge.ogv Week 3 R lecture screencast: Dates; tapply(); and merge()] (~38 minutes) [The audio seems to be broken for the last 10 minutes. Sorry about that! I've rerecorded that below.]
* [https://communitydata.cc/~mako/com521-week_03-dates_tapply_merge.ogv Week 3 Lecture: Dates; tapply(); and merge()] (~38 minutes) [The audio seems to be broken for the last 10 minutes. Sorry about that! I've rerecorded that below.]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_03-merge.ogv Week 3 R lecture screencast: merge()] (~13 minutes) [Rerecording of the last few minutes of the previous video.]
* [https://communitydata.cc/~mako/com521-week_03-merge.ogv Week 3 Lecture: merge()] (~13 minutes) [Rerecording of the last few minutes of the previous video.]


'''Resources:'''
'''Resources:'''
Line 295: Line 260:


* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 4]]
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 4]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_04-misc_confint_simulation-20170125.ogv Week 4 R lecture screencast: order(); confidence intervals; simulations drawn from repeated random samples] (~27 minutes)
* [https://communitydata.cc/~mako/com521-week_04-misc_confint_simulation-20170125.ogv Week 4 Lecture: order(); confidence intervals; simulations drawn from repeated random samples] (~27 minutes)


'''Resources:'''
'''Resources:'''
Line 311: Line 276:
* Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” ''The American Statistician'' 60(4):328–31. [[http://dx.doi.org/10.1198/000313006X152649 Available through UW Libraries]]
* Gelman, Andrew and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” ''The American Statistician'' 60(4):328–31. [[http://dx.doi.org/10.1198/000313006X152649 Available through UW Libraries]]
* Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. ''Public Relations Review'', 33(3), 340–342. https://doi.org/10.1016/j.pubrev.2007.05.016 [Available through UW Libraries]
* Sweetser, K. D., & Metzgar, E. (2007). Communicating during crisis: Use of blogs as a relationship management tool. ''Public Relations Review'', 33(3), 340–342. https://doi.org/10.1016/j.pubrev.2007.05.016 [Available through UW Libraries]
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]


'''Assignment (Complete Before Class):'''
'''Assignment (Complete Before Class):'''
Line 320: Line 284:


* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 5]]
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 5]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-ttests_and_anova.ogv Week 5 R lecture screencast: t-tests] (~22 minutes)
* [https://communitydata.cc/~mako/2017-COM521/com521-week_05-for_if.ogv Week 5 R lecture screencast: for loops and if statements] (~12 minutes)


'''Resources:'''
'''Resources:'''
Line 333: Line 295:
* Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
* Diez, Barr, and Çetinkaya-Rundel: §6 (Inference for categorical data)
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
* Verzani: §3.4 (Bivariate categorical data); §10.1-10.2 (Goodness of fit)
* Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” ''American Scientist'' 102(6):460. [[https://www.americanscientist.org/issues/pub/2014/6/the-statistical-crisis-in-science/1 Available through UW Libraries]] (This is a reworked version of [http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf this unpublished manuscript] which provides more detailed examples.)
* Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” ''PLoS Medicine'' 2(8):e124. [[http://dx.doi.org/10.1371%2Fjournal.pmed.0020124 Open Access]]
* Buechley, Leah and Benjamin Mako Hill. 2010. “LilyPad in the Wild: How Hardware’s Long Tail Is Supporting New Engineering and Design Communities.” Pp. 199–207 in ''Proceedings of the 8th ACM Conference on Designing Interactive Systems.'' Aarhus, Denmark: ACM. [[https://mako.cc/academic/buechley_hill_DIS_10.pdf PDF available on my personal website]]
* Gelman, Andrew and Eric Loken. 2014. “The Statistical Crisis in Science Data-Dependent Analysis—a ‘garden of Forking Paths’—explains Why Many Statistically Significant Comparisons Don’t Hold Up.” ''American Scientist'' 102(6):460. [[https://www.americanscientist.org/issues/pub/2014/6/the-statistical-crisis-in-science/1 Available through UW Libraries]] (This is a reworked version of [http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf this unpublished manuscript] which provides more detailed examples.)
* ''Empirical Paper TBD''


'''Assignment (Complete Before Class):'''
'''Assignment (Complete Before Class):'''
Line 343: Line 306:


* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 6]]
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 6]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_06-tables_chisq_debugging.ogv Week 6 R lecture screencast: Tables, <math>\chi^2</math>-tests, and debugging.] (~40 minutes)


'''Resources:'''
'''Resources:'''
Line 350: Line 312:
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7


=== Week 7: Tuesday February 14: Linear Regression ===
=== Week 7: Tuesday February 14: Simple Linear Regression ===


'''Required Readings:'''
'''Required Readings:'''


* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression); §8.1-8.3 (Multiple regression)
* Diez, Barr, and Çetinkaya-Rundel: §7 (Introduction to linear regression)
* OpenIntro eschews a mathematical introduction to correlation. Please look over [https://en.wikipedia.org/wiki/Correlation_and_dependence the Wikipedia article on correlation and dependence] and pay attention to the formulas. Correlation is tedious to compute by hand, but I'd like you to at least see what goes into it.
* Verzani: §11.1-2 (Linear regression)
* Verzani: §11.1-2 (Linear regression)
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available in UW libraries]]
* Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” ''PLOS Biology'' 13(3):e1002106. [[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 Open Access]]
 
* ''Empirical Paper TBD''
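
For reference, the quantity at the heart of the Wikipedia article on correlation mentioned in the readings above is Pearson's sample correlation coefficient. As a sketch, assuming <math>n</math> paired observations <math>(x_i, y_i)</math> with sample means <math>\bar{x}</math> and <math>\bar{y}</math> (notation chosen here for illustration, not taken from the textbook):

:<math>r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}</math>

The numerator measures how the two variables vary together; the denominator rescales by each variable's own spread so that <math>r</math> always falls between -1 and 1. In R, <code>cor(x, y)</code> computes this value by default.
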
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 7]]
 
'''Lectures:'''


* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 7]]
=== Week 8: Tuesday February 21: Multiple and Logistic Regression ===
* [https://communitydata.cc/~mako/2017-COM521/com521-week_07-linear_regression.ogv Week 7 R lecture screencast: linear regression] (~42 minutes)
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_07&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §7 Lecture Notes]
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including 4 videos for §7 and 3 videos on the sections §8.1-8.3
 
=== Week 8: Tuesday February 21: Polynomial Terms, Interactions, and Logistic Regression ===


'''Required Readings:'''
'''Required Readings:'''


* [https://onlinecourses.science.psu.edu/stat501/node/301 Lesson 8: Categorical Predictors] and [https://onlinecourses.science.psu.edu/stat501/node/318 Lesson 9: Data Transformations] from the Penn State Eberly College of Science STAT 501 Regression Methods Course. There are several subparts (many quite short); please read them all carefully.
* Diez, Barr, and Çetinkaya-Rundel: §8 (Multiple and logistic regression)
* Diez, Barr, and Çetinkaya-Rundel: §8.4 (Multiple and logistic regression)
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* Verzani: §11.3 (Linear regression), §13.1 (Logistic regression)
* Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” ''PLoS Medicine'' 2(8):e124. [[http://dx.doi.org/10.1371%2Fjournal.pmed.0020124 Open Access]]
* ''Empirical Paper TBD''
* Lampe, Cliff, and Paul Resnick. 2004. “Slash(Dot) and Burn: Distributed Moderation in a Large Online Conversation Space.” In ''Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04)'', 543–550. New York, NY, USA: ACM. doi:10.1145/985692.985761. [[http://dx.doi.org/10.1145/985692.985761 Available in UW libraries]]
 
'''Optional Readings:'''
 
* Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” ''PLOS Biology'' 13(3):e1002106. [[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 Open Access]]
 
'''Assignment (Complete Before Class):'''
 
* [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 8]]
 
'''Lectures:'''
 
* [[Statistics and Statistical Programming (Winter 2017)/R lecture outline: Week 8]]
* [https://communitydata.cc/~mako/2017-COM521/com521-week_08-more_regression_anova_redux.ogv Week 8 R lecture screencast: more on linear regression, including interactions, polynomials, log transformations; anova] (~28 minutes)
 
'''Resources:'''
 
* [https://www.openintro.org/download.php?file=os3_slides_08&referrer=/stat/slides/slides_0x.php Mine Çetinkaya-Rundel's OpenIntro §8 Lecture Notes]
* [https://www.openintro.org/stat/videos.php OpenIntro Video Lectures] including a video on §8.4
* I've written this document which will likely be useful for many of you: [https://communitydata.cc/~mako/2017-COM521/logistic_regression_interpretation.html Interpreting Logistic Regression Coefficients with Examples in R]


=== Week 9: Tuesday February 28: Consulting Meetings ===
=== Week 9: Tuesday February 28: Consulting Meetings ===
Line 411: Line 337:
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.
We won't meet as a group. Instead, you will each meet one-on-one with me to work through challenges and issues with your analysis.


=== Week 11: March 14: Final Presentations ===
=== Week 11: Date/Time TBD (Tentatively March 14): Final Presentations ===


== Administrative Notes ==
== Administrative Notes ==
Please note that all contributions to CommunityData are considered to be released under the Attribution-Share Alike 3.0 Unported (see CommunityData:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!
