Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 6: Difference between revisions

Revision as of 03:00, 3 February 2017

Programming Challenges

Let's re-evaluate some data from this paper:

Lagakos, S., & Mosteller, F. (1981). A case study of statistics in the regulatory process: the FD&C Red No. 40 experiments. Journal of the National Cancer Institute, 66(1), 197–212. [PDF]

I found a copy of the dataset at this link.

PC0. Download the dataset from this webpage. You'll find that it's not in an ideal setup. It's an Excel file (XLS) with a series of columns labeled X1 through X4. The format is not exactly tabular.

PC1. Load the data. Now get to work on reshaping the dataset. I think a good format would be a data frame with two columns: group, time of death (i.e., lifespan).

PC2. Create summary statistics and visualizations for each group. Write code that gives you (a) a visual sense of both the shape of the data and its relationships and (b) a sense of the degree to which the assumptions for t-tests and ANOVA hold. What is the global mean of your dependent variable?

PC3. Do a t-test between mice with no RD40 (the control group) and mice with at least a small amount. Then run a t-test between the high-dosage group and the control group.

PC4. Run an ANOVA using aov() to see if there is a difference between the groups.
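
The steps in PC1 through PC4 might be sketched in base R roughly as follows. This is only one possible approach, not the intended solution. The numbers are made-up placeholders rather than the real Red No. 40 data, and the sketch assumes that after import each group sits in its own column (X1 through X4), that shorter columns are padded with NA, and that X1 is the control group and X4 the high-dosage group. It also reads the first PC3 comparison as control versus any dose. Check all of these assumptions against the actual file.

```r
# Placeholder values for illustration only; NOT the real dataset.
# Assumed layout: one column per group, shorter columns padded with NA.
wide <- data.frame(
  X1 = c(70, 77, 83, 87, 92),
  X2 = c(49, 60, 63, 67, NA),
  X3 = c(30, 37, 56, 65, 76),
  X4 = c(34, 36, 48, 48, NA)
)

# PC1: stack() returns one column of values and one factor column
# naming the source column; rename them and drop the NA padding.
long <- stack(wide)
names(long) <- c("lifespan", "group")
long <- long[!is.na(long$lifespan), ]

# PC2: per-group summaries, the global mean, and a boxplot to eyeball
# shape and spread (relevant to the t-test and ANOVA assumptions).
print(tapply(long$lifespan, long$group, summary))
print(mean(long$lifespan))
boxplot(lifespan ~ group, data = long)

# PC3, first comparison: control versus any dose.
# Assumption: X1 is the control group.
long$any_dose <- long$group != "X1"
print(t.test(lifespan ~ any_dose, data = long))

# PC3, second comparison: high dosage versus control.
# Assumption: X4 is the high-dosage group.
two <- droplevels(subset(long, group %in% c("X1", "X4")))
print(t.test(lifespan ~ group, data = two))

# PC4: one-way ANOVA across all four groups.
fit <- aov(lifespan ~ group, data = long)
print(summary(fit))
```

With the real data, only the construction of `wide` should need to change; the reshaping and the tests operate on whatever lands in `long`.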

Statistical Questions from OpenIntro §6

Q0. Any questions or clarifications from the OpenIntro text or lecture notes?

Q1. Exercise 6.12 on public opinion about cannabis legalization

Q2. Exercise 6.20, a continuation of 6.12

Q3. Exercise 6.38 on translating a problem in English into statistical tests

Q4. Exercise 6.50, another voter/public opinion question

Questions on the Empirical Paper

Let's just go back to the Buechley and Hill paper on LilyPad Arduino:

Q5. For Study 1, let's focus on the statistical test:

(a) What is the unit of analysis? What is the dependent variable? The independent variable? What are the groups being compared in the test? Is it a one-way or two-way design?
(b) What is the null hypothesis being tested? What is the alternative hypothesis?
(c) Summarize or restate the results in statistical terms. Explain what these results mean in substantive terms. How convincing do you find these results? What should we be taking away?
(d) Why weren't we happy just leaving it where we did in week 2? Why bother with the statistical test?

Q6. Do the same as above but for Study 2.

@@ Line 1: / Line 1: @@
 == Programming Challenges ==
-: '''PC0.''' I've provided the full dataset from which I drew each of your samples in a TSV file in the directory <code>week_05</code> in [https://github.com/makoshark/uwcom521-assignments/ class assignment git repository]. These are ''tab delimited'', not comma delimited. TSV is related to CSV and is also a common format. Go ahead and load it into R (''HINT: <code>read.delim()</code>''). Take the mean of the variable <code>x</code> in that dataset. That is the true population mean, the thing we have been creating estimates of in week 2 and week 3.
+Let's re-evaluate some data from this paper:
+: Lagakos, S., & Mosteller, F. (1981). A case study of statistics in the regulatory process: the FD&C Red No. 40 experiments. ''Journal of the National Cancer Institute'', 66(1), 197–212. [[https://www.gwern.net/docs/statistics/1981-lagakos.pdf PDF]]
+I found a copy of the dataset [http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/owan/frames/frame.html at this link].
+: '''PC0.''' Download the dataset from [http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/owan/frames/frame.html this webpage]. You'll find that it's not in an ideal setup. It's an Excel file (XLS) with a series of columns labeled X1 through X4. The format is not exactly tabular.
+: '''PC1.''' Load the data. Now get to work on reshaping the dataset. I think a good format would be a data frame with two columns: group, time of death (i.e., lifespan).
+: '''PC2.''' Create summary statistics and visualizations for each group. Write code that gives you (a) a visual sense of both the shape of the data and its relationships and (b) a sense of the degree to which the assumptions for t-tests and ANOVA hold. What is the global mean of your dependent variable?
+: '''PC3.''' Do a t-test between mice with ''no'' RD40 (the control group) and mice with at least a small amount. Then run a t-test between the high-dosage group and the control group.
+: '''PC4.''' Run an ANOVA using <code>aov()</code> to see if there is a difference between the groups.
 == Statistical Questions from OpenIntro §6 ==