Statistics and Statistical Programming (Fall 2020)/pset6: Difference between revisions
Revision as of 19:13, 4 November 2020
Programming challenges (and statistical questions)
This week's programming challenges are all about analyzing continuous data. For most of this, I'd like you to replicate some of the analysis done in this paper (note: I do not think you need to read it deeply to answer the questions below):
- Lagakos, S., & Mosteller, F. (1981). A case study of statistics in the regulatory process: the FD&C Red No. 40 experiments. Journal of the National Cancer Institute, 66(1), 197–212. [PDF: https://www.gwern.net/docs/statistics/1981-lagakos.pdf]
Overall, the goal of this research was to understand whether/how doses of red dye number 40 affect the survival of mice (and, by extension, humans).
PC1. Download, import, and reshape the data
- Download the dataset by clicking through on the "Red Dye Number 40" link on this webpage (http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/owan/frames/frame.html). You'll find that it's not in an ideal setup. It's an Excel file (XLS) with a series of columns labeled X1 through X4 (the format is not particularly "tidy"). If you look at the website with the data and/or Table 1 in the paper, you should be able to figure out what each column stands for.
- Import the data into R and get to work on reshaping the dataset. I think a good format would be a data frame with two columns: group and weeks_alive.
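As a rough illustration of the reshaping step, here is a minimal sketch in base R. It uses a small made-up stand-in for the spreadsheet (the numbers are NOT the real data), and the group labels assigned to X1 through X4 are an assumption you should check against the website or Table 1. The commented-out import line assumes the readxl package and a hypothetical file name.

```r
# Hypothetical import step (file name is a placeholder, not the real one):
# wide <- as.data.frame(readxl::read_excel("red_dye_40.xls"))

# Made-up stand-in with the same wide layout: one column per group,
# shorter groups padded with NA.
wide <- data.frame(
  X1 = c(90, 95, 100, 85, 102),
  X2 = c(60, 72, 81, 66, NA),
  X3 = c(55, 40, 78, 62, 70),
  X4 = c(45, 50, 38, 58, NA)
)

# stack() turns the wide columns into two columns: values and ind.
long <- stack(wide)
names(long) <- c("weeks_alive", "group")

# Assumed mapping of columns to groups; verify before relying on it.
levels(long$group) <- c("control", "low", "medium", "high")

# Drop the NA padding and put the columns in the suggested order.
long <- long[!is.na(long$weeks_alive), c("group", "weeks_alive")]
str(long)
```

The same result can be reached with tidyr's pivot_longer(); base R is used here only to keep the sketch free of extra dependencies.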
PC2. Summarize the data
Using the two columns you just created, create summary statistics and visualizations for the dataset as a whole and for each of the groups. These descriptive analyses should give you a sense of the shape of the data and relationships across groups.
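One possible way to approach these summaries is sketched below. The data frame is a made-up stand-in in the group / weeks_alive format (the values are not the real data, and the group labels are assumptions), so only the pattern of calls carries over.

```r
# Made-up stand-in data in the long format suggested in PC1.
long <- data.frame(
  group = factor(
    rep(c("control", "low", "medium", "high"), times = c(5, 4, 5, 4)),
    levels = c("control", "low", "medium", "high")
  ),
  weeks_alive = c(90, 95, 100, 85, 102,
                  60, 72, 81, 66,
                  55, 40, 78, 62, 70,
                  45, 50, 38, 58)
)

# Whole-dataset summary.
summary(long$weeks_alive)

# Per-group n, mean, and standard deviation (one row per group).
by_group <- t(sapply(
  split(long$weeks_alive, long$group),
  function(x) c(n = length(x), mean = mean(x), sd = sd(x))
))
print(by_group)

# Visualizations: overall distribution, then a comparison across groups.
hist(long$weeks_alive, main = "Survival (all groups)", xlab = "Weeks alive")
boxplot(weeks_alive ~ group, data = long, ylab = "Weeks alive")
```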
SQ1. Discuss your descriptive analysis
Be sure to interpret anything noteworthy.
SQ2. State hypotheses
The plan here is to use ANOVA to evaluate whether there is a difference in survival time between the groups and then t-tests to compare the average survival times across some specific groups (see PC4 below for more details on which groups). State null and alternative hypotheses that correspond to these tests.
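For reference, the generic textbook form of these hypotheses can be written as below. The symbols are assumptions introduced here for illustration: $\mu_j$ denotes the mean survival time in group $j$ of $k$ groups, and $a$, $b$ stand for whichever two groups a given t-test compares. You still need to translate this into the specific groups in this study.

```latex
% One-way ANOVA (global test across k groups)
H_0:\ \mu_1 = \mu_2 = \cdots = \mu_k
\qquad
H_A:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j

% Two-sample t-test (difference in means between groups a and b)
H_0:\ \mu_a - \mu_b = 0
\qquad
H_A:\ \mu_a - \mu_b \neq 0
```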
SQ3. Address assumptions for the tests
Identify any assumptions you may need to make to conduct the ANOVA analysis and t-tests. Do these tests seem appropriate here? Why (not)?
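A few standard checks can help inform this discussion. The sketch below, again on made-up stand-in data (not the real values), looks at the normality of the residuals and the similarity of variances across groups. Which checks are most persuasive is a judgment call, and with small groups formal tests have little power, so treat the output as one input among several.

```r
# Made-up stand-in data in the long format suggested in PC1.
long <- data.frame(
  group = factor(
    rep(c("control", "low", "medium", "high"), times = c(5, 4, 5, 4)),
    levels = c("control", "low", "medium", "high")
  ),
  weeks_alive = c(90, 95, 100, 85, 102,
                  60, 72, 81, 66,
                  55, 40, 78, 62, 70,
                  45, 50, 38, 58)
)

fit <- aov(weeks_alive ~ group, data = long)

# Normality: Q-Q plot and Shapiro-Wilk test on the model residuals.
qqnorm(residuals(fit))
qqline(residuals(fit))
sw <- shapiro.test(residuals(fit))
print(sw)

# Equal variances: compare group standard deviations, then Bartlett's test.
print(tapply(long$weeks_alive, long$group, sd))
bt <- bartlett.test(weeks_alive ~ group, data = long)
print(bt)
```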
PC3. Replicate the ANOVA analysis
Estimate an ANOVA using aov() to test the global hypothesis of a difference between the groups.
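A minimal sketch of the aov() call pattern follows, using made-up stand-in data (not the real values) and assuming the group / weeks_alive column names from PC1.

```r
# Made-up stand-in data in the long format suggested in PC1.
long <- data.frame(
  group = factor(
    rep(c("control", "low", "medium", "high"), times = c(5, 4, 5, 4)),
    levels = c("control", "low", "medium", "high")
  ),
  weeks_alive = c(90, 95, 100, 85, 102,
                  60, 72, 81, 66,
                  55, 40, 78, 62, 70,
                  45, 50, 38, 58)
)

# One-way ANOVA: survival time as a function of group membership.
fit <- aov(weeks_alive ~ group, data = long)

# summary() prints the ANOVA table (Df, Sum Sq, Mean Sq, F value, Pr(>F)).
anova_table <- summary(fit)
print(anova_table)

# The table is stored as the first element of the summary object.
f_stat  <- anova_table[[1]][["F value"]][1]
p_value <- anova_table[[1]][["Pr(>F)"]][1]
```

The degrees of freedom in the table (number of groups minus one, and number of observations minus number of groups) are a quick sanity check that the reshaping in PC1 worked as intended.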
SQ4. Report and interpret your results
Make sure to call summary() on the output of your aov() command.
PC4. Estimate differences in means
After performing an ANOVA, people sometimes do t-tests between specific groups to test/estimate differences in means. In this case, you should do a t-test comparing the average survival time of mice that received no RD40 with that of mice that received any (i.e., at least a small amount). Next, run a t-test between the high-dosage group and the control group.
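The two comparisons could be set up along the following lines. This is a sketch on made-up stand-in data (not the real values); the group labels and the exposure column are names introduced here for illustration.

```r
# Made-up stand-in data in the long format suggested in PC1.
long <- data.frame(
  group = factor(
    rep(c("control", "low", "medium", "high"), times = c(5, 4, 5, 4)),
    levels = c("control", "low", "medium", "high")
  ),
  weeks_alive = c(90, 95, 100, 85, 102,
                  60, 72, 81, 66,
                  55, 40, 78, 62, 70,
                  45, 50, 38, 58)
)

# Comparison 1: no RD40 (control) versus any RD40 (all other groups).
long$exposure <- factor(
  ifelse(long$group == "control", "none", "any"),
  levels = c("none", "any")
)
tt_any <- t.test(weeks_alive ~ exposure, data = long)
print(tt_any)

# Comparison 2: high dosage versus control only.
high_vs_control <- droplevels(subset(long, group %in% c("control", "high")))
tt_high <- t.test(weeks_alive ~ group, data = high_vs_control)
print(tt_high)
```

By default t.test() runs Welch's test, which does not assume equal variances; passing var.equal = TRUE gives the pooled-variance version instead. Which one is appropriate ties back to your answer to SQ3.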
SQ5. Report and interpret your results
Empirical paper questions
We'll continue our apparent focus on blogs with the following questions about the Sweetser and Metzgar paper.
EQ1. Interpret the results re: RQ4
(a) What is the unit of analysis? What is the dependent variable? The independent variable? What are the levels or groups being compared in the ANOVA?
(b) Clearly state the null hypothesis being tested. What is the alternative hypothesis?
(c) Summarize or restate the results in statistical terms. Explain what these results mean in substantive terms.
(d) How convincing do you find these results? What should we be taking away?
EQ2. Interpret the results re: RQ5
Answer the same (a)-(d) questions as you did for RQ4 above, but with RQ5.
EQ3. Interpret the results re: RQ6
Answer the same (a)-(d) questions as you did for RQs 4-5 above, but with RQ6.