# Statistics and Statistical Programming (Winter 2021)/Problem set 12

## OpenIntro Questions

Complete the following exercises from OpenIntro §7: 7.12, 7.24, 7.26, 7.42, 7.44, 7.46

## Programming challenges (and statistical questions)

This week's programming challenges are all about analyzing continuous data. For most of this, I'd like you to replicate some of the analysis done in this paper (note: I do not think you need to read it deeply to answer the questions below):

Lagakos, S., & Mosteller, F. (1981). A case study of statistics in the regulatory process: the FD&C Red No. 40 experiments. Journal of the National Cancer Institute, 66(1), 197–212. [PDF]

Overall, the goal of this research was to understand whether/how doses of red dye number 40 affect the survival of mice (and, by extension, humans).

• Download the dataset by clicking through on the "Red Dye Number 40" link on this webpage. You'll find that it's not in an ideal setup. It's an Excel file (XLS) with a series of columns labeled X1 through X4. Yikes! If you look at the website with the data and/or Table 1 in the paper, you should be able to figure out what each column stands for.
• Import the data into R and get to work on reshaping the dataset. I think a good format would be a data frame with two columns: `group` (or `dose`) and `weeks_alive` but whatever you choose is fine.
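A minimal sketch of the reshaping step, assuming the spreadsheet has already been imported as a wide data frame with columns `X1` to `X4` padded with `NA` where groups differ in size. The values below are made up for illustration (they are not the real data), the name `wide` is hypothetical, and the column-to-dose labels are an assumption you should check against Table 1.

```r
# Toy stand-in for the imported spreadsheet. These values are MADE UP;
# in practice you would import the XLS file (e.g., with readxl::read_excel()).
wide <- data.frame(
  X1 = c(70, 77, 83, 87, 92),
  X2 = c(49, 60, 63, 67, NA),
  X3 = c(30, 37, 56, 65, 76),
  X4 = c(34, 36, 48, 65, NA)
)

# stack() collapses the columns into one value column plus a factor
# recording which column each value came from.
long <- stack(wide)
names(long) <- c("weeks_alive", "group")

# Drop the NA padding left over from the shorter columns.
long <- long[!is.na(long$weeks_alive), ]

# ASSUMPTION: this column-to-dose mapping is illustrative only;
# confirm it against the data website and Table 1 in the paper.
long$group <- factor(long$group,
                     levels = c("X1", "X2", "X3", "X4"),
                     labels = c("control", "low", "medium", "high"))

str(long)
```

Base R's `stack()` is only one option here; `tidyr::pivot_longer()` would work as well if you prefer the tidyverse.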

### PC2. Summarize the data

Using the two columns you just created, create summary statistics and visualizations for the dataset as a whole and for each of the groups. These descriptive analyses should give you a sense of the shape of the data and relationships across groups.
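One way to approach this, sketched on a toy long-format data frame. The data frame name `mice`, the group labels, and all values are made up for illustration; substitute your own reshaped data.

```r
# Toy long-format data (MADE-UP values, not the real dataset).
mice <- data.frame(
  group = factor(rep(c("control", "low", "medium", "high"),
                     times = c(5, 4, 5, 4)),
                 levels = c("control", "low", "medium", "high")),
  weeks_alive = c(70, 77, 83, 87, 92,
                  49, 60, 63, 67,
                  30, 37, 56, 65, 76,
                  34, 36, 48, 65)
)

# Summary of the dataset as a whole.
summary(mice$weeks_alive)

# Per-group sample sizes, means, and standard deviations.
group_n     <- tapply(mice$weeks_alive, mice$group, length)
group_means <- tapply(mice$weeks_alive, mice$group, mean)
group_sds   <- tapply(mice$weeks_alive, mice$group, sd)
cbind(n = group_n, mean = group_means, sd = group_sds)

# Side-by-side boxplots give a quick view of spread and skew by group.
boxplot(weeks_alive ~ group, data = mice,
        xlab = "Group", ylab = "Weeks alive")
```

Histograms or density plots per group would be reasonable alternatives to the boxplot.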

#### SQ1. Discuss your descriptive analysis

Be sure to interpret anything noteworthy.

#### SQ2. State hypotheses

The plan here is to use ANOVA to evaluate whether there is a difference in survival time between the groups and then t-tests to compare the average survival times across some specific groups (see PC4 below for more details on which groups). State null and alternative hypotheses that correspond to these tests.

#### SQ3. Address assumptions for the tests

Identify any assumptions you may need to make to conduct the ANOVA and the t-tests. Do these tests seem appropriate here? Why (not)?

### PC3. Replicate the ANOVA analysis

Fit an ANOVA model using `aov()` to test the global hypothesis of a difference between the groups.

#### SQ4. Report and interpret your ANOVA results

(Note: Make sure to call `summary()` on the output of your `aov()` command.)
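A minimal sketch of the `aov()` and `summary()` pattern on toy data. The data frame `mice` and its values are made up for illustration, so the numbers it prints say nothing about the real experiment.

```r
# Toy long-format data (MADE-UP values, not the real dataset).
mice <- data.frame(
  group = factor(rep(c("control", "low", "medium", "high"),
                     times = c(5, 4, 5, 4)),
                 levels = c("control", "low", "medium", "high")),
  weeks_alive = c(70, 77, 83, 87, 92,
                  49, 60, 63, 67,
                  30, 37, 56, 65, 76,
                  34, 36, 48, 65)
)

# One-way ANOVA: does mean survival differ across groups?
fit <- aov(weeks_alive ~ group, data = mice)

# Printing fit alone shows sums of squares only;
# summary() adds the F statistic and p-value.
fit_summary <- summary(fit)
fit_summary

# The ANOVA table is the first element of the summary object.
anova_table <- fit_summary[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```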

### PC4. Estimate differences in means

After performing an ANOVA, people sometimes do t-tests between specific groups to test/estimate differences-in-means. In this case, you should run a t-test comparing the average survival time of mice that received no RD40 with that of mice that received any (i.e., at least a small amount). Next, run a t-test between the high-dosage group and the control group.
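The two comparisons could be set up along these lines. This is a sketch on made-up data: the data frame `mice`, the `any_dose` indicator, and the group labels are hypothetical names chosen for illustration. Note that `t.test()` runs Welch's unequal-variance test by default.

```r
# Toy long-format data (MADE-UP values, not the real dataset).
mice <- data.frame(
  group = factor(rep(c("control", "low", "medium", "high"),
                     times = c(5, 4, 5, 4)),
                 levels = c("control", "low", "medium", "high")),
  weeks_alive = c(70, 77, 83, 87, 92,
                  49, 60, 63, 67,
                  30, 37, 56, 65, 76,
                  34, 36, 48, 65)
)

# Comparison 1: no dye vs. any dye.
# Collapse the three dosed groups into a single "any" category.
mice$any_dose <- factor(ifelse(mice$group == "control", "none", "any"),
                        levels = c("none", "any"))
tt_any <- t.test(weeks_alive ~ any_dose, data = mice)
tt_any

# Comparison 2: high dose vs. control.
# Subset to the two groups and drop the unused factor levels.
high_vs_control <- droplevels(subset(mice, group %in% c("control", "high")))
tt_high <- t.test(weeks_alive ~ group, data = high_vs_control)
tt_high

# Estimated difference in means (control minus high), plus the
# test statistic and p-value stored in the result object.
diff_high <- unname(tt_high$estimate[1] - tt_high$estimate[2])
tt_high$statistic
tt_high$p.value
```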

#### SQ5. Report and interpret your t-test results

Make sure to include the estimated difference in means as well as the test statistic and p-value.

#### SQ6. Multiple comparisons

Now, let's imagine that you wanted to test for differences in average survival time across all of the possible pairings of groups in the study. Should you adjust for multiple comparisons? Why (not)? If so, how would you go about it?
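If you decide an adjustment is warranted, base R offers a couple of ways to do it. The sketch below, on made-up data with a hypothetical data frame `mice`, shows a Bonferroni correction via `pairwise.t.test()` and Tukey's HSD via `TukeyHSD()`. These are two common options, not the only defensible ones.

```r
# Toy long-format data (MADE-UP values, not the real dataset).
mice <- data.frame(
  group = factor(rep(c("control", "low", "medium", "high"),
                     times = c(5, 4, 5, 4)),
                 levels = c("control", "low", "medium", "high")),
  weeks_alive = c(70, 77, 83, 87, 92,
                  49, 60, 63, 67,
                  30, 37, 56, 65, 76,
                  34, 36, 48, 65)
)

# Four groups give choose(4, 2) = 6 pairwise comparisons.
n_comparisons <- choose(nlevels(mice$group), 2)

# Option 1: all pairwise t-tests with Bonferroni-adjusted p-values.
pw <- pairwise.t.test(mice$weeks_alive, mice$group,
                      p.adjust.method = "bonferroni")
pw$p.value

# Option 2: Tukey's Honest Significant Differences on the fitted ANOVA.
tukey <- TukeyHSD(aov(weeks_alive ~ group, data = mice))
tukey
```

Other adjustment methods (for example `"holm"`) can be passed to `p.adjust.method`; see `?p.adjust` for the full list.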

## Empirical Paper Questions

Note: Realistically, I don't think we're going to have time in class to talk through these so I've moved these to Statistics and Statistical Programming (Winter 2021)/Problem set 13.