Statistics and Statistical Programming (Spring 2019)/Problem Set: Week 5

Programming Challenges

This week we'll work with the full dataset from which I drew the 20 group samples you analyzed in Weeks 2 and 3.

PC0. The dataset is available as a TSV file in the directory week_05 in the data repository for the course. Note that a TSV file is tab delimited, not comma delimited (it is otherwise similar to a CSV file). Go ahead and inspect the data and load it into R (Hint: You'll want to use the read.delim() function).
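
As a rough illustration of the loading step, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back with read.delim(). The file contents and the column names x and group are made up for the example; substitute the path to the actual file in week_05.

```r
# Hypothetical stand-in for the course file: a tiny tab-delimited file
# written to a temporary location so the example runs without the real data.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("x\tgroup", "1.5\t1", "2.5\t1", "4.0\t2"), tmp)

# read.delim() assumes a header row and tab-separated fields by default.
d <- read.delim(tmp)

str(d)   # column names and types
head(d)  # first few rows
```

Running str() and head() right after loading is a quick way to confirm that the columns were split on tabs and parsed as the types you expect.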

PC1. Calculate the mean of the variable x in the full dataset. Go back to your Week 3 problem set and revisit the mean you calculated for x. Be prepared to discuss the conceptual relationship of these two means to each other.

PC2. Again, using the variable x from your Week 3 data, compute the 95% confidence interval for the mean of this vector in two ways:

(a) By "hand" (in R is fine) using the normal formula for standard error $\left(\frac{\sigma}{\sqrt{n}}\right)$. (Bonus challenge: Complete this by writing a function that calculates a confidence interval for the mean of any numeric vector.)
(b) Using an appropriate built-in R function (see this week's R lecture materials for a relevant example).
(c) The results from (a) and (b) should be the same or very close. After reading OpenIntro, can you explain why they might not be exactly the same?
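
One possible shape for the two calculations in PC2 is sketched below. The helper name ci_mean and the simulated vector are assumptions made for illustration (use your own Week 3 x instead), and t.test() is only one candidate for the built-in function; the lecture materials may use a different one.

```r
# Hypothetical helper: confidence interval for the mean of a numeric vector,
# using the sample standard deviation in the standard error formula and a
# normal critical value.
ci_mean <- function(v, level = 0.95) {
  v <- v[!is.na(v)]                  # drop missing values
  se <- sd(v) / sqrt(length(v))      # standard error of the mean
  z <- qnorm(1 - (1 - level) / 2)    # about 1.96 when level = 0.95
  c(lower = mean(v) - z * se, upper = mean(v) + z * se)
}

set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)    # made-up data standing in for Week 3 x

ci_hand <- ci_mean(x)
ci_builtin <- t.test(x)$conf.int     # interval reported by a built-in test

ci_hand
ci_builtin
```

Printing both intervals side by side makes it easy to see how closely they agree for your own data.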

PC3. Compare the mean of x from your Week 3 sample, along with your confidence interval, to the population mean (the mean of x in the Week 5 dataset). Is the true mean inside your confidence interval? Should you find this surprising? Why or why not? Be prepared to discuss the relationship of these values to each other.

PC4. Let's look beyond the mean. Compare the distribution from your sample of x to the true population of x. Draw histograms and compute other descriptive and summary statistics. What do you notice? Be prepared to discuss and explain any differences.

PC5. Calculate the conditional mean of x for each of the groups in the population and the standard deviation of this distribution of conditional means. Compare this standard deviation to the answers you calculated in PC2 part (a) above. Explain the relationship between these values.
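
A minimal sketch of the conditional-mean calculation, assuming the data frame has a grouping column called group (the real column name may differ) and using simulated values in place of the course data:

```r
set.seed(1)

# Simulated stand-in for the population: 20 groups of 100 observations each.
# The column names and the distribution of x are assumptions for this example.
pop <- data.frame(
  group = rep(1:20, each = 100),
  x = rexp(2000, rate = 0.1)
)

# Mean of x within each group (one value per group).
group_means <- tapply(pop$x, pop$group, mean)

# Spread of those group means.
sd(group_means)
```

tapply() is one of several base R options here; aggregate() or a loop over the unique group values would give the same group means.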

PC6. I want you to run a simple simulation that demonstrates a fundamental insight of statistics. Please see the R lecture materials from last week for ideas about how to do this (but note that there are some differences between that example and this programming challenge).

(a) Create a vector of 10,000 randomly generated numbers that are uniformly distributed between 0 and 9.
(b) Calculate the mean of that vector. Draw a histogram of the distribution.
(c) Create 100 random samples of 2 items each from your randomly generated data and take the mean of each sample. Create a new vector that contains those means. Describe/display the distribution of those means.
(d) Repeat (c), but with 10 items in each sample instead of 2. Then repeat (c) again with 100 items in each sample. Be ready to describe how the histogram changes as the sample size increases. (Bonus challenge: Write a function to complete this part.)
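
The steps in PC6 could be sketched roughly as follows. This reads "uniformly distributed between 0 and 9" as continuous values from runif(); drawing integers with sample(0:9, ...) would be another reasonable reading. The helper name sample_means is hypothetical.

```r
set.seed(2019)

# (a) 10,000 draws from a continuous uniform distribution on [0, 9].
population <- runif(10000, min = 0, max = 9)

# (b) Mean of the full vector.
mean(population)

# Hypothetical helper for (c) and (d): draw n_samples random samples of the
# given size (without replacement) and return the mean of each one.
sample_means <- function(v, size, n_samples = 100) {
  replicate(n_samples, mean(sample(v, size)))
}

means_2 <- sample_means(population, 2)
means_10 <- sample_means(population, 10)
means_100 <- sample_means(population, 100)

# Spread of the sample means at each sample size.
sapply(list(n2 = means_2, n10 = means_10, n100 = means_100), sd)
```

Passing population and each vector of means to hist() gives the histograms the challenge asks you to compare.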

PC7. Compare the results from PC6 with those in the example simulation from the Week 4 R lecture materials. What fundamental statistical principle is illustrated by these simulations?

Statistical Questions

Exercises from OpenIntro §5

SQ0. Any questions or clarifications from the OpenIntro text or lecture notes?

SQ1. Exercise 5.16, which is a set of True/False questions

SQ2. Exercise 5.28, which is about diamonds

SQ3. Exercise 5.30, which is also about diamonds

SQ4. Exercise 5.48, which is about work hours and education

SQ5. Exercise 5.52, which is another set of True/False questions about ANOVA

Reinhart §1

SQ6.

Empirical Paper Questions

These questions all concern the Sweetser and Metzgar paper.

EQ1. For RQ1 explain:

(a) What is the unit of analysis? What is the dependent variable? The independent variable? What are the levels or groups being compared in the ANOVA?
(b) Clearly state the null hypothesis being tested. What is the alternative hypothesis?
(c) Summarize or restate the results in statistical terms. Explain what these results mean in substantive terms. How convincing do you find these results? What should we be taking away?

EQ2. Do the same as above but for RQ4.

EQ3. ...for RQ5.

EQ4. ...for RQ6.