Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 5

== Programming Challenges ==


: '''PC0.''' I've provided the full dataset from which I drew each of your samples in a TSV file in the directory <code>week_05</code> in the class assignment git repository. These are '''tab''' delimited, not comma delimited. This is also a common format. Go ahead and load it into R (''HINT: <code>read.delim()</code> is your friend; see the first sketch after this list.''). Take the mean of the variable <code>x</code> in that dataset. That is the true population mean — the thing we were creating estimates of in week 2 and week 3.
: '''PC1.''' Go back to the dataset I distributed for [[Statistics and Statistical Programming (Winter 2017)/Problem Set: Week 3|the week 3 problem set]]. You've already computed the mean for this in week 2. You should compute the 95% confidence interval for the variable <code>x</code> in two ways (''a sketch of both approaches appears after this list''):
:* (a) By hand using the normal formula for standard error <math>(\frac{\sigma}{\sqrt{n}})</math>.
:* (b) Using a built-in R function. These numbers should be the same or very close. Can you explain why they might not be exactly the same?
:* (c) Compare the mean for your sample, and your confidence interval, to the true population mean. Is the true mean inside your confidence interval?
: '''PC2.''' Compare the distribution from your sample of <code>x</code> to the true population. Draw a histogram and compute other descriptive and summary statistics (''a short sketch appears after this list''). What do you notice? Be ready to talk for a minute or two about the differences. Why is your mean different?
: '''PC3.''' Compute the mean of <code>y</code> from the true population and then create the mean and confidence interval from the <code>y</code> in your sample. Is the true mean inside your confidence interval?
: '''PC4.''' I want you to run a simple simulation that demonstrates one of the most fundamental insights of statistics.
:* (a) Create a vector of 10,000 randomly generated numbers that are uniformly distributed between 0 and 9.
:* (b) Take the mean and the standard deviation of that vector. Draw a histogram. Does it look uniformly distributed?
:* (c) Create 100 random samples of 2 items each from your randomly generated data and take the mean of each one. Create a new vector that contains those means. Show the distribution of those means (''HINT: use a histogram.'').
:* (d) Do (c) again except with 10 items in each sample instead of 2. Then do (c) again except with 100 items. Be ready to describe how the histogram changes. (''HINT: You'll make me very happy if you write a function to do this; one possible sketch appears after this list.'')
: '''PC5.''' Do PC4 again but with random data drawn from a normal distribution (<math>N(\mu=42, \sigma=42)</math>) instead of a uniform distribution. How are your results different from those in PC4?
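
For PC0, here is a minimal sketch of loading a tab-delimited file and taking a mean. The file name below is an assumption, not the real name of the file; substitute whatever the TSV file in <code>week_05</code> is actually called.

<pre>
## PC0 sketch: load a tab-delimited file and take the mean of x.
## ASSUMPTION: the file name below is a placeholder; use whatever the
## TSV file in the week_05 directory is actually called.
pop <- read.delim("week_05/population.tsv")

## the true population mean of x
mean(pop$x)
</pre>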
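
For PC1, one possible sketch of both confidence intervals. It assumes your week 3 sample is loaded as a data frame called <code>my.sample</code> (a name I made up). The by-hand version plugs the sample standard deviation into the standard error formula and uses the normal critical value 1.96; <code>t.test()</code> is one built-in that reports a 95% interval, and it uses the t distribution, which is one reason the two answers can differ slightly.

<pre>
## PC1 sketch. ASSUMPTION: my.sample is the week 3 sample loaded as a
## data frame with the variable of interest in column x.
x <- my.sample$x

## (a) by hand: mean plus or minus 1.96 standard errors
se <- sd(x) / sqrt(length(x))
mean(x) + c(-1.96, 1.96) * se

## (b) one built-in option: t.test() reports a 95% confidence interval
t.test(x)$conf.int
</pre>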
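
For PC2, a short sketch of comparing your sample to the population. It assumes the objects <code>pop</code> and <code>my.sample</code> from the sketches above.

<pre>
## PC2 sketch. ASSUMPTION: pop and my.sample exist as in the sketches above.
hist(my.sample$x)     # distribution of your sample
hist(pop$x)           # distribution of the full population

summary(my.sample$x)
summary(pop$x)
sd(my.sample$x)
sd(pop$x)
</pre>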
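
For PC4 (and, with one line changed, PC5), one possible sketch written as a function, per the hint in (d). The function name <code>sample.means</code> and its arguments are illustrative, not part of the assignment, and <code>runif()</code> is only one reading of "uniformly distributed between 0 and 9."

<pre>
## PC4 sketch: the sampling distribution of the mean from uniform data.
## ASSUMPTION: runif() (continuous uniform) is one reading of "uniformly
## distributed between 0 and 9"; sample(0:9, 10000, replace = TRUE) is
## the discrete alternative.
pop.unif <- runif(10000, min = 0, max = 9)

## (b) mean, standard deviation, and a histogram of the raw numbers
mean(pop.unif)
sd(pop.unif)
hist(pop.unif)

## (c)/(d) draw num.samples samples of sample.size items each and
## return the vector of their means (function name is illustrative)
sample.means <- function(population, sample.size, num.samples = 100) {
    sapply(1:num.samples, function(i) mean(sample(population, sample.size)))
}

hist(sample.means(pop.unif, 2))
hist(sample.means(pop.unif, 10))
hist(sample.means(pop.unif, 100))

## PC5: the same procedure with normally distributed data, e.g.:
## pop.norm <- rnorm(10000, mean = 42, sd = 42)
</pre>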


== Questions on Gelman and Stern Paper ==


: '''Q5:''' First, walk us through the result visualized in Figure 1. Explain and interpret the result for us. Now go back to the blockquote on page 329 and, by referencing the figure, explain why Gelman and Stern think this is a good example to illustrate their point about the difference between statistically significant and non-significant results.
: '''Q6:''' Move on to the study about EMF. Walk us through Figure 2. First explain the basic result and then explain why Gelman and Stern think that Figure 2b is better than 2a.
: '''Q7:''' In the paper's abstract, Gelman and Stern describe their point as distinct from three other well-known problems: that statistical significance is not the same as practical importance, that dichotomization into significant/non-significant encourages dismissing observed differences, and that significance thresholds are arbitrary. Summarize why these are important issues in your own words (and ideally, with examples) and explain how Gelman and Stern's key critique is different.