Editing Data Into Insights (Spring 2021)/R4DS Chapter 12

One of the most time-consuming and trickiest parts of data analysis is actually getting the data imported, "cleaned", and organized in such a way that you can actually perform analyses and create visualizations.

So far in the class we have been using datasets that are well-organized and well-annotated. In the real world, a very important first step in any project is working to understand where the data you are analyzing comes from, what the measures mean, why data is missing, etc. In addition, it often requires reorganizing data.

There are different paradigms for how data is organized, but I highly recommend using "tidy" data, especially if you plan to do work in R. [https://r4ds.had.co.nz/tidy-data.html Chapter 12 in R4DS] explains this paradigm, but the big idea is that each row is an observation, and each column is a variable. The focus of this chapter is how to manipulate data that isn't organized in this way so that it is "tidy".

NOTE: I have made some minor changes to some of the questions, so please use the versions here. The goal is to help you to not be tempted to copy and paste solutions without thinking through the code.


== 12.2 ==

2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

    Extract the number of TB cases per country per year.
    Extract the matching population per country per year.
    Divide cases by population, and multiply by 10000.
    Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

NOTE: Only work on this for ~20 minutes. The big idea is to understand why different representations are easier or harder to work with in R.

3. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?


== 12.3 ==

1. Why are pivot_longer() and pivot_wider() not perfectly symmetrical?
Carefully consider the following example:

<syntaxhighlight lang="R">
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
</syntaxhighlight>

(Hint: look at the variable types and think about column names.)


2. Why does this code fail?

<syntaxhighlight lang="R">
table4a %>% 
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
#> Error: Can't subset columns that don't exist.
#> ✖ Locations 1999 and 2000 don't exist.
#> ℹ There are only 3 columns.
</syntaxhighlight>

3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

<syntaxhighlight lang="R">
people <- tribble(
  ~name,             ~variable,  ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
</syntaxhighlight>

4. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

<syntaxhighlight lang="R">
preg <- tribble(
  ~pregnant, ~m, ~f,
  "yes",     NA,    10,
  "no",      20,    12
)
</syntaxhighlight>

== 12.4 ==

1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

<syntaxhighlight lang="R">
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
</syntaxhighlight>

2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?


== 12.6 ==


1. In this case study I set values_drop_na = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

2. What happens if you neglect the mutate() step? (mutate(names_from = stringr::str_replace(key, "newrel", "new_rel")))

3. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

4. For each country, year, and sex compute the total number of cases of TB. Make a visualization of how cases have changed in India, China, South Africa, and Indonesia over time and by gender.

Hint: filter(country %in% c('India','China','South Africa','Indonesia'))