Not logged in
Talk
Contributions
Create account
Log in
Navigation
Main page
About
People
Publications
Teaching
Resources
Research Blog
Wiki Functions
Recent changes
Help
Licensing
Page
Discussion
Edit
View history
Editing
Data Into Insights (Spring 2021)/R4DS Chapter 12
From CommunityData
Jump to:
navigation
,
search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
One of the most time-consuming and trickiest parts of data analysis is actually getting the data imported, "cleaned", and organized in such a way that you can actually perform analyses and create visualizations. So far in the class we have been using datasets that are well-organized and well-annotated. In the real world, a very important first step in any project is working to understand where the data you are analyzing comes from, what the measures mean, why data is missing, etc. In addition, it often requires reorganizing data. There are different paradigms for how data is organized, but I highly recommend using "tidy" data, especially if you plan to do work in R. [https://r4ds.had.co.nz/tidy-data.html Chapter 12 in R4DS] explains this paradigm, but the big idea is that each row is an observation, and each column is a variable. The focus of this chapter is how to manipulate data that isn't organized in this way so that it is "tidy". NOTE: I have made some minor changes to some of the questions, so please use the versions here. The goal is to help you to not be tempted to copy and paste solutions without thinking through the code. == 12.2 == 2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations: Extract the number of TB cases per country per year. Extract the matching population per country per year. Divide cases by population, and multiply by 10000. Store back in the appropriate place. Which representation is easiest to work with? Which is hardest? Why? NOTE: Only work on this for ~20 minutes. The big idea is to understand why different representations are easier or harder to work with in R. 3. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first? == 12.3 == 1. Why are pivot_longer() and pivot_wider() not perfectly symmetrical? Carefully consider the following example: <syntaxhighlight lang="R"> stocks <- tibble( year = c(2015, 2015, 2016, 2016), half = c( 1, 2, 1, 2), return = c(1.88, 0.59, 0.92, 0.17) ) stocks %>% pivot_wider(names_from = year, values_from = return) %>% pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return") </syntaxhighlight> (Hint: look at the variable types and think about column names.) 2. Why does this code fail? <syntaxhighlight lang="R"> table4a %>% pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases") #> Error: Can't subset columns that don't exist. #> β Locations 1999 and 2000 don't exist. #> βΉ There are only 3 columns. </syntaxhighlight> 3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value? <syntaxhighlight lang="R"> people <- tribble( ~name, ~variable, ~value, #-----------------|--------|------ "Phillip Woods", "age", 45, "Phillip Woods", "height", 186, "Phillip Woods", "age", 50, "Jessica Cordero", "age", 37, "Jessica Cordero", "height", 156 ) </syntaxhighlight> 4. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables? <syntaxhighlight lang="R"> preg <- tribble( ~pregnant, ~m, ~f, "yes", NA, 10, "no", 20, 12 ) </syntaxhighlight> == 12.4 == 1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets. <syntaxhighlight lang="R"> tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% separate(x, c("one", "two", "three")) tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% separate(x, c("one", "two", "three")) </syntaxhighlight> 2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE? == 12.6 == 1. In this case study I set values_drop_na = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? Whatβs the difference between an NA and zero? 2. What happens if you neglect the mutate() step? (mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))) 3. I claimed that iso2 and iso3 were redundant with country. Confirm this claim. 4. For each country, year, and sex compute the total number of cases of TB. Make a visualization of how cases have changed in India, China, South Africa, and Indonesia over time and by gender. Hint: filter(country %in% c('India','China','South Africa','Indonesia'))
Summary:
Please note that all contributions to CommunityData are considered to be released under the Attribution-Share Alike 3.0 Unported (see
CommunityData:Copyrights
for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
Do not submit copyrighted work without permission!
To protect the wiki against automated edit spam, we kindly ask you to solve the following CAPTCHA:
Cancel
Editing help
(opens in new window)
Tools
What links here
Related changes
Special pages
Page information