In the Tidy Data paper, we saw kind of “domain language” for data preparation. Domain languages like this help us express concepts which are otherwise clumsy and hard to express. However, the terminology in the paper has changed over time as it has been implemented in the Tidyverse libraries, so it is important to understand the underlying concepts behind the terminology.
We will explore the Tidyverse by loading each library separately, namely readr
, magrittr
dplyr
, tidyr
, and ggplot2
. In practice, you may simply load these and other Tidyverse libraries with library(tidyverse)
.
If you haven’t already, you should install the Tidyverse
# note, eval=FALSE means that this is not evaluated in the normal knitting process
# You should run this in the console
install.packages("tidyverse")
So far we have been using load()
to load RData files. We’ve seen many examples where this contains a data table/frame and in the kobe
dataset, we saw that it can also contain functions like calc_streak()
. We also saw some examples of read.csv()
(base R) and read_csv()
. Now that we are learning about the Tidyverse, let’s use the former. (Note: remember that the dot ‘.’ naming convention correlates strongly with base-R code and underscores ’_’ correlates strongly with tidy-approach code).
In order to use read_csv
, we first need to load the readr
library.
library(readr)
We need to set where to find the data file (you may omit the setwd
line if the data file is in the same directory as this RMarkdown file).
setwd("~/Downloads")
sir_tibble <- read_csv("daily_infections.csv")
## Parsed with column specification:
## cols(
## susceptible = col_double(),
## infected = col_double(),
## recovered = col_double(),
## date = col_date(format = "")
## )
The output of the last chunk that is echoed is a specification of the data’s columns, susceptible
, infected
, recovered
, and date
. These represent the number of susceptible, infected, and recovered people in a town of 100 people.
The data is stored in the form of a “tibble”, which is short for “tidy table”. When you type the name of the tibble, it will print a shortened version of the table, like we did with head()
and a normal, base-R table
sir_tibble
If you want to convert back to a base-R table, you can run the following. Note that this chunk has the option eval=FALSE
so it won’t show all the long output (50 rows)
as.data.frame(sir_tibble)
Also, remember earlier in lecture we showed the object oriented class that the tibble and data frames are derived from. The output for the tibble shows that it is an example of multiple inheritance. If you don’t know what that is, it’s not super important for our purposes, but you can google it if you are curious.
class(sir_tibble)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(as.data.frame(sir_tibble))
## [1] "data.frame"
The pipe operator %>%
is one of the innovations of the Tidyverse. It captures the idea that tidy tools should take tidy data as input and return tidy data as output. Thus everything flowing through such a pipe is tidy. The magrittr
library is the library that provides the pipe operator (named after the Belgian painter Rene Magritte, who is famous for his painting of a pipe with the caption, “this is not a pipe”).
knitr::include_graphics("MagritteTreacheryOfImages.jpg")
library(magrittr)
The idea is that you can “pipe” the date from one command to another. Recall from the paper that a “tidy” tool/function takes a data table/frame as an input and produces another table/frame as an output. Thus, as long as each tool/function is “tidy” we can connect them together using the magrittr
pipes. This is really just the same as calling functions with paretheses in base-R, but using pipes makes the flow of data clearer and solves the problem of having deeply nested parentheses that are hard to count. For example, in the following command, the sir_tibble
data frame is piped to the tidy function summarize
, which returns a data frame with the summary columns (in this example, there is only one row in the data frame returned by summarize
). Summarize is in the dplyr
library.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sir_tibble %>% summarize(n=n(),
start=min(date),
end=max(date),
population=max(susceptible,
infected,
recovered))
Note that I’ve added new-lines and indentation to keep the code clean. The indentation should be mostly automatic when you use RStudio and hit return after a comma.
The base-R way of calling this would simply give sir_table as an argument in the parentheses of summarize
. This is still pretty readable, but you can imagine with more nesting, the parentheses will get annoying and hard to read.
summarize(sir_tibble,
n=n(),
start=min(date),
end=max(date),
population=max(susceptible, infected, recovered))
To Tidy the data, first we’ll try the base-R way. We’ve seen this already a few times in the hands-on activities.
# first duplicate the dates 3x
dates_combined <- c(sir_tibble$date,
sir_tibble$date,
sir_tibble$date)
# then "melt" the 3 people count columns
counts_combined <- c(sir_tibble$susceptible,
sir_tibble$infected,
sir_tibble$recovered)
# next create a disease state categorical variable
n <- dim(sir_tibble)[1]
disease_state <- c(rep("susceptible", n),
rep("infected", n),
rep("recovered", n))
sir_tible_base_r_tidy <- tibble(date=dates_combined,
disease_state=disease_state,
count=counts_combined)
Now that the date is tidy, we can use functions like group_by
, also in the dplyr
library.
sir_tible_base_r_tidy %>%
group_by(disease_state) %>%
summarize(n=n(), start=min(date), end=max(date))
Keeping with the approach in class, we’ve started with Base-R and moved on to Tidy R. The tidyr
library has tools to do this, namely pivot_longer
(older versions of the library called it melt
but as far as I can tell they have been moving from “cute” names to more “businessy” names). pivot_wider
would undo this transformation.
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
sir_tibble %>% pivot_longer(cols=c("susceptible", "infected","recovered"))
Just specifying the columns results in new columns labelled “name” and “value”. To give custom names:
sir_tibble %>%
pivot_longer(cols=c("susceptible", "infected","recovered"),
names_to = "disease_state", values_to = "count")
The first library we saw in the Tidyverse was ggplot2
. We’ve seen a couple of examles where it helps to have data in a tidy format to make it easier to plot. Just to make sure we remember, let’s try to plot the messy table before it was “melted”:
library(ggplot2)
ggplot(sir_tibble, aes(x=date)) +
geom_line(aes(y=susceptible), color="green") +
geom_line(aes(y=infected), color="red") +
geom_line(aes(y=recovered), color="blue")
Remember that the theory behind ggplot involves mapping data variables to visual variables using the aes
aesthetic mapping. Note that the color setting is outside of the aes
aesthetic mapping and the x
visual variable is set in the ggplot
command so that it will be a default across the following geom_line
commands.
The benefit of the tidy approach is that we will be able to use color as a visual variable for the disease_state
data variable, so the ggplot command will only require one geom_line
command. I’m going to use the magritter
pipes to show how commands may be chained together.
sir_tibble %>%
pivot_longer(cols=c("susceptible", "infected","recovered"),
names_to = "disease_state", values_to = "count") %>%
ggplot(aes(x=date)) + geom_line(aes(y=count,
color=disease_state))
Notice now the colors are assigned automatically and there’s a legend included automatically as well. To get exactly the same colors we had before we can do the following:
sir_tibble %>%
pivot_longer(cols=c("susceptible", "infected","recovered"),
names_to = "disease_state", values_to = "count") %>%
ggplot(aes(x=date)) + geom_line(aes(y=count,
color=disease_state)) +
scale_color_manual(values=c("red", "blue", "green"))
We can use the tidyr
function pivot_wider
to convert to our messy table with counts in three columns:
sir_tibble %>%
pivot_longer(cols=c("susceptible", "infected","recovered"),
names_to = "disease_state",
values_to = "count") %>%
pivot_wider(names_from="disease_state",
values_from="count")