Why Tidy?

For the past few days, we’ve been grapping with a data set that we’ve almost wrestled down to size. The only problem is, what we’re left with is a big table with one score per row, and a lot of NAs. This doesn’t seem efficient, nor does it seem like a good format for plotting. Clearly, we need to TIDY THE DATA.

Long versus Wide

As we talked about a few weeks ago, there are both long and wide data formats. Long data formats have one observation and one measurement per row. So, multiple rows constitude a single observation.

long from R4DS

These kinds of data sets are great for plotting summary information for each group and each variable aggregated together. Think of a bar plot with averages for different species taken across different sites.

Wide data has a every measurement in a single observation in a single row.

wide from R4DS

This kind of data is ideal for things like scatterplots of one measurement against another, with each observation as a single data point.

Long to Wide

Quite often, data is recorded in a long format for efficiency. Let’s talk about this with respect to a data set of mammals. For every site a researcher visited, they wrote down the species of mammals they say, and the density of those mammals. As this was just a running tally, the data ended up in a long format:

## The mammals data frame:
##   site                taxon density
## 1    1      Suncus etruscus     6.2
## 2    1       Sorex cinereus     5.2
## 3    2     Myotis nigricans    11.0
## 4    3 Notiosorex crawfordi     1.2
## 5    3     Scuncus etruscus     9.4
## 6    3     Myotis nigricans     9.6

But, what if we wanted to plot the abundance of, say, Myostis nigricans against Scuncus etruscus - we can’t do that from here. We need to spread the data into a wide format. For that, there is the tidyr library and it’s spread function.

spread works simply in that it take a data set, and you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.

library(tidyr)

m_wide <- mammals %>%
  spread(taxon, density)

m_wide
##   site Myotis nigricans Notiosorex crawfordi Scuncus etruscus
## 1    1               NA                   NA               NA
## 2    2             11.0                   NA               NA
## 3    3              9.6                  1.2              9.4
##   Sorex cinereus Suncus etruscus
## 1            5.2             6.2
## 2             NA              NA
## 3             NA              NA

Great! We have a wide data set. But - uh oh, what are all of those NAs doing there?

Filling in null values

We often have data sets like this. Ones where we have many measurements that we have not observed. Sometimes, they are NA - we just didn’t take them. Other times, such as in this data set, we want to fill them in with a sensible value, such as 0.

The simplest approach here would be to use the fill argument in spread.

m_wide_0 <- mammals %>%
  spread(taxon, density, fill=0)

m_wide_0
##   site Myotis nigricans Notiosorex crawfordi Scuncus etruscus
## 1    1              0.0                  0.0              0.0
## 2    2             11.0                  0.0              0.0
## 3    3              9.6                  1.2              9.4
##   Sorex cinereus Suncus etruscus
## 1            5.2             6.2
## 2            0.0             0.0
## 3            0.0             0.0

Perfect! Other times, though, we want to fill those in right into the long data format. For that we have the complete function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA. Oh, complete function… you complete me.

m_long_0 <- mammals %>%
  complete(site, taxon, fill=list(density=0))

m_long_0
## Source: local data frame [15 x 3]
## 
##     site                taxon density
##    (dbl)               (fctr)   (dbl)
## 1      1     Myotis nigricans     0.0
## 2      1 Notiosorex crawfordi     0.0
## 3      1     Scuncus etruscus     0.0
## 4      1       Sorex cinereus     5.2
## 5      1      Suncus etruscus     6.2
## 6      2     Myotis nigricans    11.0
## 7      2 Notiosorex crawfordi     0.0
## 8      2     Scuncus etruscus     0.0
## 9      2       Sorex cinereus     0.0
## 10     2      Suncus etruscus     0.0
## 11     3     Myotis nigricans     9.6
## 12     3 Notiosorex crawfordi     1.2
## 13     3     Scuncus etruscus     9.4
## 14     3       Sorex cinereus     0.0
## 15     3      Suncus etruscus     0.0

We could then take that new complete long data frame and spread it with abandon! Either way, now we can plot!

library(ggplot2)

ggplot(m_wide_0, mapping = aes(x=`Myotis nigricans`,
                               y=`Scuncus etruscus`)) +
  geom_point(size=1.4) +
  stat_smooth() +
  theme_bw()

OK, so, that’s not such an impressive plot, but you see what we’re getting at here…

Wide to Long

But what if your intrepid mammal researcher had been wise and recorded their data in a wide format (in truth, this is what I do all the time). And instead of a scatterplot, we wanted to compare the densities of all of the different species in a point-range plot? Well, then we’d have to go from a wide format to long.

To do this, tidyr offers the gather function, as in gather up your wide data into one nice long piece. Gather is a little trickier, as you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then…well, you have some options. You can either specify which columns are to be gathered up (which can take a while) or you can specify which columns you want to exclude. You can do fancy things, like specify all columns between two others with a :, but, honestly, I often specify which to exclude. Let’s see two examples.

m_long <- m_wide_0 %>%
  gather(Species_name, Density, -site)

m_long
##    site         Species_name Density
## 1     1     Myotis nigricans     0.0
## 2     2     Myotis nigricans    11.0
## 3     3     Myotis nigricans     9.6
## 4     1 Notiosorex crawfordi     0.0
## 5     2 Notiosorex crawfordi     0.0
## 6     3 Notiosorex crawfordi     1.2
## 7     1     Scuncus etruscus     0.0
## 8     2     Scuncus etruscus     0.0
## 9     3     Scuncus etruscus     9.4
## 10    1       Sorex cinereus     5.2
## 11    2       Sorex cinereus     0.0
## 12    3       Sorex cinereus     0.0
## 13    1      Suncus etruscus     6.2
## 14    2      Suncus etruscus     0.0
## 15    3      Suncus etruscus     0.0
#Another way of doing the same thing
m_long <- m_wide_0 %>%
  gather(Species_name, Density, `Myotis nigricans`:`Suncus etruscus`)

m_long
##    site         Species_name Density
## 1     1     Myotis nigricans     0.0
## 2     2     Myotis nigricans    11.0
## 3     3     Myotis nigricans     9.6
## 4     1 Notiosorex crawfordi     0.0
## 5     2 Notiosorex crawfordi     0.0
## 6     3 Notiosorex crawfordi     1.2
## 7     1     Scuncus etruscus     0.0
## 8     2     Scuncus etruscus     0.0
## 9     3     Scuncus etruscus     9.4
## 10    1       Sorex cinereus     5.2
## 11    2       Sorex cinereus     0.0
## 12    3       Sorex cinereus     0.0
## 13    1      Suncus etruscus     6.2
## 14    2      Suncus etruscus     0.0
## 15    3      Suncus etruscus     0.0

And now that we have a long format, let’s plot~

ggplot(m_long, mapping=aes(x=Species_name, y=Density)) +
  geom_point(size=2) +
  stat_summary(fun.data="mean_sdl", color="red")
## Warning: replacing previous import by 'ggplot2::unit' when loading 'Hmisc'
## Warning: replacing previous import by 'ggplot2::arrow' when loading 'Hmisc'
## Warning: replacing previous import by 'scales::alpha' when loading 'Hmisc'

OK - how would you apply these functions to the Axoltl data?

Hey, wait, these only work on one “Value” column…

You may have noticed that we’ve been throwing around just one value column. What if you have two - say, density and average height? Or three? Or more? In other words, a not-long-not-wide-but-in-between data set. These are common.

Well, tidyr isn’t flexible on that count. Instead, you need to unite those columns into something that’s easy to separate later down the line. Which we do with… regular expressions! Yes, they are everywhere. Let’s take an example of height in the mammals data set.

mamh <- mammals %>%
  mutate(height=rnorm(6,30,3))

So now we want to do everything above, but with density and height! So, we unite them, using a _ as our separator. We could have used anything, but I like _ because it’s used for so few other things with numbers. tidyr just needs to know the name of the new column, and then which columns will be united

mamh2 <- mamh %>%
  unite(measurement, density, height, sep="_")

mamh2
##   site                taxon          measurement
## 1    1      Suncus etruscus 6.2_35.5249261415182
## 2    1       Sorex cinereus  5.2_33.672784315605
## 3    2     Myotis nigricans  11_24.3065594737995
## 4    3 Notiosorex crawfordi 1.2_30.9904042233343
## 5    3     Scuncus etruscus   9.4_33.72128522902
## 6    3     Myotis nigricans 9.6_25.4838445938618

We can now proceed as normal. Let’s say, for example, that you wanted to fill in all of the empty species-site combos with zeroes. Here’s one workflow to do that.

mamh_long <- mamh2 %>%
  spread(taxon, measurement, fill="0_0") %>%
  gather(taxon, measurement, -site)

mamh_long
##    site                taxon          measurement
## 1     1     Myotis nigricans                  0_0
## 2     2     Myotis nigricans  11_24.3065594737995
## 3     3     Myotis nigricans 9.6_25.4838445938618
## 4     1 Notiosorex crawfordi                  0_0
## 5     2 Notiosorex crawfordi                  0_0
## 6     3 Notiosorex crawfordi 1.2_30.9904042233343
## 7     1     Scuncus etruscus                  0_0
## 8     2     Scuncus etruscus                  0_0
## 9     3     Scuncus etruscus   9.4_33.72128522902
## 10    1       Sorex cinereus  5.2_33.672784315605
## 11    2       Sorex cinereus                  0_0
## 12    3       Sorex cinereus                  0_0
## 13    1      Suncus etruscus 6.2_35.5249261415182
## 14    2      Suncus etruscus                  0_0
## 15    3      Suncus etruscus                  0_0

OK, great! We have all of the “0_0” entries (notice what I did there for the fill?), but, now we want to restore our old measurements. For that, we have separate which takes the relevant column, the new column names in a vector, and the pattern you match to split them.

mamh_long <- mamh_long %>%
  separate(measurement, into = c("density", "height"), sep="_")

mamh_long
##    site                taxon density           height
## 1     1     Myotis nigricans       0                0
## 2     2     Myotis nigricans      11 24.3065594737995
## 3     3     Myotis nigricans     9.6 25.4838445938618
## 4     1 Notiosorex crawfordi       0                0
## 5     2 Notiosorex crawfordi       0                0
## 6     3 Notiosorex crawfordi     1.2 30.9904042233343
## 7     1     Scuncus etruscus       0                0
## 8     2     Scuncus etruscus       0                0
## 9     3     Scuncus etruscus     9.4   33.72128522902
## 10    1       Sorex cinereus     5.2  33.672784315605
## 11    2       Sorex cinereus       0                0
## 12    3       Sorex cinereus       0                0
## 13    1      Suncus etruscus     6.2 35.5249261415182
## 14    2      Suncus etruscus       0                0
## 15    3      Suncus etruscus       0                0

One note - these new columns will be treated as characters (you did split a string after all), so you’ll need to do some cleanup

mamh_long <- mamh_long %>%
  mutate(density = as.numeric(density),
         height = as.numeric(height))

And there you go! You can now do exciting things like plot the density-height relationship, grouping by taxon, for example.

qplot(density, height, data=mamh_long, 
      color=taxon, geom="point") +
  stat_smooth(method="lm", fill=NA)

OK, to make sure you got separate and unite down, first, separate the taxon into two columns - Genus and species (taxon is currently a latin binomial).

Then, add four columsn to the original mammals frame. Now use any technique above (which may or may not involve unite) to fill in 0s for all four columns.