Today, this is your subject:
That’s right, it’s a data set on the effects of different diets on marmosets. You have been handed an excel file with a number of sheets of the same experiment done in different years, and you are expected to answer the following question:
How do different diets affect different species of Marmosets?
Let’s begin by looking at the data. What are the features of the data that are, shall we say, less than tidy? What are the different things we will have to do to clean individual sheets?
Now, let’s walk through the different steps of just loading up the first worksheet. Let’s begin by noticing that the first few lines do not contain the actual column names. Let’s look at what happens if we just load the first sheet
library(readxl)
marmoset_1987 <- read_excel("./monkey_weights.xlsx")
head(marmoset_1987)
## Source: local data frame [6 x 8]
##
## Response of Marmosets to Different Diets
## (chr) (chr)
## 1 1987 NA
## 2 Rainfall: 130cm NA
## 3 NA Gebus
## 4 NA Mico
## 5 NA NA
## 6 NA NA
## Variables not shown: structure(c("NA ", "NA ", "species ", "melanura", "NA
## ", (chr), structure(c("NA ", "Start Body Condition: Good", (chr),
## structure(c("NA ", "NA ", (chr), structure(c("NA ", "NA ", (chr),
## structure(c("NA ", "NA ", (chr), structure(c("NA ", "NA ", (chr)
Yuck. We can see the column headers are bad. But, there’s some nice meta-data in there as well. Line 2 has a year. Line 3 has some rainfall information in cm. And if we look at line 2 in general…
as.data.frame(marmoset_1987[2,])
## Response of Marmosets to Different Diets
## 1 Rainfall: 130cm <NA>
## structure(NA_character_, class = "AsIs")
## 1 <NA>
## structure("Start Body Condition: Good", class = "AsIs")
## 1 Start Body Condition: Good
## structure(NA_character_, class = "AsIs")
## 1 <NA>
## structure(NA_character_, class = "AsIs")
## 1 <NA>
## structure(NA_character_, class = "AsIs")
## 1 <NA>
## structure(NA_character_, class = "AsIs")
## 1 <NA>
In column 4, there’s a start body condition. So, loading this isn’t all bad. Let’s begin by saving those three values: Year, Rainfall (which is always in cm), and Body condition into variables. You’ll need to use regular expressions with gsub
to strip out the actual numbers in rainfall and just the actual status of the body condition. You’ll have to use row,column notation for this as well to get the values.
Now that we have the metadata saved in variables, we can reload the data, but now skipping all of those bad lines. Load it up and show me that you’ve loaded it properly.
Notice that last line keeps loading, despite being NA (go look). Huh. Get rid of it or any line for which the Diet is NA
Oops! One of your data entry people screwed and misspelled “Genus”. Fix it.
Oops! One of your data entry people screwed and misspelled didn’t fill the species all the way down. Fix it!
Uh oh - each subject doesn’t have a unique ID number! Give them one. It should be a unique ID that contains the year, as there will be other years. You might need to do some paste
-ing here. And it might be easier to use ye olde $
notation rather than mutate (although do it any way you want.) For example, I created IDs like 46_1987
To plot a time series, you’re going to want to have a long, not wide, format of weight gain. You also need to track individual subjects. So, first, gather up the data of weight gain for individual weeks. Perhaps call the key column Week
and the values Gain
. You also have a lot of columns to exclude, so the :
operator might come in handy, although it’s not necessary.
So, the gain column has measurement units in it. Yuck. Use separate
to split it into two columns - the actual Gain column and a units column. You’ll need to mutate your Gain column to make it as.numeric
.
Once that’s done, we need to recognize that 1 mg = 0.001g and 1ug = 0.0001g. Quite different. We need to convert the Gain column all into the same units - miligrams! There are a lot of ways to do this (some we’ll cover next week). For now, let’s make a function that takes a vector of unit types (“ug” or “mg”). For every “ug” it swaps it with 0.1. For every “mg” it swaps it with 1. Then it turns the output vector into a numeric vector and returns it. What’s great is that unknown measurements will be returned as NAs
Once you’ve done that, you can use mutate with your function to divide gain by whatever output value comes from your function to get a standardized weight change in mg. After you do that, use select
to get rid of the unsightly Units column.
Hrm. All of that Weight Gain After \\d week
stuff in the Week column has got to go!
OK, the last bit - remember when we saved year, body status, and rainfall? Add those in as columns to this new data set! (You’ll likely need to use $
)
OK, now that you’ve made this happen, what can we do here? Why not visualize the time-series of weight gain for individuals, highlighting species (or even taxon where you paste genus and species) and Diet. Can you draw any conclusions?
OK, now that you have your tidyr
mojo going, make a generalized function that can read in any sheet. Test it on each of the four - do you need to modify anything from your specific use-case above? You can do this in one big function, or perhaps try and split it into smaller sub-functions. Remember - when it doubt, THINK - pseudo-code out the workflow, make sub-functions, etc.
And, hey, maybe there are some other units in here (remember, centi- is .01g, or, 10mg, so you might have to futz with your unit function)
With a for loop to generate a master data set using the function(s) you have just written.
With your mega-data set, make a similar plot to that above (species and diet) but add some color for one of the other columsn - Year, Rainfall, and/or Body Condition. See anything? You are the analyst. Using tools at your disposal like summarize, etc., how would you create a visualizatio that tells us something about how diet affected monkies? There are a lot of answers here, so have fun!