Monkeying Around with Functions and Tidyr

2) Just one sheet

Now, let’s walk through the different steps of just loading up the first worksheet. Let’s begin by noticing that the first few lines do not contain the actual column names. Let’s look at what happens if we just load the first sheet

library(readxl)
marmoset_1987 <- read_excel("./monkey_weights.xlsx")

head(marmoset_1987)

## Source: local data frame [6 x 8]
## 
##   Response of Marmosets to Different Diets      
##                                      (chr) (chr)
## 1                                     1987    NA
## 2                          Rainfall: 130cm    NA
## 3                                       NA Gebus
## 4                                       NA  Mico
## 5                                       NA    NA
## 6                                       NA    NA
## Variables not shown: structure(c("NA ", "NA ", "species ", "melanura", "NA
##   ", (chr), structure(c("NA ", "Start Body Condition: Good", (chr),
##   structure(c("NA ", "NA ", (chr), structure(c("NA ", "NA ", (chr),
##   structure(c("NA ", "NA ", (chr), structure(c("NA ", "NA ", (chr)

2a) Meta-data

Yuck. We can see the column headers are bad. But, there’s some nice meta-data in there as well. Line 2 has a year. Line 3 has some rainfall information in cm. And if we look at line 2 in general…

as.data.frame(marmoset_1987[2,])

##   Response of Marmosets to Different Diets     
## 1                          Rainfall: 130cm <NA>
##   structure(NA_character_, class = "AsIs")
## 1                                     <NA>
##   structure("Start Body Condition: Good", class = "AsIs")
## 1                              Start Body Condition: Good
##   structure(NA_character_, class = "AsIs")
## 1                                     <NA>
##   structure(NA_character_, class = "AsIs")
## 1                                     <NA>
##   structure(NA_character_, class = "AsIs")
## 1                                     <NA>
##   structure(NA_character_, class = "AsIs")
## 1                                     <NA>

In column 4, there’s a start body condition. So, loading this isn’t all bad. Let’s begin by saving those three values: Year, Rainfall (which is always in cm), and Body condition into variables. You’ll need to use regular expressions with gsub to strip out the actual numbers in rainfall and just the actual status of the body condition. You’ll have to use row,column notation for this as well to get the values.

2b) Loading in a sensible data frame

Now that we have the metadata saved in variables, we can reload the data, but now skipping all of those bad lines. Load it up and show me that you’ve loaded it properly.

Notice that last line keeps loading, despite being NA (go look). Huh. Get rid of it or any line for which the Diet is NA

2c) Problem 1: Someone doesn’t know how to spell genus

Oops! One of your data entry people screwed and misspelled “Genus”. Fix it.

2d) Problem 2: Genus and species stop short

Oops! One of your data entry people screwed and misspelled didn’t fill the species all the way down. Fix it!

2e) Problem 3: Unique subject IDs

Uh oh - each subject doesn’t have a unique ID number! Give them one. It should be a unique ID that contains the year, as there will be other years. You might need to do some paste-ing here. And it might be easier to use ye olde $ notation rather than mutate (although do it any way you want.) For example, I created IDs like 46_1987

2f) Problem 4: Shape of the data

To plot a time series, you’re going to want to have a long, not wide, format of weight gain. You also need to track individual subjects. So, first, gather up the data of weight gain for individual weeks. Perhaps call the key column Week and the values Gain. You also have a lot of columns to exclude, so the : operator might come in handy, although it’s not necessary.

2g) Problem 5: mg and ug? What?

So, the gain column has measurement units in it. Yuck. Use separate to split it into two columns - the actual Gain column and a units column. You’ll need to mutate your Gain column to make it as.numeric.

Once that’s done, we need to recognize that 1 mg = 0.001g and 1ug = 0.0001g. Quite different. We need to convert the Gain column all into the same units - miligrams! There are a lot of ways to do this (some we’ll cover next week). For now, let’s make a function that takes a vector of unit types (“ug” or “mg”). For every “ug” it swaps it with 0.1. For every “mg” it swaps it with 1. Then it turns the output vector into a numeric vector and returns it. What’s great is that unknown measurements will be returned as NAs

Once you’ve done that, you can use mutate with your function to divide gain by whatever output value comes from your function to get a standardized weight change in mg. After you do that, use select to get rid of the unsightly Units column.

2h) Problem 6: The week column is awfully big

Hrm. All of that Weight Gain After \\d week stuff in the Week column has got to go!

2i) Problem 7: Where did that meta-data go?

OK, the last bit - remember when we saved year, body status, and rainfall? Add those in as columns to this new data set! (You’ll likely need to use $ )

Monkeying Around with Functions and Tidyr

Intro to Data Science for Biology

March 23, 2016

Marmosets

1) That’s UGLY!