The etherpad for today is: https://etherpad.wikimedia.org/p/buoy_function
One of the benefits of code is the ability to repeat an analysis or job without doing any additional work. We call this reusability. So far, we’ve only applied this to the idea that you can use code to generate or regenerate the same series of actions on a single task or data set.
But what if, within the task you’re doing, there’s something you want to do a lot? What if it’s the same thing, over and over again, in the service of something larger?
OK, that’s too abstract, so let’s take an example for today - wanting to load and process a batch of data files, doing the same operation to each one and merging the results into a larger data set.
Right off of Salem Sound sits a buoy, the National Data Buoy Center’s number 44013. It’s been there since 1986. And every hour it kicks out data on sea surface temperature, wave height, and more. Download the buoy data and take a gander at the different files.
Let’s say our task for today is to explore whether there are any long-term trends in temperature or wave height over time.
If we were to start with what we know how to do now, we might choose the following workflow: load each file, then use dplyr to select the columns we want, and rename them sensibly.
This is pretty sensible. And implementing it is straightforward.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#oops - notice that 99 is NA!
buoy_1986 <- read.csv("./buoydata/44013_1986.csv", na.strings="99")
#Now the data transformations
buoy_1986 <- buoy_1986 %>%
  select(YY, MM, DD, WVHT, WTMP) %>%
  rename(Year = YY,
         Month = MM,
         Day = DD,
         Wave_Height = WVHT,
         Temperature_c = WTMP) %>%
  group_by(Year, Month, Day) %>%
  summarise(Wave_Height = mean(Wave_Height, na.rm = TRUE),
            Temperature_c = mean(Temperature_c, na.rm = TRUE)) %>%
  ungroup()
#1987
buoy_1987 <- read.csv("./buoydata/44013_1987.csv", na.strings="99")
#Now the data transformations
buoy_1987 <- buoy_1987 %>%
  select(YY, MM, DD, WVHT, WTMP) %>%
  rename(Year = YY,
         Month = MM,
         Day = DD,
         Wave_Height = WVHT,
         Temperature_c = WTMP) %>%
  group_by(Year, Month, Day) %>%
  summarise(Wave_Height = mean(Wave_Height, na.rm = TRUE),
            Temperature_c = mean(Temperature_c, na.rm = TRUE)) %>%
  ungroup()
buoydata <- rbind(buoy_1986, buoy_1987)
glimpse(buoydata)
## Observations: 663
## Variables: 5
## $ Year (int) 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, ...
## $ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Day (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Wave_Height (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Temperature_c (dbl) 4.226087, 4.229167, 4.137500, 4.129167, 4.225000...
#1988....
OH MY GOD THAT IS A MASSIVE AMOUNT OF COPYING AND PASTING.
Just imagine the errors that can creep in.
So, all this time, you’ve been using functions. From mean() to sum() to rbind() to summarise(), each of those functions has arguments, and each of those functions, in a single line, hides a mountain of code. Just imagine the number of instructions buried under a single geom_point()!
So why not write your own? Functions have enormous benefits. In addition to reducing the sheer amount of code that piles up when you copy and paste large blocks, you 1) dramatically reduce the number of bugs possible, 2) write code that, if you discover a problem, you can fix in one place, and 3) can write code that is usable for a broad variety of applications - not just this one. Good functions are reusable - making the rest of your life easier. Write enough reusable functions for a particular set of tasks, and you have a package! But that is a topic for another time.
A function consists of a few pieces. Let’s look at an example.
my_function <- function(argument_1, argument_2 = 3, ...){
  output <- do_something(argument_1)
  output <- do_something_else(output, argument_2)
  output <- do_last_thing(output, ...)
  return(output)
}
OK, what is in there? First, note we save the function as an object. So, now, whenever I call my_function(), it’s a function! Second, note the function function (HA!) that begins the function. That’s word salad, but you get the point: you start a function with function followed by parentheses. The parentheses contain a set of arguments which will be used inside the body of the function (in this case argument_1 and argument_2). There is also this ... argument. It is a catch-all for any extra arguments, which get passed along to other functions inside of our function. No need to worry about that now, and often we don’t need it.
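To make the ... idea a little more concrete, here is a minimal sketch. The function mean_plus_one() is a made-up name for illustration only; it simply hands whatever extra arguments it receives on to mean().

```r
# Hypothetical wrapper: any extra arguments in ... are passed on to mean()
mean_plus_one <- function(x, ...){
  output <- mean(x, ...)
  return(output + 1)
}

# Without extra arguments, the NA propagates
mean_plus_one(c(1, 2, NA))                 # NA

# na.rm is not an argument of mean_plus_one, but ... forwards it to mean()
mean_plus_one(c(1, 2, NA), na.rm = TRUE)   # 2.5
```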
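Note that we also supply a default value for argument_2. This should be some sensible default, used whenever the caller does not supply a value, much like na.rm, which defaults to FALSE in functions such as mean() and sum() that we have used in the past.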
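As a quick sketch of how a default value behaves, consider the toy function below. The name times_n() and its default of 2 are assumptions made up for this example.

```r
# Hypothetical function: n has a default value, x does not
times_n <- function(x, n = 2){
  return(x * n)
}

times_n(5)          # n falls back to its default: 10
times_n(5, n = 3)   # the caller overrides the default: 15
```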
Next up we have the curly bracket, {. Curly brackets define the body of a function. All of the code between them is evaluated inside the function, and any objects created there, poof, disappear from memory after the function is executed.
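Here is a small sketch of that disappearing act. The function celsius_to_f() and the object temp_f are hypothetical names, and the check at the end assumes you do not already have something called temp_f in your workspace.

```r
# Hypothetical converter: temp_f is created inside the curly brackets
celsius_to_f <- function(temp_c){
  temp_f <- temp_c * 9 / 5 + 32   # lives only while the function runs
  return(temp_f)
}

celsius_to_f(4.3)   # 39.74

# The intermediate object is gone once the function has finished
# (FALSE, assuming no temp_f already exists in your workspace)
exists("temp_f")
```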
Last up we have the return function at the end. This is not always necessary, but generally good practice. A function will output the value of the last expression evaluated inside it. But… I find relying on that to be pretty dicey, and I like to write my functions knowing exactly what will be returned, when, and where. You can even have returns at other parts of the function under certain conditions, as we shall see.
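For a rough idea of what a conditional return can look like, here is one possible sketch. The function mean_or_na() is invented for illustration; it leaves early when there is nothing to average, as with a column that is entirely NA.

```r
# Hypothetical helper: return early if every value is missing
mean_or_na <- function(x){
  if(all(is.na(x))){
    return(NA)   # nothing to average, so leave the function here
  }
  return(mean(x, na.rm = TRUE))
}

mean_or_na(c(NA, NA))         # NA
mean_or_na(c(4.3, NA, 4.5))   # 4.4
```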
OK, given the above, write a function that will take a single number as an input, and output that number plus one. Let’s call it add_one.
What are the parts of the function? What does it need to have in it?
Here’s one example, but there may be others
add_one <- function(x){
  output <- x + 1
  return(output)
}
To show you how much you can strip things down
add_one <- function(x) x+1
OK, let’s come back to the buoy data. Let’s write a function that would, given an arbitrary year, load in a single file. We’ll modify the function more as we go on. Decision time - what arguments should our process_buoy function take?
For the sake of argument, let’s go with just year, as we know that the file name structure is pretty standard - 44013_year.csv. Now, write a function that takes an arbitrary year, and loads a file using read.csv. One thing that will help here is the paste or paste0 function, depending on how you want the pieces separated. Test it with 1986 as an example.
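If the difference between the two is fuzzy, this short sketch shows how they treat separators. The object names are just placeholders for the example.

```r
a_year <- 1986

# paste() puts a space between the pieces by default
with_spaces <- paste("44013_", a_year, ".csv")
with_spaces   # "44013_ 1986 .csv"

# paste0() glues the pieces together with no separator
no_spaces <- paste0("44013_", a_year, ".csv")
no_spaces     # "44013_1986.csv"

# paste() can do the same if you set sep yourself
same_thing <- paste("44013_", a_year, ".csv", sep = "")
```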
For example
process_buoy <- function(a_year){
  #file to read in
  filename <- paste0("./buoydata/44013_", a_year, ".csv")
  a_buoy <- read.csv(filename, na.strings = c("99", "999"))
  return(a_buoy)
}
head(process_buoy(1986))
## YY MM DD hh WD WSPD GST WVHT DPD APD MWD BAR ATMP WTMP DEWP VIS
## 1 86 1 1 0 210 9 11 NA NA NA NA 1009.2 5.8 4.3 NA NA
## 2 86 1 1 1 210 9 10 NA NA NA NA 1009.3 4.7 4.3 NA NA
## 3 86 1 1 2 220 8 10 NA NA NA NA 1009.2 4.0 4.3 NA NA
## 4 86 1 1 3 220 6 8 NA NA NA NA 1009.3 3.9 4.3 NA NA
## 5 86 1 1 4 210 6 8 NA NA NA NA 1009.0 3.9 4.3 NA NA
## 6 86 1 1 5 210 6 8 NA NA NA NA 1008.4 3.9 4.3 NA NA
So, now if all we wanted to do was read files that had identical structures and combine them, we could do something like so.
buoydata <- rbind(process_buoy(1986),
process_buoy(1986))
str(buoydata)
## 'data.frame': 16336 obs. of 16 variables:
## $ YY : int 86 86 86 86 86 86 86 86 86 86 ...
## $ MM : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DD : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hh : int 0 1 2 3 4 5 6 7 8 9 ...
## $ WD : int 210 210 220 220 210 210 210 210 210 220 ...
## $ WSPD: int 9 9 8 6 6 6 7 7 6 5 ...
## $ GST : int 11 10 10 8 8 8 8 9 8 6 ...
## $ WVHT: num NA NA NA NA NA NA NA NA NA NA ...
## $ DPD : num NA NA NA NA NA NA NA NA NA NA ...
## $ APD : num NA NA NA NA NA NA NA NA NA NA ...
## $ MWD : logi NA NA NA NA NA NA ...
## $ BAR : num 1009 1009 1009 1009 1009 ...
## $ ATMP: num 5.8 4.7 4 3.9 3.9 3.9 3.9 3.9 3.6 3.5 ...
## $ WTMP: num 4.3 4.3 4.3 4.3 4.3 4.3 4.4 4.4 4.4 4.3 ...
## $ DEWP: logi NA NA NA NA NA NA ...
## $ VIS : logi NA NA NA NA NA NA ...
Wow! Much easier!
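With more than two years, typing out every call gets old again. One possible way to go further is sketched below, under some loud assumptions: the two tiny CSV files are made-up stand-ins written to a temporary folder so the example runs on its own, and read_buoy() is a hypothetical variant of the reading function that takes the folder as an extra argument.

```r
# Made-up stand-ins for the buoy files, written to a temporary folder
data_dir <- file.path(tempdir(), "buoydata")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(YY = 86, WTMP = c(4.3, 99)),
          file.path(data_dir, "44013_1986.csv"), row.names = FALSE)
write.csv(data.frame(YY = 87, WTMP = c(5.1, 5.2)),
          file.path(data_dir, "44013_1987.csv"), row.names = FALSE)

# Hypothetical variant of the reader, with the folder as an argument
read_buoy <- function(a_year, data_dir){
  filename <- file.path(data_dir, paste0("44013_", a_year, ".csv"))
  a_buoy <- read.csv(filename, na.strings = c("99", "999"))
  return(a_buoy)
}

# Apply the function to each year, then stack the resulting data frames
years <- 1986:1987
all_buoys <- lapply(years, read_buoy, data_dir = data_dir)
buoydata <- do.call(rbind, all_buoys)

nrow(buoydata)   # 4 rows: two from each toy file
```

Here lapply() calls the function once per year and returns a list of data frames, and do.call(rbind, ...) binds that list together in one step. This only works cleanly when every file has the same columns.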