The etherpad for today is: https://etherpad.wikimedia.org/p/buoy_function

A Copy-Paste Problem

One of the benefits of code is the ability to repeat an analysis or job without doing any additional work. We call this reusability. So far, we’ve only applied this to the idea that you can use code to generate or regenerate the same series of actions on a single task or data set.

But what if, within the task you’re doing, there’s something you want to do a lot? And what if it’s the same thing, over and over again, in the service of something larger?

OK, that’s too abstract, so let’s take an example for today - loading a batch of data files, applying the same processing steps to each one, and merging the results into one larger data set.

Exploring Local Climate Change

Right off of Salem Sound sits a buoy, the National Data Buoy Center’s station 44013. It’s been there since 1986, and it kicks out data on sea surface temperature, wave height, and more every hour. Download the buoy data and take a gander at the different files.

Let’s say our task for today is to explore whether there are any long-term trends in temperature or wave height over time.

The old way of doing things

If we were to start with what we know how to do now, we might choose the following workflow:

  1. Load in a data file.
  2. Use dplyr to select the columns we want, and rename them sensibly.
  3. Maybe aggregate the file down to daily means - hourly is a lot of data!
  4. Combine the cleaned data with other already loaded data.
  5. Rinse and repeat.

This is pretty sensible. And implementing it is straightforward.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#oops - notice that 99 is NA!
buoy_1986 <- read.csv("./buoydata/44013_1986.csv", na.strings="99")

#Now the data transformations
buoy_1986 <- buoy_1986 %>%
  select(YY, MM, DD, WVHT, WTMP) %>%
  rename(Year = YY,
         Month = MM,
         Day = DD,
         Wave_Height = WVHT,
         Temperature_c = WTMP) %>%
  group_by(Year, Month, Day) %>%
  summarise(Wave_Height = mean(Wave_Height, na.rm = TRUE),
            Temperature_c = mean(Temperature_c, na.rm = TRUE)) %>%
  ungroup()

#1987
buoy_1987 <- read.csv("./buoydata/44013_1987.csv", na.strings="99")

#Now the data transformations
buoy_1987 <- buoy_1987 %>%
  select(YY, MM, DD, WVHT, WTMP) %>%
  rename(Year = YY,
         Month = MM,
         Day = DD,
         Wave_Height = WVHT,
         Temperature_c = WTMP) %>%
  group_by(Year, Month, Day) %>%
  summarise(Wave_Height = mean(Wave_Height, na.rm = TRUE),
            Temperature_c = mean(Temperature_c, na.rm = TRUE)) %>%
  ungroup()

buoydata <- rbind(buoy_1986, buoy_1987)

glimpse(buoydata)
## Observations: 663
## Variables: 5
## $ Year          (int) 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, ...
## $ Month         (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Day           (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Wave_Height   (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Temperature_c (dbl) 4.226087, 4.229167, 4.137500, 4.129167, 4.225000...
#1988....

OH MY GOD THAT IS A MASSIVE AMOUNT OF COPYING AND PASTING.

Just imagine the errors that can creep in.

There is a better way - functions!

So, all this time, you’ve been using functions. From mean to sum to rbind to summarise, each of those functions has arguments, and each of those functions, in a single line, hides a mountain of code. Just imagine the number of instructions buried under a single geom_point!

So why not write your own? Functions have enormous benefits. In addition to cutting down the sheer amount of code that piles up when you copy and paste large blocks, you 1) dramatically reduce the number of possible bugs, 2) write code that, if you discover a problem, you can tweak in one place, and 3) can write code that is usable for a broad variety of applications - not just this one. Good functions are reusable - making the rest of your life easier. Write enough reusable functions for a particular set of tasks, and you have a package! But that is a topic for another time.

The anatomy of a function

A function consists of a few pieces. Let’s look at an example.

my_function <- function(argument_1, argument_2 = 3, ...){
  
  output <- do_something(argument_1)
  
  output <- do_something_else(output, argument_2)
  
  output <- do_last_thing(output, ...)
  
  return(output)
  
}

OK, what is in there? First, note that we save the function as an object. So, now, whenever I call my_function(), it’s a function! Second, note the function function (HA!) that begins the function. That’s word salad, but you get the point: you start a function with function followed by a pair of parentheses. The parentheses contain a set of arguments which will be used inside of the body of the function (in this case argument_1 and argument_2). There is also this ... argument. It is a catch-all for extra arguments that will be passed along to other functions inside of our function. No need to worry about that now, and often we don’t need it.

Note that we also supply a default value for argument_2 - this is the value used when the caller does not supply one, so it should be some sensible choice, just as mean and sum fall back on na.rm = FALSE unless you say otherwise.
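To make defaults and ... a bit more concrete, here is a minimal sketch. The function scaled_mean and its scale_by argument are made up for illustration; the only real function involved is mean, which receives whatever arrives through the dots.

```r
# Hypothetical helper: rescale a numeric vector, then average it.
# scale_by has a default value; ... is forwarded untouched to mean().
scaled_mean <- function(x, scale_by = 1, ...){

  scaled <- x * scale_by

  output <- mean(scaled, ...)

  return(output)
}

temps <- c(4.3, 4.4, NA, 4.5)

scaled_mean(temps)                   # scale_by falls back to 1; the NA makes the result NA
scaled_mean(temps, na.rm = TRUE)     # na.rm travels through ... to mean()
scaled_mean(temps, 2, na.rm = TRUE)  # override the default for scale_by
```

Notice that scaled_mean never mentions na.rm by name. It just hands along anything extra, which is what makes ... handy when you wrap another function.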

Next up we have the curly bracket - { - curly brackets define the body of a function. The arguments, and any objects created inside of those curly brackets, exist only in there and then, poof, disappear from memory after the function is executed.
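A quick way to convince yourself of the disappearing act is the sketch below. The names add_two and intermediate are invented for this example, and the exists() check assumes a fresh session with no object called intermediate already in your workspace.

```r
add_two <- function(x){
  # intermediate is created inside the body, so it is local to this call
  intermediate <- x + 1

  output <- intermediate + 1

  return(output)
}

add_two(5)

# Ask R whether the local object survived the call
exists("intermediate")
```

The call returns 7, but intermediate is nowhere to be found afterwards. Only what you return makes it out of the function.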

Last up we have the return function at the end. This is not always necessary, but generally good practice. Without it, a function outputs the value of the last expression it evaluates. But…I find that to be pretty dicey, and I like to write my functions with a sense of knowing what will be returned when and where. You can even have returns at other parts of the function under certain conditions, as we shall see.
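As a small illustration of a return that fires under a condition, consider this sketch. The function safe_mean is hypothetical, written only to show the pattern.

```r
# Hypothetical example: a mean that bails out early on non-numeric input
safe_mean <- function(x){

  # Early return: leave the function right here if x is not numeric
  if(!is.numeric(x)){
    return(NA)
  }

  output <- mean(x, na.rm = TRUE)

  return(output)
}

safe_mean(c(1, 2, 3, NA))
safe_mean("not a number")
```

The first call runs all the way to the bottom and gives 2. The second never reaches mean at all, because the first return ends the function and hands back NA.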

Your first function

OK, given the above, write a function that will take a single number as an input, and output that number plus one. Let’s call it add_one.

What are the parts of the function? What does it need to have in it?

Here’s one example, but there may be others:

add_one <- function(x){
  output <- x+1
  
  return(output)
}

To show you how much you can strip things down:

add_one <- function(x) x+1
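Whichever version you wrote, it is worth checking it on a value or two. A quick sketch of such a check, using the stripped-down version:

```r
add_one <- function(x) x+1

add_one(3)
add_one(c(1, 5))  # + works element by element, so vectors are fine too
```

The first call gives 4 and the second gives 2 and 6. We never wrote anything special to handle vectors; that comes for free from the + inside the function.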

Your second function - buoys!

OK, let’s come back to the buoy data. Let’s write a function that will, given an arbitrary year, load in a single file. We’ll modify the function more as we go on. Decision time - what arguments should our process_buoy function take?

For the sake of argument, let’s go with just year, as we know that the file name structure is pretty standard - 44013_year.csv. Now, write a function that takes an arbitrary year, and loads a file using read.csv. One thing that will help here is the paste or paste0 function, depending on whether you want a separator between the pieces. Test it with 1986 as an example.
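If the difference between paste and paste0 is fuzzy, this little sketch may help. The variable names spaced, tight, and also_tight are just for illustration.

```r
a_year <- 1986

# paste() puts a space between the pieces by default
spaced <- paste("./buoydata/44013_", a_year, ".csv")
spaced
# "./buoydata/44013_ 1986 .csv"

# paste0() glues the pieces together with nothing in between
tight <- paste0("./buoydata/44013_", a_year, ".csv")
tight
# "./buoydata/44013_1986.csv"

# paste() can do the same if you set the separator yourself
also_tight <- paste("./buoydata/44013_", a_year, ".csv", sep = "")
```

Those stray spaces in the first result would make read.csv look for a file that does not exist, so for building file names you want either paste0 or sep = "".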

For example

process_buoy <- function(a_year){
  
  #file to read in
  filename <- paste0("./buoydata/44013_", a_year, ".csv")
  
  a_buoy <- read.csv(filename, na.strings = c("99", "999"))
  
  return(a_buoy)
  
}

head(process_buoy(1986))
##   YY MM DD hh  WD WSPD GST WVHT DPD APD MWD    BAR ATMP WTMP DEWP VIS
## 1 86  1  1  0 210    9  11   NA  NA  NA  NA 1009.2  5.8  4.3   NA  NA
## 2 86  1  1  1 210    9  10   NA  NA  NA  NA 1009.3  4.7  4.3   NA  NA
## 3 86  1  1  2 220    8  10   NA  NA  NA  NA 1009.2  4.0  4.3   NA  NA
## 4 86  1  1  3 220    6   8   NA  NA  NA  NA 1009.3  3.9  4.3   NA  NA
## 5 86  1  1  4 210    6   8   NA  NA  NA  NA 1009.0  3.9  4.3   NA  NA
## 6 86  1  1  5 210    6   8   NA  NA  NA  NA 1008.4  3.9  4.3   NA  NA

Victory for a function

So, now if all we wanted to do was read files that had identical structures and combine them, we could do something like so.

buoydata <- rbind(process_buoy(1986), 
                  process_buoy(1987))
str(buoydata)
## 'data.frame':    16336 obs. of  16 variables:
##  $ YY  : int  86 86 86 86 86 86 86 86 86 86 ...
##  $ MM  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DD  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hh  : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ WD  : int  210 210 220 220 210 210 210 210 210 220 ...
##  $ WSPD: int  9 9 8 6 6 6 7 7 6 5 ...
##  $ GST : int  11 10 10 8 8 8 8 9 8 6 ...
##  $ WVHT: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DPD : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ APD : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MWD : logi  NA NA NA NA NA NA ...
##  $ BAR : num  1009 1009 1009 1009 1009 ...
##  $ ATMP: num  5.8 4.7 4 3.9 3.9 3.9 3.9 3.9 3.6 3.5 ...
##  $ WTMP: num  4.3 4.3 4.3 4.3 4.3 4.3 4.4 4.4 4.4 4.3 ...
##  $ DEWP: logi  NA NA NA NA NA NA ...
##  $ VIS : logi  NA NA NA NA NA NA ...

Wow! Much easier!
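Typing out one process_buoy() call per year still gets old fast. One possible way around that, sketched here under a few assumptions, is to hand a vector of years to lapply and bind the results. To keep the sketch runnable anywhere, it writes two tiny stand-in files with invented values to a temporary folder, and it gives process_buoy a hypothetical data_dir argument so it can find them.

```r
# Set up two tiny stand-in files so the sketch runs anywhere.
# The values are invented; the real files have many more rows and columns.
data_dir <- file.path(tempdir(), "buoydata")
dir.create(data_dir, showWarnings = FALSE)

write.csv(data.frame(YY = 86, MM = 1, DD = 1:2, WTMP = c(4.3, 99)),
          file.path(data_dir, "44013_1986.csv"), row.names = FALSE)
write.csv(data.frame(YY = 87, MM = 1, DD = 1:2, WTMP = c(3.9, 4.0)),
          file.path(data_dir, "44013_1987.csv"), row.names = FALSE)

# Same idea as process_buoy above, plus a hypothetical data_dir argument
process_buoy <- function(a_year, data_dir = "./buoydata"){

  filename <- file.path(data_dir, paste0("44013_", a_year, ".csv"))

  a_buoy <- read.csv(filename, na.strings = c("99", "999"))

  return(a_buoy)
}

# One call per year, collected in a list, then stacked into one data frame
years <- 1986:1987
buoy_list <- lapply(years, process_buoy, data_dir = data_dir)
buoydata <- do.call(rbind, buoy_list)

str(buoydata)
```

Adding another year is now a matter of changing 1986:1987, not of pasting in another block. Note that rbind only works here because every file has the same columns.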