We’re going to work through a few principles of ggplot2 today to aid in viewing multivariate data sets and pursuing data excellence. Yesterday, we were introduced to two of the most common geometric representations of data - points and lines. We’re going to focus on those here, but will introduce a few others in the homework. So let’s start by loading up our plankton data set.

For today, let’s use the etherpad at https://etherpad.wikimedia.org/p/eA8ySHxdkW

library(readxl)
library(ggplot2)

plankton <- read_excel("../Data/EST-PR-PlanktonChemTax.xls", sheet=2)

What to look for?

One element of exploratory graphics we didn’t talk about is visualizing ALL THE DATA. Sometimes, you just want to look. Fortunately, R has a nice function for this - pairs. As we don’t want to look at the same relationships as Tuesday (that would be boring!), let’s scan around for some others. We’ll start with the 11th column, Temp, as that’s where the environmental and biological measurements begin, and go through the 22nd, leaving out the Comments column at the very end of the data frame.

#which columns are biological information?
names(plankton)
##  [1] "Date"                   "Site"                  
##  [3] "Latitude"               "Longitude"             
##  [5] "Distance"               "SampleName"            
##  [7] "SubsampleName"          "BottleName"            
##  [9] "SampleType"             "Volume"                
## [11] "Temp"                   "Salinity"              
## [13] "TotalChlA"              "DiatomsandChrysophytes"
## [15] "Cryptophytes"           "Chlorophytes"          
## [17] "Dinoflagellates"        "Euglenophytes"         
## [19] "Prasinophytes"          "Haptophytes"           
## [21] "Prymnesiophytes"        "Cyanobacteria"         
## [23] "Comments"
#plot them
pairs(plankton[,11:22])
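A quick aside: positional indices like 11:22 break silently if a column is ever added or reordered. Here is a minimal sketch of choosing the same kind of range by column name instead. The toy data frame and its values are made up for illustration (they are not the plankton data), and only a handful of stand-in columns are included.

```r
# Toy stand-in for the plankton data; all values here are invented for illustration
set.seed(42)
toy <- data.frame(
  Site          = rep(c("A", "B"), each = 10),
  Temp          = rnorm(20, mean = 15, sd = 3),
  Salinity      = rnorm(20, mean = 30, sd = 2),
  Chlorophytes  = rlnorm(20),
  Cyanobacteria = rlnorm(20)
)

# Find the positions of the first and last columns we want by name,
# then take every column name in between
first_col    <- match("Temp", names(toy))
last_col     <- match("Cyanobacteria", names(toy))
measure_cols <- names(toy)[first_col:last_col]

# Same call as before, but driven by names rather than hard-coded numbers
pairs(toy[, measure_cols])
```

With the real data, the same idea would use "Temp" and "Cyanobacteria" as the endpoints of the range passed to pairs.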

Whoa! That’s a lot of information. But, it looks like our best bet for a nice relationship is going to be…

Introducing Chlorophytes, Cyanobacteria, and Log Scales

So, Chlorophytes are small unicellular green algae, and Cyanobacteria are tiny blue-green algae - bacteria really - that produce a LOT of the oxygen on this earth. Important little guys. Let’s look at the relationship between the two of them for the next little while.

But before we do - one of the great things about ggplot2 is that we build plots in steps. We can often take the same plot parameters and put different data, geoms, colors, scales, themes, and more on top of them. To avoid having to retype an ENTIRE plot over and over again, we can save a plot as an object and add to it later. So, let’s make a basic scatterplot that we’ll build from for today, and call it base_plot.

base_plot <- ggplot(data=plankton, 
                    aes(x=Chlorophytes, y=Cyanobacteria)) +
  geom_point()

#Putting in the name of the object
#makes it display. You can also use
#print(base_plot)
base_plot
## Warning: Removed 3 rows containing missing values (geom_point).
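
To make the "save it, then add to it" idea concrete, here is a minimal sketch that builds a base plot, adds a theme and axis labels to the saved object, and counts the rows that geom_point would drop for having missing values. It assumes a made-up toy data frame with three NAs planted in it, not the real plankton data; the object names themed_plot and n_missing are just for illustration.

```r
library(ggplot2)

# Toy stand-in data with three missing values; invented for illustration
set.seed(1)
toy <- data.frame(
  Chlorophytes  = c(rlnorm(17), NA, NA, NA),
  Cyanobacteria = rlnorm(20)
)

# The saved starting point, built the same way as base_plot above
base_plot <- ggplot(data = toy,
                    aes(x = Chlorophytes, y = Cyanobacteria)) +
  geom_point()

# Add to the saved object instead of retyping the whole plot.
# base_plot itself is left unchanged.
themed_plot <- base_plot +
  theme_bw() +
  labs(x = "Chlorophytes", y = "Cyanobacteria")

# geom_point drops any row with an NA in x or y, which is what
# triggers the "Removed ... rows" warning. Count those rows directly:
n_missing <- sum(!complete.cases(toy[, c("Chlorophytes", "Cyanobacteria")]))
n_missing
```

Typing themed_plot at the console displays the new version, while base_plot still shows the original, so the same starting point can be reused for as many variations as you like.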