Lab 3 - Data visualisation with ggplot2

February 17, 2016

Etherpad: https://etherpad.wikimedia.org/p/KcLlXgpj6b

use the text area to post questions or take notes
use the chat to post when you are done or have questions

Learning objectives

Explain the difference between categorical and continuous data.
Interpret a data table using a meta-data sheet.
Use ggplot2 to create a scatterplot that shows the relationship between two variables.
Use colours, point size, point shape, and facets to include more than two variables in a ggplot.

Things to remember

check your working directory and use the folders you have set up for class
adhere to a style guide
comment your code
R Markdown lets you integrate text and code in one document

Basics - intro to the grammar of graphics

Here is a brief description of the basic building blocks of a ggplot.

argument	description of component
data	as a data.frame (long format!)
aesthetic (aes)	mapping variables to visual properties: position, colour, line type, size
geom	actual visualisation of the data
scale	maps data values to aesthetic values such as colour, size, and shape (shown as legends and axes)
stat	statistical transformations and summaries of the data (e.g., line fits)
facet	splitting data across panels based on different subsets of the data
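
As a minimal sketch of how these building blocks combine, the example below uses the mpg data set that ships with ggplot2. That data set is an assumption made for illustration only; it is not the plankton data. Each line adds one component from the table above.

```r
library(ggplot2)

# Assumption: mpg is ggplot2's built-in example data set, used only for illustration.
p <- ggplot(data = mpg,                                      # data: a long-format data.frame
            aes(x = displ, y = hwy, colour = drv)) +         # aes: map variables to position and colour
  geom_point() +                                             # geom: draw one point per row
  scale_colour_brewer(palette = "Dark2") +                   # scale: choose how drv values become colours
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # stat: fitted line for each drv group
  facet_wrap(~ year)                                         # facet: one panel per model year

p
```

Removing or swapping a single line (for example, dropping facet_wrap()) is a quick way to see what each component contributes.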

Warm-up exercises:

1. Using the meta-data on worksheet one of the data file EST-PR-PlanktonChemTax.xls, give a brief description (2-3 sentences), in your own words, of the plankton data set.

2. The meta-data worksheet also has a ‘Variable Descriptions’ section, much like the one you created for your first lab write-up of the semester. Using that section and, if needed, the other information on the meta-data sheet, identify at least two examples of categorical data and at least two examples of numerical data.

library(readxl)

Read the plankton data into R and take a look at the structure of the data.
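
One possible way to do this is sketched below. It assumes the file sits in your working directory and that the data table is on the second worksheet (the meta-data are on the first). Both are assumptions, so check the worksheet names and adjust as needed.

```r
library(readxl)

# Assumption: the file is in the working directory; change the path if it is not.
plankton_file <- "EST-PR-PlanktonChemTax.xls"

if (file.exists(plankton_file)) {
  # List the worksheet names to confirm which one holds the data
  print(excel_sheets(plankton_file))
  # Assumption: the data are on worksheet 2
  plankton <- read_excel(plankton_file, sheet = 2)
} else {
  message("File not found; check getwd() and your class folders.")
}
```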

Confirm that the data match what you expect, given the meta-data.

str(plankton)

It is important to sort out and check our data before plotting. While visualisation is a great way to detect problems in your data, it helps to start with a clean data.frame. Looking at the structure also reminds us what the data columns are called.

3. Use the function unique() to look at the SampleType. What do you notice? How do you think this could cause problems in analysis? Describe some possible solutions to prevent this problem before you get to the step of reading the data into R.
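
To see the kind of output to look for, here is a small sketch on a made-up vector. The values are hypothetical stand-ins; in the lab you would run the same calls on plankton$SampleType.

```r
# Hypothetical toy values standing in for plankton$SampleType
sample_type <- c("Whole water", "Wholewater", "<20", "< 20 um", "<20um", "Wholewater")

unique(sample_type)  # lists each distinct spelling once
table(sample_type)   # counts how often each spelling occurs
```

table() is a useful companion to unique() because a spelling that occurs only a handful of times often points to a typing inconsistency.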

# Install dplyr if you have not already done so for this class
# install.packages("dplyr")
library(dplyr)

# Collapse the inconsistent spellings of SampleType into one label each
plankton <- 
  plankton %>% 
  mutate(SampleType = replace(SampleType, SampleType == "Whole water", 
                              "Wholewater")) %>%
  mutate(SampleType = replace(SampleType, SampleType %in% c("<20", "< 20 um"), 
                              "<20um"))

One last step. Let’s create a subset of the data to work with today to keep things more manageable. Create a new object called plankton_sub that contains the first 164 rows of the plankton dataset.

plankton_sub <- plankton[1:164, ]

# Let's check what dates this covers in the data.
range(plankton_sub$Date)

## [1] "2003-04-16 UTC" "2006-12-05 UTC"

Time to plot!

library(ggplot2)

4. Given the examples we have seen, make this plot.

5. Create the following plots and briefly describe what you see.

6. What is the difference between colour and fill?
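
One way to explore this question is to draw the same bars twice, setting colour in one plot and fill in the other, and compare the results. The sketch below uses a made-up toy data frame (the names toy, group, and value are hypothetical), not the plankton data.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 2))

# Same bars, with "red" passed to colour in one plot and to fill in the other
p_colour <- ggplot(toy, aes(x = group, y = value)) +
  geom_bar(stat = "identity", colour = "red", fill = "white")

p_fill <- ggplot(toy, aes(x = group, y = value)) +
  geom_bar(stat = "identity", colour = "black", fill = "red")

p_colour
p_fill
```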