Joins and Geospatial Data

Map maker, map maper, make me a map!

Today we’re going to look at how we can use joins and geospatial data more explicitly to make maps. Maps are among the first data visualizations that ever occured. And some of the most powerful. They’re also one of the places where joins become incredibly important, as to put data on a map we have to join our data with a geospatial description of the map we want.

Death from Heart Disease

Today’s data set that we’ll be using is a data set of heart disease mortality from the CDC

#load the data and prep
#for some data manipulation
library(readxl)
library(dplyr)

heart_disease <- read_excel("./join_maps/hd_all.xlsx", 
                            na="Insufficient Data")

head(heart_disease)

## Source: local data frame [6 x 4]
## 
##     State  County Death_Rate FIPS_Code
##     (chr)   (chr)      (dbl)     (dbl)
## 1 Alabama Autauga      463.0      1001
## 2 Alabama Baldwin      391.4      1003
## 3 Alabama Barbour      533.1      1005
## 4 Alabama    Bibb      511.1      1007
## 5 Alabama  Blount      425.6      1009
## 6 Alabama Bullock      483.2      1011

OK, we see that we have state, county, and information on death. FIPS codes, FYI, are standardized county codes. We’ll be ignoring them.

Introducing: Maps

There are a LOT of ways to get map data into R. We’re going to begin with the simplest - grabbing it from an R package. ggplot2 works in tandem with the maps package to provide a few standardized sets of maps for easy plotting. Let’s take a look at one of counties in the U.S. lower 48.

#install the mapdata library if you
#don't have it
library(mapdata)

## Loading required package: maps

## Warning: package 'maps' was built under R version 3.2.3

## 
##  # ATTENTION: maps v3.0 has an updated 'world' map.        #
##  # Many country borders and names have changed since 1990. #
##  # Type '?world' or 'news(package="maps")'. See README_v3. #

library(ggplot2)

#map_data gets us one of the select maps
map_df <- map_data("county")

head(map_df)

##        long      lat group order  region subregion
## 1 -86.50517 32.34920     1     1 alabama   autauga
## 2 -86.53382 32.35493     1     2 alabama   autauga
## 3 -86.54527 32.36639     1     3 alabama   autauga
## 4 -86.55673 32.37785     1     4 alabama   autauga
## 5 -86.57966 32.38357     1     5 alabama   autauga
## 6 -86.59111 32.37785     1     6 alabama   autauga

OK, we have latitude and longitude of county borders, a group (each county has one group), and both a region and subregion. Note we don’t have states and counties - this map is a bit broader than that. It includes cities and US Territories. Also note that capitalization is wonky - it’s all lower case.

To show you how we would use this data

ggplot(data=map_df, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon()

geom_polygon is our newest friend, and it creates polygons from the lat/long paths above. There are other ways to use this map, but it’s the clearest.

Getting ready to join maps and data

OK, so, we want to see how death from heart disease varies by county! To do this, we’ll need to join the two data frames. Note, the number of rows in map_df is YUGE compared to those in our data set, as they define county borders. So we’re going to be merging one data point to many rows in the map_df data frame. But, we need to do some prep work first.

First, we need to fix up the columns in our heart_disease data set to not be capitalized, and we need to rename those columns in our map data frame.

map_df <- map_df %>%
  rename(State = region,
         County = subregion)

heart_disease <- heart_disease %>%
  mutate(State = tolower(State), 
         County = tolower(County))

Now let’s test out the join! Before we do the big join, let’s look at mismatch. This is the perfect job for anti_join

bad_match <- anti_join(heart_disease, map_df)

## Joining by: c("State", "County")

nrow(bad_match)

## [1] 112

head(bad_match)

## Source: local data frame [6 x 4]
## 
##                        State                           County Death_Rate
##                        (chr)                            (chr)      (dbl)
## 1 virgin islands of the u.s. saint thomas (county equivalent)         NA
## 2                  wisconsin                        st. croix      247.8
## 3                   virginia                       winchester      336.0
## 4                   virginia                     williamsburg      294.2
## 5                   virginia                       waynesboro      364.3
## 6                   virginia                         staunton      339.2
## Variables not shown: FIPS_Code (dbl)

Uh oh! 112 rows of bad matches! Why? Well, we can see one of the reasons quickly - the US Virgin Islands. They are not in the map data set. Second, though, we can see st. croix - indeed, the map data frame has no . or ’ characters in it, so, we should filter those out of heart_disease and then re-check.

heart_disease <- heart_disease %>%
  mutate(County = gsub("\\.", "", County)) %>%
  mutate(County = gsub("\\'", "", County))

bad_match2 <- anti_join(heart_disease, map_df)

## Joining by: c("State", "County")

nrow(bad_match2)

## [1] 82

bad_match2

## Source: local data frame [82 x 4]
## 
##                         State                           County Death_Rate
##                         (chr)                            (chr)      (dbl)
## 1  virgin islands of the u.s. saint thomas (county equivalent)         NA
## 2  virgin islands of the u.s.   saint john (county equivalent)         NA
## 3  virgin islands of the u.s.  saint croix (county equivalent)         NA
## 4                    virginia                       winchester      336.0
## 5                    virginia                     williamsburg      294.2
## 6                    virginia                       waynesboro      364.3
## 7                    virginia                         staunton      339.2
## 8                    virginia                            salem      388.1
## 9                    virginia                     roanoke city      429.8
## 10                   virginia                    richmond city      378.4
## ..                        ...                              ...        ...
## Variables not shown: FIPS_Code (dbl)

OK, much better - down to 82. And looking at what’s left, those are cities, not counties, so, should not affect our map making.

Are we good the other way?

bad_match3 <- anti_join(map_df, heart_disease)

## Joining by: c("State", "County")

#because we get a data frame return
length(unique(bad_match3$County))

## [1] 8

unique(bad_match3$County)

## [1] "yellowstone national" "de witt"              "de soto"             
## [4] "la porte"             "du page"              "washington"          
## [7] "de kalb"              "la moure"

Only 8. Were we trying to make this perfect, we’d try and figure out why, but, for now, let’s soldier on. 8 is acceptable.

Joining maps and data

So we know that there are some territories and cities left in the heart disease data, and we don’t want to worry about them for the moment. Given that we’ve got some mismatch on both sides, we want to use an inner_join - although we could also merge so that the entire map is retained, and thus see what counties we have missing data for. Try it both ways.

heart_disease_map_data <- inner_join(heart_disease, map_df)

## Joining by: c("State", "County")

And now - let’s plot it!

heart_map <- ggplot(data=heart_disease_map_data, 
       mapping = aes(x = long, y = lat, group = group,
                     fill = Death_Rate)) +
  geom_polygon() 

heart_map

Making pretty choropleths

So, this is a choropleth map. There are some issues - the scale is hard to resolve, and everything is hard to see. There are a lot of ways we could rectify it. Here are a few tricks.

First, a better scale. Maybe a gradient?

heart_map +
  scale_fill_gradient(low = "white", high = "red")

OK, not bad! Not great, but much better. Are there gradients you’d prefer?

However, the bigger problem is that we have a huge range to cover. One common way to make choropleths is to bin data into categories and go from there. Binning in combination with a nice color scale - say via Color Brewer - cab be great. Remember Color Brewer? http://colorbrewer2.org/ actually uses maps as it’s demo! And there’s a scale_fill_brewer function in ggplot2.

So let’s make bins and plot that instead!

heart_disease_map_data <- heart_disease_map_data %>%
  mutate(death_rate_bins = cut_interval(Death_Rate, 5))

heart_map_binned <- ggplot(data=heart_disease_map_data, 
       mapping = aes(x = long, y = lat, group = group,
                     fill = death_rate_bins)) +
  geom_polygon() 

heart_map_binned +
    scale_fill_brewer(palette="Reds")

MUCH nicer. Now we can really start to see some hot spots.

Is all map data available via packages?

Most map data comes from separate files from R. You’ll see shapefiles most commonly as files that define borders. Shapefiles are funky things. Let’s take a look at the shapefile for a common classification of marine ecoregions of the world. These are coastal regions binned by biogeographic boundaries. To load them, we’ll need the sp and rgdal libraries. These might take some doing to install properly, but, it’s worth it.

library(sp)
library(rgdal)
ecoregions <- readOGR("./join_maps/MEOW-TNC/meow_ecos.shp", layer="meow_ecos")

## OGR data source with driver: ESRI Shapefile 
## Source: "./join_maps/MEOW-TNC/meow_ecos.shp", layer: "meow_ecos"
## with 232 features
## It has 9 fields

plot(ecoregions)

Neat - these kinds of files can be plotted on their own! But, we’re going to take this a bit further.

Looking under the hood of a shapefile

So, if we dig into this object, what do we see.

class(ecoregions)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

#It's an S4 object
slotNames(ecoregions)

## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"

So, it’s a SpatialPolygonsDataFrame. Weird. And it has all of these pieces to it - called slots - as it is what we call an S4 object. More on that another time, but, suffice to say, instead of the $ notation we use an @ notation to access pieces of it. Such as

head(ecoregions@data)

##   ECO_CODE                   ECOREGION PROV_CODE
## 0    20192                Agulhas Bank        51
## 1    20053            Aleutian Islands        10
## 2    20072                    Amazonia        13
## 3    20194           Amsterdam-St Paul        52
## 4    20228 Amundsen/Bellingshausen Sea        61
## 5    20109 Andaman and Nicobar Islands        24
##                           PROVINCE RLM_CODE                      REALM
## 0                          Agulhas       10  Temperate Southern Africa
## 1 Cold Temperate Northeast Pacific        3 Temperate Northern Pacific
## 2               North Brazil Shelf        4          Tropical Atlantic
## 3                Amsterdam-St Paul       10  Temperate Southern Africa
## 4       Continental High Antarctic       12             Southern Ocean
## 5                          Andaman        5       Western Indo-Pacific
##   ALT_CODE ECO_CODE_X  Lat_Zone
## 0      189        192 Temperate
## 1       50         53 Temperate
## 2       64         72  Tropical
## 3      191        194 Temperate
## 4      209        228     Polar
## 5      105        109  Tropical

Here we see the ecoregion names and some other properties of them. Other slots are filled with information that defines polygons, etc.

Using joins to make ecoregion data frame for plotting

So, to make this into a data frame for mapping, we have to work a littl ebit of magic. Step 1, we need to give the data slot an ID. This is necessary for a join, as the polygons all have an ID as well

slotNames(ecoregions@polygons[[1]])

## [1] "Polygons"  "plotOrder" "labpt"     "ID"        "area"

ecoregions@data$id = rownames(ecoregions@data)

Great! Step 2, we need to turn those polygons into a data frame of points. There’s a function for this in ggplot2, called fortify

ecoregions_points = fortify(ecoregions, ECOREGION="id")

## Regions defined for each Polygons

head(ecoregions_points)

##       long       lat order  hole piece id group
## 1 28.35993 -36.64435     1 FALSE     1  0   0.1
## 2 28.28350 -36.67695     2 FALSE     1  0   0.1
## 3 28.22727 -36.69682     3 FALSE     1  0   0.1
## 4 28.13197 -36.74429     4 FALSE     1  0   0.1
## 5 28.00760 -36.77946     5 FALSE     1  0   0.1
## 6 27.86775 -36.84124     6 FALSE     1  0   0.1

And now the last step - a join! We need to join the original data defining names of ecoregions with the data defining their borders.

ecoregions_df = inner_join(ecoregions_points, ecoregions@data, by="id")

Voila! We have ecoregions in a data frame format!

ggplot(data=ecoregions_df, 
       mapping = aes(x = long, y = lat, group = group)) +
geom_polygon()

Enter Regional Temperature Information

OK, we have some polygons - now what are we going to do with them? One use is to visualize climate change over time. I’ve been working on a project to look at climate change in Ecoregions that contain kelp. So, for each Ecoregion, I’ve gotten all of the temperature data from 1950 - the present from http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html - a great source for SST data.

sst_data <- read.csv("./join_maps/hadsst_regions.csv",
                     stringsAsFactors=FALSE)

head(sst_data)

##                                    ECOREGION   DateName     tempC Year
## 1                               Agulhas Bank 1950.01.16 21.137959 1950
## 2                           Aleutian Islands 1950.01.16  3.909404 1950
## 3                                    Bassian 1950.01.16 14.309411 1950
## 4 Beaufort Sea - continental coast and shelf 1950.01.16 -1.800000 1950
## 5                                  Cape Howe 1950.01.16 18.455363 1950
## 6                                Celtic Seas 1950.01.16 10.399248 1950

How can we use this data to see climate change on a map? There are a lot of methods. One simple one is to visualize change by decade. First, we have to use a quick trick to calculate decades - divide by ten, round to a whole number, and then multiple by ten.

sst_decadal <- sst_data %>%
  mutate(Decade = round(Year/10)*10) %>%
  group_by(ECOREGION, Decade) %>%
  summarise(tempC = mean(tempC, na.rm=T)) %>%
  ungroup()

Now, we can look at temperature by decade, but, we can’t just look at raw temperature. The color palatte would be too spread out, and we couldn’t compare, say, change in the tropics to change in the Arctic. One way that scientists have gotten around this is to look at temperature anomoly - that is, deviation from a long-term mean. So let’s calculate the decadal deviation from the long-term average for each ecoregion.

sst_decadal <- sst_decadal %>%
  group_by(ECOREGION) %>%
  mutate(tempC_anomoly = tempC - mean(tempC)) %>%
  ungroup()

OK, now we’ve got something worth plotting!

Visualizing Climate Change

There are a lot of ways we can visualize this. We can look at long-term trends, of course, and even look at each region as a single data point to get average long-term trends

ggplot(data=sst_decadal, mapping=aes(x=Decade, y=tempC_anomoly)) +
  geom_line(mapping=aes(group=ECOREGION), colour="grey") +
  theme_bw() +
  stat_smooth(method="lm", color="red", fill=NA) +
  ylab("Decadal Temperature Anomoly")

But how do we put this information on a map so that we can truly see it?

As we did with states, we can join the temperature data to the marine ecoregion data.

sst_map_data <- inner_join(sst_decadal, ecoregions_df)

## Joining by: "ECOREGION"

## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

Once that is done, we can use this decadal anomoly as a fill, with facets to show different decades.

ggplot(data=sst_map_data, 
       mapping = aes(x = long, y = lat, 
                     group = group, fill = tempC_anomoly)) +
  borders("world", lwd=0.1, colour="black") +
  geom_polygon(alpha=0.9) +
  facet_wrap(~Decade) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw(base_size=14) +
  coord_equal()

It’s a different way of looking at the same data. But what do you see?