Today we’re going to look at how we can use joins and geospatial data more explicitly to make maps. Maps are among the first data visualizations that ever occured. And some of the most powerful. They’re also one of the places where joins become incredibly important, as to put data on a map we have to join our data with a geospatial description of the map we want.
Today’s data set that we’ll be using is a data set of heart disease mortality from the CDC
#load the data and prep
#for some data manipulation
library(readxl)
library(dplyr)
heart_disease <- read_excel("./join_maps/hd_all.xlsx",
na="Insufficient Data")
head(heart_disease)
## Source: local data frame [6 x 4]
##
## State County Death_Rate FIPS_Code
## (chr) (chr) (dbl) (dbl)
## 1 Alabama Autauga 463.0 1001
## 2 Alabama Baldwin 391.4 1003
## 3 Alabama Barbour 533.1 1005
## 4 Alabama Bibb 511.1 1007
## 5 Alabama Blount 425.6 1009
## 6 Alabama Bullock 483.2 1011
OK, we see that we have state, county, and information on death. FIPS codes, FYI, are standardized county codes. We’ll be ignoring them.
There are a LOT of ways to get map data into R. We’re going to begin with the simplest - grabbing it from an R package. ggplot2
works in tandem with the maps
package to provide a few standardized sets of maps for easy plotting. Let’s take a look at one of counties in the U.S. lower 48.
#install the mapdata library if you
#don't have it
library(mapdata)
## Loading required package: maps
## Warning: package 'maps' was built under R version 3.2.3
##
## # ATTENTION: maps v3.0 has an updated 'world' map. #
## # Many country borders and names have changed since 1990. #
## # Type '?world' or 'news(package="maps")'. See README_v3. #
library(ggplot2)
#map_data gets us one of the select maps
map_df <- map_data("county")
head(map_df)
## long lat group order region subregion
## 1 -86.50517 32.34920 1 1 alabama autauga
## 2 -86.53382 32.35493 1 2 alabama autauga
## 3 -86.54527 32.36639 1 3 alabama autauga
## 4 -86.55673 32.37785 1 4 alabama autauga
## 5 -86.57966 32.38357 1 5 alabama autauga
## 6 -86.59111 32.37785 1 6 alabama autauga
OK, we have latitude and longitude of county borders, a group (each county has one group), and both a region and subregion. Note we don’t have states and counties - this map is a bit broader than that. It includes cities and US Territories. Also note that capitalization is wonky - it’s all lower case.
To show you how we would use this data
ggplot(data=map_df, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon()
geom_polygon
is our newest friend, and it creates polygons from the lat/long paths above. There are other ways to use this map, but it’s the clearest.
OK, so, we want to see how death from heart disease varies by county! To do this, we’ll need to join the two data frames. Note, the number of rows in map_df
is YUGE compared to those in our data set, as they define county borders. So we’re going to be merging one data point to many rows in the map_df
data frame. But, we need to do some prep work first.
First, we need to fix up the columns in our heart_disease
data set to not be capitalized, and we need to rename those columns in our map data frame.
map_df <- map_df %>%
rename(State = region,
County = subregion)
heart_disease <- heart_disease %>%
mutate(State = tolower(State),
County = tolower(County))
Now let’s test out the join! Before we do the big join, let’s look at mismatch. This is the perfect job for anti_join
bad_match <- anti_join(heart_disease, map_df)
## Joining by: c("State", "County")
nrow(bad_match)
## [1] 112
head(bad_match)
## Source: local data frame [6 x 4]
##
## State County Death_Rate
## (chr) (chr) (dbl)
## 1 virgin islands of the u.s. saint thomas (county equivalent) NA
## 2 wisconsin st. croix 247.8
## 3 virginia winchester 336.0
## 4 virginia williamsburg 294.2
## 5 virginia waynesboro 364.3
## 6 virginia staunton 339.2
## Variables not shown: FIPS_Code (dbl)
Uh oh! 112 rows of bad matches! Why? Well, we can see one of the reasons quickly - the US Virgin Islands. They are not in the map data set. Second, though, we can see st. croix
- indeed, the map data frame has no . or ’ characters in it, so, we should filter those out of heart_disease and then re-check.
heart_disease <- heart_disease %>%
mutate(County = gsub("\\.", "", County)) %>%
mutate(County = gsub("\\'", "", County))
bad_match2 <- anti_join(heart_disease, map_df)
## Joining by: c("State", "County")
nrow(bad_match2)
## [1] 82
bad_match2
## Source: local data frame [82 x 4]
##
## State County Death_Rate
## (chr) (chr) (dbl)
## 1 virgin islands of the u.s. saint thomas (county equivalent) NA
## 2 virgin islands of the u.s. saint john (county equivalent) NA
## 3 virgin islands of the u.s. saint croix (county equivalent) NA
## 4 virginia winchester 336.0
## 5 virginia williamsburg 294.2
## 6 virginia waynesboro 364.3
## 7 virginia staunton 339.2
## 8 virginia salem 388.1
## 9 virginia roanoke city 429.8
## 10 virginia richmond city 378.4
## .. ... ... ...
## Variables not shown: FIPS_Code (dbl)
OK, much better - down to 82. And looking at what’s left, those are cities, not counties, so, should not affect our map making.
Are we good the other way?
bad_match3 <- anti_join(map_df, heart_disease)
## Joining by: c("State", "County")
#because we get a data frame return
length(unique(bad_match3$County))
## [1] 8
unique(bad_match3$County)
## [1] "yellowstone national" "de witt" "de soto"
## [4] "la porte" "du page" "washington"
## [7] "de kalb" "la moure"
Only 8. Were we trying to make this perfect, we’d try and figure out why, but, for now, let’s soldier on. 8 is acceptable.
So we know that there are some territories and cities left in the heart disease data, and we don’t want to worry about them for the moment. Given that we’ve got some mismatch on both sides, we want to use an inner_join
- although we could also merge so that the entire map is retained, and thus see what counties we have missing data for. Try it both ways.
heart_disease_map_data <- inner_join(heart_disease, map_df)
## Joining by: c("State", "County")
And now - let’s plot it!
heart_map <- ggplot(data=heart_disease_map_data,
mapping = aes(x = long, y = lat, group = group,
fill = Death_Rate)) +
geom_polygon()
heart_map
So, this is a choropleth map. There are some issues - the scale is hard to resolve, and everything is hard to see. There are a lot of ways we could rectify it. Here are a few tricks.
First, a better scale. Maybe a gradient?
heart_map +
scale_fill_gradient(low = "white", high = "red")
OK, not bad! Not great, but much better. Are there gradients you’d prefer?
However, the bigger problem is that we have a huge range to cover. One common way to make choropleths is to bin data into categories and go from there. Binning in combination with a nice color scale - say via Color Brewer - cab be great. Remember Color Brewer? http://colorbrewer2.org/ actually uses maps as it’s demo! And there’s a scale_fill_brewer
function in ggplot2.
So let’s make bins and plot that instead!
heart_disease_map_data <- heart_disease_map_data %>%
mutate(death_rate_bins = cut_interval(Death_Rate, 5))
heart_map_binned <- ggplot(data=heart_disease_map_data,
mapping = aes(x = long, y = lat, group = group,
fill = death_rate_bins)) +
geom_polygon()
heart_map_binned +
scale_fill_brewer(palette="Reds")
MUCH nicer. Now we can really start to see some hot spots.
Most map data comes from separate files from R. You’ll see shapefiles most commonly as files that define borders. Shapefiles are funky things. Let’s take a look at the shapefile for a common classification of marine ecoregions of the world. These are coastal regions binned by biogeographic boundaries. To load them, we’ll need the sp and rgdal libraries. These might take some doing to install properly, but, it’s worth it.
library(sp)
library(rgdal)
ecoregions <- readOGR("./join_maps/MEOW-TNC/meow_ecos.shp", layer="meow_ecos")
## OGR data source with driver: ESRI Shapefile
## Source: "./join_maps/MEOW-TNC/meow_ecos.shp", layer: "meow_ecos"
## with 232 features
## It has 9 fields
plot(ecoregions)
Neat - these kinds of files can be plotted on their own! But, we’re going to take this a bit further.
So, if we dig into this object, what do we see.
class(ecoregions)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
#It's an S4 object
slotNames(ecoregions)
## [1] "data" "polygons" "plotOrder" "bbox" "proj4string"
So, it’s a SpatialPolygonsDataFrame. Weird. And it has all of these pieces to it - called slots - as it is what we call an S4 object. More on that another time, but, suffice to say, instead of the $
notation we use an @
notation to access pieces of it. Such as
head(ecoregions@data)
## ECO_CODE ECOREGION PROV_CODE
## 0 20192 Agulhas Bank 51
## 1 20053 Aleutian Islands 10
## 2 20072 Amazonia 13
## 3 20194 Amsterdam-St Paul 52
## 4 20228 Amundsen/Bellingshausen Sea 61
## 5 20109 Andaman and Nicobar Islands 24
## PROVINCE RLM_CODE REALM
## 0 Agulhas 10 Temperate Southern Africa
## 1 Cold Temperate Northeast Pacific 3 Temperate Northern Pacific
## 2 North Brazil Shelf 4 Tropical Atlantic
## 3 Amsterdam-St Paul 10 Temperate Southern Africa
## 4 Continental High Antarctic 12 Southern Ocean
## 5 Andaman 5 Western Indo-Pacific
## ALT_CODE ECO_CODE_X Lat_Zone
## 0 189 192 Temperate
## 1 50 53 Temperate
## 2 64 72 Tropical
## 3 191 194 Temperate
## 4 209 228 Polar
## 5 105 109 Tropical
Here we see the ecoregion names and some other properties of them. Other slots are filled with information that defines polygons, etc.
So, to make this into a data frame for mapping, we have to work a littl ebit of magic. Step 1, we need to give the data
slot an ID. This is necessary for a join, as the polygons all have an ID as well
slotNames(ecoregions@polygons[[1]])
## [1] "Polygons" "plotOrder" "labpt" "ID" "area"
ecoregions@data$id = rownames(ecoregions@data)
Great! Step 2, we need to turn those polygons into a data frame of points. There’s a function for this in ggplot2, called fortify
ecoregions_points = fortify(ecoregions, ECOREGION="id")
## Regions defined for each Polygons
head(ecoregions_points)
## long lat order hole piece id group
## 1 28.35993 -36.64435 1 FALSE 1 0 0.1
## 2 28.28350 -36.67695 2 FALSE 1 0 0.1
## 3 28.22727 -36.69682 3 FALSE 1 0 0.1
## 4 28.13197 -36.74429 4 FALSE 1 0 0.1
## 5 28.00760 -36.77946 5 FALSE 1 0 0.1
## 6 27.86775 -36.84124 6 FALSE 1 0 0.1
And now the last step - a join! We need to join the original data defining names of ecoregions with the data defining their borders.
ecoregions_df = inner_join(ecoregions_points, ecoregions@data, by="id")
Voila! We have ecoregions in a data frame format!
ggplot(data=ecoregions_df,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon()
OK, we have some polygons - now what are we going to do with them? One use is to visualize climate change over time. I’ve been working on a project to look at climate change in Ecoregions that contain kelp. So, for each Ecoregion, I’ve gotten all of the temperature data from 1950 - the present from http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html - a great source for SST data.
sst_data <- read.csv("./join_maps/hadsst_regions.csv",
stringsAsFactors=FALSE)
head(sst_data)
## ECOREGION DateName tempC Year
## 1 Agulhas Bank 1950.01.16 21.137959 1950
## 2 Aleutian Islands 1950.01.16 3.909404 1950
## 3 Bassian 1950.01.16 14.309411 1950
## 4 Beaufort Sea - continental coast and shelf 1950.01.16 -1.800000 1950
## 5 Cape Howe 1950.01.16 18.455363 1950
## 6 Celtic Seas 1950.01.16 10.399248 1950
How can we use this data to see climate change on a map? There are a lot of methods. One simple one is to visualize change by decade. First, we have to use a quick trick to calculate decades - divide by ten, round to a whole number, and then multiple by ten.
sst_decadal <- sst_data %>%
mutate(Decade = round(Year/10)*10) %>%
group_by(ECOREGION, Decade) %>%
summarise(tempC = mean(tempC, na.rm=T)) %>%
ungroup()
Now, we can look at temperature by decade, but, we can’t just look at raw temperature. The color palatte would be too spread out, and we couldn’t compare, say, change in the tropics to change in the Arctic. One way that scientists have gotten around this is to look at temperature anomoly - that is, deviation from a long-term mean. So let’s calculate the decadal deviation from the long-term average for each ecoregion.
sst_decadal <- sst_decadal %>%
group_by(ECOREGION) %>%
mutate(tempC_anomoly = tempC - mean(tempC)) %>%
ungroup()
OK, now we’ve got something worth plotting!
There are a lot of ways we can visualize this. We can look at long-term trends, of course, and even look at each region as a single data point to get average long-term trends
ggplot(data=sst_decadal, mapping=aes(x=Decade, y=tempC_anomoly)) +
geom_line(mapping=aes(group=ECOREGION), colour="grey") +
theme_bw() +
stat_smooth(method="lm", color="red", fill=NA) +
ylab("Decadal Temperature Anomoly")
But how do we put this information on a map so that we can truly see it?
As we did with states, we can join the temperature data to the marine ecoregion data.
sst_map_data <- inner_join(sst_decadal, ecoregions_df)
## Joining by: "ECOREGION"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
Once that is done, we can use this decadal anomoly as a fill, with facets to show different decades.
ggplot(data=sst_map_data,
mapping = aes(x = long, y = lat,
group = group, fill = tempC_anomoly)) +
borders("world", lwd=0.1, colour="black") +
geom_polygon(alpha=0.9) +
facet_wrap(~Decade) +
scale_fill_gradient(low = "blue", high = "red") +
theme_bw(base_size=14) +
coord_equal()
It’s a different way of looking at the same data. But what do you see?