Gridding data for multi-scale macroecological analyses

These are materials for the first practical lesson of the Spatial Scale in Ecology course. All of the data and codes are available here. The class covered a 1.5h session. R code for the session is also at the end of this post. The following advices and ideas are mostly my personal opinions, they were not published, feel free to spread them, or to comment on them at the end of the post:

1. Do we need to fit everything to grids?

In other words, can't we just use points and shapefiles, and "extract" whatever they overlap? Is scale real, or is it only in the eye of the beholder? What are the alternatives to grids? We haven't really started a deeper discussion here, it was perhaps too abstract. I was trying to push forward the idea that grids should be used always. There is something rotten in the way that current SDM techniques simply extract environmental variables around point records. I consider it to be an open and important (and broad) problem.

2. What is the best shape of grid cells for multi-scale analyses?

Almost everybody works with grids that have rectangular grid cells. It is practical because you can aggregate them into grids of coarser resolution, and you can think about them as simple rectangular matrices and effectively analyze them (e.g. as raster data). Rectangular grids are also useful at sub-continental extents, as geometry of their grid cells is less sensitive to the particular projection at these extents.  Triangular and hexagonal grids have one potential advantage over rectangular ones: you can wrap them around a sphere like that (image source: Wikipedia):

tri_shere hex_sphere

Such grids are sort of "projection-free" - they are equidistant and they have constant area of grid cells at the same time, which could be huge advantage in macroecology. The disadvantage is that triangles and hexagons do not fit our rectangular mindsets and toolsets. Another disadvantage of hexagonal grids is that you cannot simply aggregate them to produce grid of coarser grain. On the other hand, triangular grids are scalable - I quite fancy the idea of triangular grids - maybe we should reconsider them as maybe they are easier to work with than we think. Maybe.

What about elongation of grid cells? Can we use any rectangular shapes for macroecological grids? Essentially yes, but it may be wise to keep the shape of a grid cell constant over the extent of a study. Consider the following illustration:
kunin_shape
It is from Kunin 1997 who gave the issue the first treatment in ecological context. All of the rectangular shapes have identical area, yet the more elongated ones will generally harbor more habitat heterogeneity and more species, simply because of the First Law of Biogeography: distant places are more different from each other than places that lay close.

3. What needs to be the projection of a rectangular grid?

So elongated shapes harbor more species (habitats, heterogeneity, diversity of any kind) than regular shapes. Well, then you better keep shape of all grid cells constant. But also larger cells harbor more species (diversity of any kind) than small cells. So we need to also keep the area of each grid cell constant. With a spherical world and rectangular grid these are two contradicting requirements, and there is no good solution.
For macroecological purposes the usual compromise is Cylindrical Equal Area projection (or other equal area projection) of the grid, as the effect of area is usually considered more serious than the effect of shape (but this has not been rigorously tested yet - it may not be true). This is gridded world map in the Cylindrical Equal Area projection:
equal_map_global
This is how the equal-area grid cells look in the common WGS 1984 projection:
wgs_map_global

Note the elongated shapes at high latitudes - these will surely over-represent diversity (because of their shape, not because of their area).

4. Should the grids be nested? Should they have the same extent?

I would argue that for simple data visualizations and simple pattern explorations at different grid resolutions, the grids do not have to be strictly nested. However, for more rigorous and multi-scale statistical modelling it can be wise to keep the origin and the extent of the grids constant. For example, in order to build a spatially structured hierarchical model it is really advantageous when the grids are nested.

5. What if my grid cells overlap sea (abyss, World edge, lava field, administrative boundary)?

Currently there are two common solutions, both of them sub-optimal. First approach is to ignore the problem and to pretend that 1/3 of a grid cell is a whole grid cell. This is what (amongst many) Rahbek & Graves 2001 did:
rahbek

Another extreme is to follow the dictatorship of geometry strictly, and to delete all grid cells that have even a very small proportion of sea. In case of a system of nested grid this will be dependent on our coarsest grid. This the approach that I chose in one of my papers:

keil_jbi

The cost is that one loses most of the coastal areas. There are also compromises - some workers have used some arbitrarily chosen thresholds on the acceptable proportion of sea area within a grid cell (e.g. up to 10%-20% is still acceptable).

There is a third way, which we will explore in the following lessons. It involves keeping all grid cells in the data, regardless to the area of sea within them. The next step is not to ignore this fact (as is often done), but to effectively incorporate it into a statistical model as an area of completely unsuitable habitat, with a proper prior knowledge of its effect.

6. What if my grid is irregular?

For simple data explorations it is a problem (see my paper), as the patterns are affected both by shape and area of a grid cell. However, this problem can be easily fixed in hierarchical modelling.
keil_hawkins

Usually, data in the form of "bioregions", biomes, altitudinal bands, administrative units, "evolutionary arenas", regional checklists or nature reserves are subject to this problem - they can be considered as grids that just happen to have very non-constant grid cell shape and area. Some may argue that they represent a sort of "natural grain or natural grid" - often, their boundaries are really important and determining what happens within the cells. There is, however, a pitfall: they confound the effect of shape, area and environment in such a complex way, that it is difficult to separate them. By working in a regular grid one has at least the geometry under control. This is desirable - unless one is interested purely in the role of geometry, then these "natural units" may be useful, but only if they are defined based on environmental conditions within them.

7. Is there a "good" or "optimal" grid resolution?

No. What really matters is not a particular grain resolution, but spatial scaling relationships that go across all imaginable grain resolutions. We should study them all - the species-area relationship, the occupancy-area relationship, the area-energy-diversity relationship, the relationship between area, habitat and occurrence of a species. They are the key. Everything is sort of "hanging" on these pillars. We should look for simple, rigorous and intuitive ways to construct these relationships, and build the rest of spatial ecology on them.

This is even more important for applied ecology. It does not matter that much what is (or will be in future) the pattern of species distribution at a single scale. It does not matter that much what predicts species occurrences at the finest grain resolutions. It is naive to say that, for applied and conservation purposes, it is fine grain that matters. All grains (potentially) matter. Appropriate scaling relationships are the baseline that links all of the grains together.

Simply put: Do not work at a single scale, do not seek the single best scale, keep in mind that fine-grain ecological properties may not be necessarily more important than coarse-grain properties (but note that fine resolution data can be more useful for further work than coarse-grain data).

8. What is the reliable grid resolution that captures species distributions based on point records and expert-drawn range maps?

Currently, and for continental to global extents, it is something roughly above 100 km x 100 km. See the Jetz & Hurlbert's paper and Hawkins et al.'s paper. This may be different for more systematically collected data. This is also an open problem, there is much more to be explored about spatial scaling of sampling effort, and it would be a nice project for a really nice paper. Somebody should do it. We should do it. Let's do it next Monday.

9. What is the best way to aggregate different types of data?

It all depends on the character of variable that you are aggregating. Recall what you learned in physics about extensive and intensive properties.Use this knowledge to your advantage:
Intensive variables - impossible to aggregate by very definition, unless you convert them into something else. Altitude is such variable. You cannot aggregate it, but you can calculate its mean, minimum or maximum in a broader area. But then it is not the variable it used to be any more (you have just converted it into an extensive one). Temperature can be similarly tricky.
Extensive variables - easy to aggregate. Area of a suitable habitat, total annual rainfall are examples of such variables.
Strange variables - species richness, beta diversity, endemism and all that. They are somewhere between being extensive and intensive. They have their own rules. Beware! (And study species-area relationships).

10. What is the available software
- ArcGIS - see the fishnet function; probably useful to know Python scripting in arcpy
- R - this is what we will focus on.

11. Practical exercise - loading data into R, creating grids in R and aggregating data into grids of different resolutions in R?

The code may have its bugs. This is what we went through during the second half of the class:

# ------------------------------------------------------------------------------
# 1. LOADING LIBRARIES

# loading some useful libraries
library(dismo)
library(XML)
library(maptools)
library(sp)
library(raster)
library(foreign)
library(rgdal)

# for the proj4 strings see www.spatialreference.org
us.atlas.proj <- CRS("+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs")
wgs1984.proj <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")


# ------------------------------------------------------------------------------
# 2. USA BOUNDARY SHAPEFILE
us <- readShapePoly("US_boundary/US_boundary2.shp")

par(mfcol=c(1,2))
plot(us)
# projecting the us shapefile
proj4string(us) <- wgs1984.proj
# reprojecting to US National Atlas equal-area projection
us.equal <- spTransform(us, us.atlas.proj)
plot(us.equal)

# ------------------------------------------------------------------------------
# 3. EXPERT RANGE-MAP DATA - Lynx canadensis
lynx.shape <- readShapePoly("LYNX/lynx.shp")
# plot shapefile:
par(mfcol=c(1,2))
plot(us)
plot(lynx.shape, add=T, col="pink")
# projecting the range map
proj4string(lynx.shape) <- wgs1984.proj
# reprojecting to US National Atlas equal-area projection
lynx.shape.equal <- spTransform(lynx.shape, us.atlas.proj)
plot(us.equal)
plot(lynx.shape.equal, add=T, col="pink")


# ------------------------------------------------------------------------------
# 4. POINT DATA - Picoides dorsalis

# get GBIF data with function gbif():
picoides.gbif <- gbif("Picoides", "dorsalis", geo=T)
# plot occurrences:
par(mfcol=c(1,2))
plot(us)
points(picoides.gbif$lon, picoides.gbif$lat,
       pch=19, cex=0.3, col="blue")
# projecting
# important: long goes first and hence 8:7
picoides <- SpatialPoints(coords=picoides.gbif[,8:7],
                                 proj4string=wgs1984.proj)
# and reprojecting
picoides.equal <- spTransform(picoides, us.atlas.proj)
plot(us.equal)
points(picoides.equal, pch=19, cex=0.3, col="blue")

# ------------------------------------------------------------------------------
# 5. CREATING AN EMPTY GRID (Thanks, Robert Hijmans!)
r <- raster(us.equal)
res(r) <- 50000
r[] <- rnorm(ncell(r))
plot(r)
plot(us.equal, add=T)
r[] <- 0

# ------------------------------------------------------------------------------
# 6. RASTERIZING (fitting data into a grid)

plot(us.equal)

# rasterizing the point data on Picoides dorsalis
picoides.raster <- rasterize(picoides.equal, r)
picoides.raster[picoides.raster>=1] <- 1
picoides.raster[picoides.raster==0] <- NA
# limiting the data only to US
picoides.raster <- picoides.raster*us.raster
plot(picoides.raster, add=T)
plot(us.equal, add=T)

# rasterizing the shapefile on Lynx arcticus
lynx.raster <- rasterize(lynx.shape.equal, r, getCover=TRUE)
lynx.raster[lynx.raster>=1] <- 1
lynx.raster[lynx.raster==0] <- NA
plot(lynx.raster, add=T)

# rasterizing US boundary
us.raster <- rasterize(us.equal, r, getCover=TRUE)
# conditioning on the amount of mainland in a grid cell
us.raster[us.raster>=1] <- 1
us.raster[us.raster==0] <- NA
plot(us.raster)

# rasterizing environmental data
temp <- raster("ENVI/bio_1")
temp.proj <- projectRaster(temp, crs=wgs1984.proj)
temp.equal <- projectRaster(from=temp.proj, to=r)
plot(temp.equal)
plot(us.equal, add=T)

# ------------------------------------------------------------------------------
# 7. AGGREGATE TO COARSE GRAIN - picoides occurrences
par(mfrow=c(2,2), mai=c(0.1, 0.1, 0.5, 0.1))

# 50 x 50 resolution
plot(picoides.raster, axes=FALSE,
     legend=FALSE, main="50 km x 50 km")
plot(us.equal, add=T)
#points(picoides.equal, col="red", cex=0.4, pch=19)

# 100 x 100 resolution
pic.coarser <- aggregate(picoides.raster, fact=2, fun=max)
plot(pic.coarser, axes=FALSE,
     legend=FALSE, main="100 km x 100 km")
plot(us.equal, add=T)

# 200 x 200 resolution
pic.coarser2 <- aggregate(pic.coarser, fact=2, fun=max)
plot(pic.coarser2, axes=FALSE,
     legend=FALSE, main="200 km x 200 km")
plot(us.equal, add=T)

# 50 x 50 resolution
pic.coarser4 <- aggregate(pic.coarser2, fact=2, fun=max)
plot(pic.coarser4, , axes=FALSE,
     legend=FALSE, main="400 km x 400 km")
plot(us.equal, add=T)


# ------------------------------------------------------------------------------
# 8. AGGREGATE TO COARSE GRAIN - lynx occurrences
par(mfrow=c(2,2), mai=c(0.1, 0.1, 0.5, 0.1))

# 50 x 50 resolution
lynx.raster <- lynx.raster*us.raster
plot(lynx.raster, axes=FALSE,
     legend=FALSE, main="50 km x 50 km")
plot(us.equal, add=T)
#plot(lynx.shape.equal, add=TRUE)

# 100 x 100 resolution
lynx.coarser <- aggregate(lynx.raster, fact=2, fun=max)
plot(lynx.coarser, axes=FALSE,
     legend=FALSE, main="100 km x 100 km")
plot(us.equal, add=T)

# 200 x 200 resolution
lynx.coarser2 <- aggregate(lynx.coarser, fact=2, fun=max)
plot(lynx.coarser2, axes=FALSE,
     legend=FALSE, main="200 km x 200 km")
plot(us.equal, add=T)

# 400 x 400 resolution
lynx.coarser4 <- aggregate(lynx.coarser2, fact=2, fun=max)
plot(lynx.coarser4, , axes=FALSE,
     legend=FALSE, main="400 km x 400 km")
plot(us.equal, add=T)


# ------------------------------------------------------------------------------
# 9. AGGREGATE TO COARSE GRAIN - temperature
par(mfrow=c(2,2), mai=c(0.1, 0.1, 0.5, 0.1))

temp.equal <- temp.equal*us.raster
plot(temp.equal, axes=FALSE,
     main="50 km x 50 km")
plot(us.equal, add=T)

temp.coarser <- aggregate(temp.equal, fact=2, fun=mean)
plot(temp.coarser, axes=FALSE,
     main="100 km x 100 km")
plot(us.equal, add=T)

temp.coarser2 <- aggregate(temp.coarser, fact=2, fun=mean)
plot(temp.coarser2, axes=FALSE,
     main="200 km x 200 km")
plot(us.equal, add=T)

temp.coarser4 <- aggregate(temp.coarser2, fact=2, fun=mean)
plot(temp.coarser4,
     axes=FALSE, main="400 km x 400 km")
plot(us.equal, add=T)


# ------------------------------------------------------------------------------
# 10. EXTRACTING CARNIVORANS SPECIES RICHNESS

# reading the complete IUCN mammal dataset
# which is downloadable at:
# http://www.iucnredlist.org/technical-documents/spatial-data#mammals
# THIS DATA ARE NOT PROVIDED BY THIS POST!
mammterr <- readShapePoly("MAMMTERR/MAMMTERR.shp")

# reading the list of carnivoran species
carniv.list <- read.table("carnivorans.txt",
               sep="\t", header=T)
carniv.list <- paste(carniv.list[,1], carniv.list[,2], sep=" ")

# subsetting the mammal data only to carnivorans
carniv <- mammterr[mammterr$BINOMIAL %in% carniv.list,]
# projecting and reprojecting the dataset
proj4string(carniv) <- wgs1984.proj
carniv.equal <- spTransform(carniv, us.atlas.proj)

# calculation of species richness
# THIS IS A BETA VERSION - IS PROBABLY WRONG!
carniv.raster <- rasterize(carniv.equal, r,
                           fun=function(x)length(unique(x)))
# eliminating cells that do not overlap US
carniv.raster <- carniv.raster*us.raster

# creating empty raster at coarse scale (200 x 200 km)
r4 <- aggregate(r, fact=4, fun=mean)
# calculation of species richness at the coarse scale
carniv.raster.4 <- rasterize(carniv.equal, r4,
                             fun=function(x)length(unique(x)))
# eliminating cells that do not overlap US
carniv.raster.4 <- m2.raster.4*us.raster.4

# plotting richness
par(mfrow=c(1,2), mai=c(0.1, 0.1, 0.5, 0.1))
# 50 x 50 km
plot(carniv.raster, axes=FALSE,
     main="50 km x 50 km",
     col=heat.colors(20))
plot(us.equal, add=T)
# 200 x 200 km
plot(carniv.raster.4, axes=FALSE,
     main="200 km x 200 km",
     col=heat.colors(20))
plot(us.equal, add=T)

3 thoughts on “Gridding data for multi-scale macroecological analyses

  1. Thanks Petr Keil!
    This post was very helpful for I start use GIS in R.
    I think you could do a little change in the command for calculate species richness. You need calculate richness without "NA" and specify the field of the data frame to the function calculate the length(). When I did't specify the field the R returns the following error:
    Error in match(x, table, nomatch = 0L) :
    'match' requires vector arguments

    carniv.raster <- rasterize(carniv.equal, r, field="BINOMIAL"
    fun=function(x)length(unique(na.omit(x))))

    cheeRs!

  2. I'd be interested in your thoughts regarding free-form gridding of the kind the non-linear modelers use, e.g., JGR article by Chen, at al, figure linked at
    Figure 2. Now, clearly, if a design grid were supposed to be a sampling frame, this might not be so suitable, unless there were obstacles. But, say, if points were sampled pseudo-randomly in space and there was a high accuracy means of noting where the points were sampled, there might be a sense in which the samples could be stratified so the same number appeared in each cell, and these would result in grid cell boundaries which were not uniform. I guess, too, if the sampling plan were administered automatically, it might be driven by a random spatial point pattern, with coordinates noted, but the pattern could be driven by what the student thought were good places to get specimens versus not-so-much.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>