Analyses for spatial autocorrelation

Many environmental and biological variables contain spatial structure. This article describes where autocorrelation occurs and statistical methods to account for it.

Where does spatial autocorrelation happen?

Many chemical, physical, and biological processes lead to patterns of spatial autocorrelation (SAC). For instance, on a molecular scale, diffusion generates SAC in the concentration of a chemical. On an oceanic scale, convection mixes water parcels to the same effect. In ecosystems, species invade nearby areas more often than distant ones, leading to SAC in community structure. Essentially, SAC occurs in almost every system where a process of transport occurs – or has occurred in the past.
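
To make that concrete, here’s a minimal sketch in R that tests for SAC with Moran’s I from the ape package. The coordinates and “concentration” values are simulated stand-ins, not real data:

    library(ape)

    # Simulated stand-ins: sampling coordinates and a value that trends
    # smoothly across space (as a transported chemical would)
    set.seed(1)
    pts <- data.frame(x = runif(60), y = runif(60))
    conc <- pts$x + pts$y + rnorm(60, sd = 0.2)

    # Inverse-distance weights: nearby points count most toward the statistic
    w <- 1 / as.matrix(dist(pts))
    diag(w) <- 0

    # Moran's I: an observed value above its expectation indicates positive SAC
    Moran.I(conc, w)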

[Figure: world map of phosphate concentration in the oceans; brighter areas indicate higher concentrations.]
Many environmental variables (phosphate concentration shown here) are spatially autocorrelated. The similarity in color between nearby grid cells indicates autocorrelation. This pattern results from processes of transport (e.g. ocean circulation).

Making things worse, where one variable is autocorrelated, others probably are, too! Two variables could both be autocorrelated either because one depends on the other (“induced” SAC) or because the same underlying process structures both variables (“true” SAC). In a biological system, for example, SAC in vegetation structure could induce SAC in pollinator communities. In contrast, atmospheric circulation causes SAC in both temperature and precipitation patterns. (In the latter case, not only would temperature and precipitation maps each contain SAC, but the two would also correlate positively with each other, which could cause further problems, such as collinearity if both serve as predictors in the same model.)

Spatial autocorrelation is positive in the large majority of cases. However, like any other type of correlation, autocorrelation can be negative: nearby items are more different from each other than items slightly farther apart. Imagine a flower bed where the gardener arranged plants to maximize the contrast between adjacent colors. Reds might never occur next to one another, but might often occur three or four plants apart.
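
As a sanity check on that idea, here’s a minimal sketch in R, again using Moran’s I from the ape package. The checkerboard grid is a made-up stand-in for the flower bed:

    library(ape)

    # A made-up stand-in for the flower bed: an 8 x 8 checkerboard in which
    # adjacent cells always differ (maximal contrast between neighbors)
    grid <- expand.grid(x = 1:8, y = 1:8)
    z <- (grid$x + grid$y) %% 2

    # Rook-adjacency weights: 1 for cells one step apart, 0 otherwise
    d <- as.matrix(dist(grid))
    w <- 1 * (d == 1)

    # Moran's I comes out strongly negative (exactly -1 for this pattern)
    Moran.I(z, w)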

How to account for spatial autocorrelation

The good news is that there are a variety of methods to account for spatially structured data. Each comes with trade-offs, and none stands above the rest as a “gold standard.” You can apply any of the methods below to data with either spatial or phylogenetic autocorrelation. (I have yet to find an analysis that controls for both kinds of autocorrelation at the same time in a symmetrical and rigorous way.) Below, I weigh the pros and cons of three main statistical approaches to accounting for autocorrelation.

[Figure: a monarch butterfly on a flower.]

Autocorrelation in one variable (e.g. plant distribution) could cause the same spatial pattern in another variable that depends on it (e.g. pollinator distribution). It’s wise to think about the process that causes the pattern for which you are designing an analysis.

1. Generalized Least Squares

I consider Generalized Least Squares (GLS) the simplest method in theory and in practice. It is easily adapted for phylogenetic correlation, in which case it is equivalent to Felsenstein’s independent contrasts. If the error distribution is non-normal, there is a related option: Generalized Linear Mixed Models (GLMM). For all of these reasons, many people have used GLS on real data sets, which in turn makes it easy to find examples and explanations of the method. Unfortunately, statisticians don’t seem to find the topic very interesting, so much of the numerical optimization behind the scenes remains underdeveloped; I have run into difficulty when a dataset is big or contains complex structures. In addition, I have dug into R packages with GLS functions, and some do not contain enough documentation for me to feel comfortable publishing analyses based on them.
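
As a starting point, here’s a minimal sketch using gls() from the nlme package. The data frame and the exponential correlation structure are illustrative assumptions, not a recipe:

    library(nlme)

    # Simulated stand-in data: a response with a spatial trend, one
    # predictor, and sampling coordinates
    set.seed(1)
    dat <- data.frame(lon = runif(100), lat = runif(100), x1 = rnorm(100))
    dat$y <- 2 * dat$x1 + 1.5 * dat$lon + rnorm(100, sd = 0.5)

    # GLS with an exponential spatial correlation structure on the
    # residuals; other structures (corGaus, corSpher) swap in the same way
    fit_gls <- gls(y ~ x1, data = dat,
                   correlation = corExp(form = ~ lon + lat))
    summary(fit_gls)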

2. Generalized Estimating Equations

In contrast to GLS, Generalized Estimating Equations (GEE) have a strong mathematical basis. However, perhaps exactly because of this equation-heavy nature (as the name suggests), I have seen fewer cases where people used the method. Paradis and Claude wrote about GEE in detail and claimed that it handles discrete states (e.g. in species traits) better than GLS does. Both GEE and GLS estimate correlation structures in the form of a matrix; a technical difference is that the matrix reflects correlation in GEE and variance-covariance in GLS. Also, GEE subsets the data into clusters as a preliminary step, which gets around some of the optimization problems I mentioned above for GLS. As a starting point for implementing GEE, I suggest the supplemental code by Dormann and others.
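
For a flavor of what that looks like, here’s a minimal sketch with geeglm() from the geepack package (not the Dormann code itself). Grouping grid cells into spatial blocks is an illustrative assumption:

    library(geepack)

    # Simulated stand-in: 100 grid cells in 25 spatial blocks, with extra
    # similarity among cells that share a block
    set.seed(2)
    dat <- data.frame(block = rep(1:25, each = 4), x1 = rnorm(100))
    dat$y <- 2 * dat$x1 + rep(rnorm(25), each = 4) + rnorm(100)

    # Exchangeable working correlation within blocks, independence between
    fit_gee <- geeglm(y ~ x1, family = gaussian, data = dat,
                      id = block, corstr = "exchangeable")
    summary(fit_gee)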

3. Eigenvector Maps

There are good examples of analysis with Eigenvector Maps (EM) to draw on from the literature, as well as many variations (e.g. Moran’s Eigenvector Maps, Principal Coordinate Analysis of Neighbour Matrices, Asymmetric Eigenvector Maps, Phylogenetic Eigenvector Regression/Mapping, plus the related category of autocovariate regression). One downside is that results from the method are tricky to interpret: rather than a single kind of output, the response variance is partitioned into four components representing different combinations of causal factors.

EM works by adding predictors that soak up autocorrelation in an ordinary regression. Just as with a plain old linear model, the analyst has to decide which predictor variables to include. There is a limit to how many predictors can go into a model (degrees of freedom), so you have to leave some of the extra EM predictors out; in my opinion, this step introduces a risky amount of subjectivity. Also, I think most people have not thought enough about the fact that SAC reduces the effective degrees of freedom.
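
To illustrate, here’s a minimal sketch of the PCNM variant using pcnm() and varpart() from the vegan package. The simulated data and the choice of three eigenvectors are illustrative assumptions:

    library(vegan)

    # Simulated stand-in data: coordinates, one predictor, and a response
    # with a spatial trend
    set.seed(3)
    coords <- data.frame(x = runif(50), y = runif(50))
    x1 <- rnorm(50)
    y_resp <- 2 * x1 + coords$x + rnorm(50, sd = 0.5)

    # Build spatial eigenvectors (PCNM, one variant of eigenvector maps)
    em <- as.data.frame(pcnm(dist(coords))$vectors)

    # Add a subset of eigenvectors as extra predictors in an ordinary
    # regression; choosing which ones to keep is the subjective step
    dat <- data.frame(y_resp, x1, em)
    fit_em <- lm(y_resp ~ x1 + PCNM1 + PCNM2 + PCNM3, data = dat)
    summary(fit_em)

    # Partition response variance into environment-only, shared,
    # space-only, and unexplained fractions (the four components above)
    varpart(y_resp, data.frame(x1), em[, 1:3])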

Technically, the added predictors in EM are eigenvectors of a matrix describing the spatial relationships among sampling locations; each one captures a spatial pattern at a particular scale. It’s tempting but misleading to match single eigenvectors to patterns in the response data. The correct approach is to focus on sets of eigenvectors relevant to a given scale; again, this complicates interpretation of results.

Further resources

I applaud the detailed review that Dormann and others wrote on methods to account for spatial autocorrelation, including a couple not discussed here (e.g. Generalized Additive Models, which are not inherently spatial). Here are the citations for that work and for others that I mentioned or have found helpful:

  1. Beale CM, Lennon JJ, Yearsley JM, Brewer MJ, Elston DA. 2010. Regression analysis of spatial data. Ecology Letters 13(2):246-264.
  2. Dormann CF, McPherson JM, Araújo MB, Bivand R, Bolliger J, Carl G, Davies RG, Hirzel A, Jetz W, Kissling WD, et al. 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30(5):609-628.
  3. Dray S, Legendre P, Peres-Neto PR. 2006. Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modelling 196(3-4):483-493.
  4. Paradis E, Claude J. 2002. Analysis of comparative data using generalized estimating equations. Journal of Theoretical Biology 218(2):175-185.