An essential but feared topic in spatial analysis is spatial autocorrelation (SAC). If this term is new to you, check out my primer on autocorrelation in ecology. If you work with spatial data in R (or plan to do so) and want to calculate the degree to which SAC could affect your investigation, read on! If you already know about SAC and have decided it is confounding your analysis, the good news is that there are a variety of sophisticated techniques you can apply, which I review in another post.
Here, I simulate spatial data and work through simple ways to:
- estimate the spatial scale of autocorrelation
- calculate the inflation of degrees of freedom
- aggregate data to decrease SAC
The first step is to generate autocorrelated data on a grid (raster). Santiago Begueria wrote a detailed post about doing this with the gstat package. Alternatively, it’s possible to accomplish the same thing using the fields package. In geospatial analysis on continental scales, I might use an equal-area projection (e.g. Eckert VI). For computational efficiency and simplicity here, however, let us assume that the study area is small and near the equator, so we will work with lat/long as Cartesian coordinates and disregard the Earth’s curvature.
gridDim <- 60                                # number of cells per side
xy <- expand.grid(x=1:gridDim, y=1:gridDim)  # coordinates of a 60 x 60 grid
Now let’s simulate data with known SAC. In this case, there are two independent rasters that each contain SAC, and a third raster that combines the first two plus normal error. I am making a simplification in using independent predictors: most environmental variables co-vary (e.g. surface temperature and precipitation on land, or water depth and light availability in the sea). This is why in the previous post I recommended thinking about the process that generated the SAC in the first place. Also note that if the error were not normally distributed, I would recommend analyzing the data with Generalized Linear Mixed Models (GLMM) or Generalized Estimating Equations (GEE).
library(gstat)
# Variogram model, with defined sill and range
varioMod <- vgm(psill=0.05, range=10, model='Exp')
# Set up a gstat object for unconditional simulation
# (dummy=TRUE; beta sets the mean of the simulated field)
zDummy <- gstat(formula=z~1, locations = ~x+y, dummy=TRUE,
    beta=1, model=varioMod, nmax=20)
# Simulate 2 spatially autocorrelated fields to serve as predictors
set.seed(3)
xyz <- predict(zDummy, newdata=xy, nsim=2)
# Generate an autocorrelated response variable:
# Deterministic on predictors, with normal error
e <- rnorm(nrow(xyz), sd=.5)
xyz$resp <- .7*xyz$sim1 + 1.4*xyz$sim2 + e
What does this SAC look like?
library(sp)
test <- xyz
gridded(test) = ~x+y  # promote the data frame to a gridded spatial object
spplot(test[1])       # map the first simulated field
Even this simple simulation looks remarkably like an environmental map. If the range had been set at a larger value (e.g. 20), each blotch of color would be wider.
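To see this, here is a minimal sketch that reuses the objects above and re-runs the simulation with the range doubled to 20; the resulting patches should be visibly broader:
# Same simulation as above, but with a larger range
varioMod20 <- vgm(psill=0.05, range=20, model='Exp')
zDummy20 <- gstat(formula=z~1, locations = ~x+y, dummy=TRUE,
    beta=1, model=varioMod20, nmax=20)
set.seed(3)
wide <- predict(zDummy20, newdata=xy, nsim=1)
gridded(wide) = ~x+y
spplot(wide[1])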
The "range" of a variogram is the point at which the curve levels off, i.e. the maximum distance at which data are autocorrelated. The "sill" is the SAC that occurs at a distance equal to the range.
Estimate scale of autocorrelation
There are many packages and functions to calculate SAC as a function of distance. Ecologists tend to plot a metric called Moran’s I, where 0 indicates no correlation. Geoscientists tend to use variograms, which plot semivariance against distance and so show roughly the inverse pattern: semivariance rises as correlation decays.
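To illustrate the first option, here is a minimal sketch using Moran.I from the ape package (assumed installed); because the weight matrix grows with the square of the sample size, the sketch subsets to the first 400 cells:
library(ape)
sub <- xyz[1:400, ]
# Inverse-distance weights; zero the diagonal to drop self-comparisons
w <- 1/as.matrix(dist(sub[, c("x", "y")]))
diag(w) <- 0
Moran.I(sub$resp, w)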
I prefer the “variogram” function in gstat, for several reasons. First, it can calculate great circle distances, which would be convenient if the data extended over a large part of the planet’s surface. Also, this function will plot a nice map (set “map=TRUE” and define the “width” argument). Lastly, if you have reason to suspect that SAC is aligned in a certain direction (e.g. down a slope or along a river), an optional argument “alpha” will calculate correlation along any given axis. Alternatively, the related function “variogramST” can handle spatio-temporal correlation. The “Variogram” function in nlme can calculate a variogram from standardized residuals.
# Promote to a spatial object, then compute the empirical variogram
spdf <- SpatialPointsDataFrame(xy, data=xyz)
vario <- variogram(resp~1, data=spdf, cutoff=0.5*gridDim)
plot(vario$dist, vario$gamma)
In this realization of the simulation, the range appears a bit larger than the value set in the code above (10 units); the variogram suggests a range of ~15 units. (With an exponential model the curve approaches its sill asymptotically, so the apparent range on a plot exceeds the range parameter itself.)
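Rather than reading the range off the plot by eye, we can ask gstat to fit a variogram model to the empirical values; a minimal sketch, using the true model as starting values for the fit:
# Fit an exponential model; starting values need not be exact
fitMod <- fit.variogram(vario, model=vgm(psill=0.05, range=10, model='Exp'))
fitMod  # fitted psill and range parameter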
Aggregate data, decrease autocorrelation
There are many ways to account for SAC, including the models I reviewed in a previous post. An alternative approach is to treat the SAC prior to analysis, perhaps even during data collection. Specifically, the goal is to sample at a spatial grain larger than the range. Continuing with the example data above, and taking advantage of the squareness of the initial extent, we can subset to the grid of cells spaced 15 units apart. (There are many equivalent ways to implement this step in R.)
rng <- 15                           # range estimated from the variogram
mxPosition <- floor(gridDim/rng)    # number of coarse cells per side
keepPosition <- rng*(1:mxPosition)  # coordinates spaced one range apart
keepX <- which(xy$x %in% keepPosition)
keepY <- which(xy$y %in% keepPosition)
bigGrain <- xyz[intersect(keepX, keepY),]  # 16 cells, 15 units apart
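As a sanity check (a sketch; with only 16 cells the empirical variogram is noisy), we can recompute the variogram on the aggregated data and confirm that semivariance sits near the sill even at the shortest lags:
varioBig <- variogram(resp~1, data=bigGrain, locations= ~x+y,
    cutoff=2*rng)
plot(varioBig$dist, varioBig$gamma)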
Now we have a smaller raster, in which values of adjacent cells correlate weakly if at all. How does analysis on this raster compare to analysis on the original data?
Regression
In this section we will build a regression on data at both a coarse and fine resolution, and then compare the performance. First we predict the response variable from the coarse-grain data:
fmla <- as.formula(resp ~ sim1 + sim2)
lmBig <- lm(fmla, data = bigGrain)
summary(lmBig)
# Call:
# lm(formula = fmla, data = bigGrain)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -1.0081 -0.3124  0.1485  0.3659  0.6990
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -0.5547     1.2599  -0.440   0.6670
# sim1          0.5998     0.7152   0.839   0.4168
# sim2          1.8767     0.7237   2.593   0.0223 *
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.5699 on 13 degrees of freedom
# Multiple R-squared:  0.3427, Adjusted R-squared:  0.2416
# F-statistic: 3.389 on 2 and 13 DF,  p-value: 0.06538
The estimated beta coefficients fall within about one standard error of the true values (0.7 and 1.4).
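A quick arithmetic check of that claim, as a sketch using the fitted object above:
# Distance of each slope estimate from its true value,
# in units of the estimate's standard error
abs(coef(lmBig)[c("sim1", "sim2")] - c(0.7, 1.4)) /
    summary(lmBig)$coefficients[c("sim1", "sim2"), "Std. Error"]
# sim1: ~0.14 SE; sim2: ~0.66 SE
Let’s repeat this regression on the fine-grain data: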
lmFine <- lm(fmla, data = xyz)
summary(lmFine)
# Call:
# lm(formula = fmla, data = xyz)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.73515 -0.33877 -0.00586  0.32890  1.76475
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.10916    0.05731  -1.905   0.0569 .
# sim1         0.72254    0.04107  17.591   <2e-16 ***
# sim2         1.47054    0.04145  35.477   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4979 on 3597 degrees of freedom
# Multiple R-squared:  0.3183, Adjusted R-squared:  0.3179
# F-statistic: 839.8 on 2 and 3597 DF,  p-value: < 2.2e-16
Note the huge inflation in degrees of freedom because of the extra cells, and hence the much smaller standard errors than in the coarse-grain model.
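How inflated? A back-of-envelope sketch (a heuristic, not a formal estimator): if each range-by-range block of cells contributes roughly one independent observation, the nominal sample size overstates the effective sample size by a large factor.
nNominal <- nrow(xyz)          # 3600 cells
nEffective <- (gridDim/rng)^2  # ~16 roughly independent blocks
nNominal/nEffective            # information overstated ~225-fold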
At this point it may be tempting to conclude that the finer-grain data yield a more informative model, by any common performance metric. However, this is an unfair comparison of models. The goal is not to minimize a metric like a p-value to satisfy critics, but to estimate, accurately and precisely, 1) the underlying relationship and 2) our uncertainty about that estimate. If we subset the fine-grain data to the same number of cells as the coarse-grain data, for instance by selecting a corner of the original grid, we find that the original raster does not contain as much independent information as we may have thought.
# Select the same number of cells (16) from one corner of the grid
cornrPosition <- 1:length(keepPosition)
keepX <- which(xy$x %in% cornrPosition)
keepY <- which(xy$y %in% cornrPosition)
cornrFine <- xyz[intersect(keepX, keepY),]
lmCornr <- lm(fmla, data = cornrFine)
summary(lmCornr)
# Call:
# lm(formula = fmla, data = cornrFine)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -0.71805 -0.29791  0.00389  0.28861  0.61303
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.2528     1.9329   0.131    0.898
# sim1          0.3068     1.1960   0.257    0.802
# sim2          1.2137     1.6056   0.756    0.463
#
# Residual standard error: 0.4306 on 13 degrees of freedom
# Multiple R-squared:  0.04222, Adjusted R-squared:  -0.1051
# F-statistic: 0.2865 on 2 and 13 DF,  p-value: 0.7555
The coarse-grain data provide more precise estimates of the relationship than an equally sized sample of fine-grain data. In this case, more is not better: the additional cells add redundant information. In other words, closely spaced observations pseudoreplicate each other, violating the model’s assumption that observations are independent.
So what?
Collecting spatial data takes time, resources, and effort. Due diligence in deciding the appropriate resolution and extent of a study up front could simplify later analysis massively. Variograms are useful tools to calculate a resolution that will yield the most information without violating model assumptions of independence. After data collection, variograms can justify succinctly whether or not subsequent analysis should account for spatial autocorrelation. Aggregation is a quick means to lessen the degree of autocorrelation prior to analysis; post hoc methods can be effective as well, but are often more complex and computationally intensive.