Does anyone know a statistical test for telling me if I have an outlier within a set of spatial point data? It seems like someone should have invented said method in the 1960s and I just can’t find it through my googling. But I do read a fair bit of GIS and geostatistics literature, and I’ve never seen it. (Or, gasp, someone tried to do it and concluded it was intractable…) Guess I’ll have to make my own.

So here’s my situation: I have some point data – just normal latitude and longitude coordinates – with an associated covariate. Let’s call the covariate m for now. I want to draw a polygon around my points and argue that the resulting shape can be defined as the boundary of a neighborhood. Except I’m worried that there are really high, or really low, values of m near the border of the neighborhood, and thus my resulting polygons are potentially skewed toward/away from these “spatial outliers.”

As an illustration, imagine that a potato farmer wants to spray her field for aphids, but only wants to spray the affected areas. Logically, she decides to randomly sample 10 locations within her field; draw a polygon around the locations where she finds aphids; and then spray all the area inside the polygon (and none of the area outside the polygon). You could check out UC-Davis’s excellent website for potato pest management to confirm that this is not exactly the correct method, but it’s a reasonable approximation (although their recommendation on sampling is radically different).

Our potato farmer might observe something like this. The labels are the aphid counts at each location. I’ll explain the diamond symbols in a moment…. Remember that the “polygon of infestation” is drawn according to a presence/absence dichotomy, so the aphid count is the covariate in this situation. (And, yes, it has already occurred to me that everything I’m writing here might only apply to this special case where the polygon is based on a dichotomized version of an underlying count variable, m. But that additional complication is for future blogage….)

Here’s the data for those that want to play along at home:
```
 i          x        y  m
 1  -118.8682  46.1734 10
 2  -118.8687  46.1737  3
 3  -118.8683  46.1738  0
 4  -118.8685  46.1732  0
 5  -118.8688  46.1735  0
 6  -118.8681  46.1732  0
 7  -118.8686  46.1733  4
 8  -118.8685  46.1734  9
 9  -118.8684  46.1737  2
10  -118.8686  46.1736  1
```

I am trying to calculate the amount by which any given point might be considered an outlier – either an outlier in terms of the distribution of the covariate, a spatial outlier, or both. To me, this sounds like a perfect situation for calculating leverages via the hat matrix. You may remember from an intro regression course – or maybe you don’t, because, sadly, regression is usually not taught from a matrix-algebra point of view – that the hat matrix is an n × n square matrix which puts a hat on the observed Y variable:
``` y-hat = H * y ```
More importantly for my purposes, the diagonal elements of any hat matrix (usually denoted h_ii) indicate the amount of leverage that observation i has on y-hat. And even better for my purposes, I don’t need a Y variable to calculate a hat matrix, because it’s composed entirely of the design matrix and a few transformations thereof:
``` H = X * ( t(X) * X )^{-1} * t(X) ```
where X is the n × p design matrix of n observations and p independent variables; t(X) is the transpose of X; and ( t(X) * X )^{-1} is the usual matrix inverse of t(X) * X.

My design matrix, X, has the latitude, longitude, and aphid counts from my potato example. When I calculate a hat matrix for it, the resulting values are indicators of leverage — both spatially and on the covariate. Now look back up there at that map. See those diamond shapes? They’re proportionally sized based on the value of the diagonal of the hat matrix (h_ii) for each observation. And what do we conclude from this little example?

By looking at the raw aphid counts, our potato farmer may have been tempted to enlarge the spraying zone around the points with nine and 10 aphids — they seem like rather high values and they’re both kind of near the edge of the polygon. However, the most “outlierly” observation in her sample was the spot with four aphids located at the southwest corner of the polygon.
It has a hat value of 0.792488, a good bit larger than the location with 10 aphids, which has a hat value of 0.615336.
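For those playing along at home, the calculation boils down to something like the following R sketch. One assumption worth flagging: I take the design matrix to be exactly the three columns x, y, m, with no intercept column, so your hat values may differ slightly if you set it up another way.

```
# potato data from the table above
x <- c(-118.8682, -118.8687, -118.8683, -118.8685, -118.8688,
       -118.8681, -118.8686, -118.8685, -118.8684, -118.8686)
y <- c(46.1734, 46.1737, 46.1738, 46.1732, 46.1735,
       46.1732, 46.1733, 46.1734, 46.1737, 46.1736)
m <- c(10, 3, 0, 0, 0, 0, 4, 9, 2, 1)

# design matrix: longitude, latitude, aphid count (no intercept)
X <- cbind(x, y, m)

# H = X (X'X)^{-1} X'; we only need the diagonal
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)
round(h, 6)
```

As a sanity check, the diagonal of a projection matrix always lies in [0, 1] and sums to p (here 3, the number of columns in X).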

At this point, a good geostatistician could probably come up with a measure of significance to go along with my hat values, but I’m not a geostatistician – good or otherwise. I just Monte Carlo-ed the values a bit and concluded… given this arrangement of sample points *with aphids,* about 11% of the time we would see a hat value equal to or above 0.792488. If we use the standard alpha level of 0.05 found in most social science publications, our potato farmer would fail to reject the null hypothesis that the observed aphid counts were drawn from a random distribution. I.e. there aren’t any outliers – beyond what we would expect from randomness – so she should trust the polygon as a good boundary of the zone of infestation. (Note my emphasis of “with aphids” in the conclusion. I could have Monte Carlo-ed the points with zero counts, but chose not to because, laziness. Not sure if that changes the conclusions…)
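A rough sketch of that Monte Carlo step is below. The permutation scheme here – shuffling the observed counts among the fixed aphid-positive locations and tracking the largest hat value each time – is one plausible reading of the procedure, not necessarily the exact code that produced the 11% figure.

```
# aphid-positive locations only, and their observed counts
x <- c(-118.8682, -118.8687, -118.8686, -118.8685, -118.8684, -118.8686)
y <- c(46.1734, 46.1737, 46.1733, 46.1734, 46.1737, 46.1736)
m <- c(10, 3, 4, 9, 2, 1)

# diagonal of the hat matrix for a design matrix X
hat.diag <- function(X) rowSums((X %*% solve(crossprod(X))) * X)

observed.max <- max(hat.diag(cbind(x, y, m)))

# shuffle the counts among the fixed locations and recompute the max hat value
set.seed(42)
sims <- replicate(1000, max(hat.diag(cbind(x, y, sample(m)))))

# approximate p-value: how often does chance alone meet or beat the observed max?
mean(sims >= observed.max)
```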

So? Two things: 1) I would love to find out that someone else invented a better method for detecting spatial outliers in point pattern data; and 2) hat matrices are really useful.

And because one dataset is never enough, I downloaded a version of John Snow’s cholera data that Robin Wilson digitized from the original maps. Same procedure here, except color indicates the number of deaths at each location and the dot circumference indicates the hat values. That point in the middle had 15 deaths when most of the addresses had one or two deaths; this created a hat value of 0.28009. Given 1000 Monte Carlo trials, less than 2% of the draws showed a hat value higher than 0.28009. So even though this point looks to be quite close to the middle of the points, it is likely to be a spatial outlier – above and beyond what we would expect given randomness.

Comments? Suggestions for improvements? Pictures of John Snow wearing dapper hats?

• Oh, yeah. It wasn’t just laziness that kept me from doing Monte Carlo runs on the full set of points, i.e. including the zeros. I left them out on purpose, because in my real application for this procedure, I don’t really have any zero values. Kind of like with the cholera data, the points are only in the dataset where there are positive (non-zero) values.

• Trey

I don’t have suggestions for improvement per se, but I do wonder about when this method is applied to actual practice. For instance, say the potato farmer identifies the four-aphid spot as an outlier and decides not to include it in her spraying plan due to cost or potentially negative effects from spraying. However, the method by which that spot came to have four aphids is important — if aphids can diffuse to nearby plants and go unsprayed, that throws off your entire spraying strategy. Then again, this seems like a classic case of “which do you want to minimize — Type I or Type II errors?”

• The claims about significance from the Monte Carlo draws are admittedly the weakest part of this approach. I had originally hoped that there would be closed-form solutions for the maxima of the distribution of hat values, given the polygon and the range of m values; but if there is such a beast, I couldn’t derive it. Maybe our potato farmer should abandon null-hypothesis-type reasoning altogether and just know that the location with 10 aphids is less of an outlier than the location with 4 aphids. (Given what I know of farmers, they’re very risk-averse, so they spray the whole field, disirregardless.)

• Isn’t this just a special case of multivariate outlier detection? Below see the discussion of it from my dissertation, which also has the source citations as well as a specific instance of its use:

“Multivariate Outlier Detection. Multivariate outlier detection is conducted
through a visual inspection of Cook’s (1977) distance and the standardized residuals.
Both of these approaches first regress all indicators in the model against an ID variable.
Cook’s distance is a measure of influence that reports how the regression coefficients
would change if an observation—that is, a participant—is excluded (Stevens, 1984). In
this study, Cook’s Distance is calculated through the cookd function of the car module
(Fox, 2009) in R (R Core Development Team, 2009). The standardized residual or z
score is conducted using the stdres function of the MASS module (Venables & Ripley,
2002) in R. Cook’s distance and standardized residuals are calculated, plotted, and then
visually inspected. Outliers will be considered to be present in the data if any
observations has an absolute standardized residual greater than Tabachnick and Fidell’s
(2006) recommended cut-off of 3.29 (p. 73), or if 5% or more of the observations have a
Cook’s distance greater than Cohen, Cohen, West, and Aiken’s (2003) suggested cut-off of
1.0 (p. 404).”

http://digitalcommons.unl.edu/dissertations/AAI3402936/

• Those are great citations in your excerpt. And I began by thinking about this in much the same way … except I’m not running any regressions here so I don’t have any residuals – studentized or otherwise. I also thought of simply correlating the value of m with the distance to the polygon’s centroid, which might work out algebraically to the same thing.

• The regression, IIRC, is simply to a nonsense index variable. So the regression isn’t to a meaningful formula, just with the index as Y, and your two spatial dimensions (if that’s all you have) as X1 and X2… though it probably makes more sense to view it as true multidimensional data, and other meaningful indicators as X3…. Xn.

• Are you suggesting:

z = X + Y + m

where z is just an index (1, 2, …, n)? I hadn’t thought of that idea.
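If that’s the suggestion, a minimal sketch of the check might look like this – using base R’s cooks.distance and rstandard in place of the car/MASS helpers cited in the excerpt:

```
# the ten points from the post
x <- c(-118.8682, -118.8687, -118.8683, -118.8685, -118.8688,
       -118.8681, -118.8686, -118.8685, -118.8684, -118.8686)
y <- c(46.1734, 46.1737, 46.1738, 46.1732, 46.1735,
       46.1732, 46.1733, 46.1734, 46.1737, 46.1736)
m <- c(10, 3, 0, 0, 0, 0, 4, 9, 2, 1)

# regress a nonsense index variable on the indicators
idx <- seq_along(m)
fit <- lm(idx ~ x + y + m)

# influence measures from the index regression
cooks.distance(fit)
rstandard(fit)
```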

I did run the regression:

m = X + Y + X^2 + Y^2

which is commonly called a trend surface regression. I.e. it predicts the variable of interest, e.g. aphid counts, based on the location of the point. (The exact form of the exponential terms is up to the researcher.) This regression gives us hat values, studentized residuals, and Cook’s D; but I had two problems with this approach:

a) in small samples, the in/exclusion of certain quadratic terms makes a lot of difference in the results;

b) it assumes a continuous process, which might not be applicable for all types of point data.
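For concreteness, that trend surface fit looks roughly like this in R. Centering the coordinates before squaring is my own addition – it doesn’t change the hat values (the model is a linear reparameterization of the uncentered one), but it keeps the quadratic terms well conditioned:

```
# the ten points from the post
x <- c(-118.8682, -118.8687, -118.8683, -118.8685, -118.8688,
       -118.8681, -118.8686, -118.8685, -118.8684, -118.8686)
y <- c(46.1734, 46.1737, 46.1738, 46.1732, 46.1735,
       46.1732, 46.1733, 46.1734, 46.1737, 46.1736)
m <- c(10, 3, 0, 0, 0, 0, 4, 9, 2, 1)

# center the coordinates before squaring
xc <- x - mean(x)
yc <- y - mean(y)

# quadratic trend surface: predict the count from location alone
fit <- lm(m ~ xc + yc + I(xc^2) + I(yc^2))

# the three influence diagnostics mentioned above
hatvalues(fit)
rstudent(fit)
cooks.distance(fit)
```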

• I’m not sure how statistically rigorous it is, but DBSCAN (and related algorithms) identify outliers. You could possibly iterate the epsilon to see ‘how much of an outlier’ different points are?

• I’ve never heard of that one in particular, but I looked into a lot of techniques for splines and kriging. Essentially those things are local density estimators, and they work great – probably giving much more visually-pleasing and intuitive results. But – as far as I can tell – they all require some selection/tuning of arbitrary parameters. It’s usually something like a “bandwidth.” Eventually I’d like to claim that my method is an “assumption-free” method of spatial outlier detection. That’s the hope anyways.

Is the idea to drop outlying points from the edges of an existing polygon or to generate polygons that exclude all outliers?

The reason I ask is that the leverage statistic doesn’t actually care about the spatial arrangement of points and, hence, we could imagine a scenario in which we create doughnut-like polygons in which we exclude a spatially typical observation (i.e. typical with respect to x and y) that has a large leverage statistic by virtue of being exceptional with respect to m. This problem doesn’t come up, however, if you only care about dropping points from the boundaries of an existing polygon.

• But that’s the whole point, Adam. The X and Y coordinates are in the design matrix, and thus, when I calculate the diagonal elements of the hat matrix, we can (I think) conclude that extreme hat values indicate outliers in a way that does care about the spatial arrangement. In the cholera map, that point with 15 deaths is such an extreme outlier on the m variable that, even though it’s right in the middle of X and Y, it still has the highest hat value.

And on the issue of dropping outliers: I’m not sure yet myself; though, “what do we do with outliers?” has never been a settled question, so I don’t feel too bad about not quite having that part figured out, yet.

I think we agree on the basic facts but disagree on the conclusion here. Let’s say we could spray for cholera the same way we could spray for aphids. If the idea is to drop outliers, this would mean NOT spraying for cholera in precisely the place where you need to spray the most! See what I’m getting at here?

Getting back to your data, try adding an 11th point at the mean of x and the mean of y (i.e. a spatially typical observation) with an aphid count of 30. By my calculation, the new observation would have a leverage value of 0.856, way higher than any other observation in the sample. I don’t know how you did your Monte Carlo-ing so I can’t work this example through entirely, but you can see where this is going. Play around with the aphid count on the 11th observation and see where exactly it becomes significant.
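Here’s a sketch of that experiment, again assuming the bare [x, y, m] design matrix with no intercept – if your setup differs, the exact leverage value will too:

```
# original ten points
x <- c(-118.8682, -118.8687, -118.8683, -118.8685, -118.8688,
       -118.8681, -118.8686, -118.8685, -118.8684, -118.8686)
y <- c(46.1734, 46.1737, 46.1738, 46.1732, 46.1735,
       46.1732, 46.1733, 46.1734, 46.1737, 46.1736)
m <- c(10, 3, 0, 0, 0, 0, 4, 9, 2, 1)

# add an 11th, spatially typical point with an extreme count
x <- c(x, mean(x))
y <- c(y, mean(y))
m <- c(m, 30)

# diagonal of the hat matrix, computed without forming the full n x n matrix
X <- cbind(x, y, m)
h <- rowSums((X %*% solve(crossprod(X))) * X)
h[11]  # leverage of the new point
```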

The other thing to note is that the trace of the hat matrix (i.e. the sum of the leverage statistics) is always equal to p—the number of columns in the original design matrix (i.e. the number of independent variables). This means that the distribution of hat values for any given design matrix is going to be governed in part by the dimensions of the matrix in question. I’m not sure how this plays into a Monte Carlo-based significance test, but it is important to keep in mind.
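That trace property is easy to check numerically for any full-rank design matrix:

```
set.seed(1)
# any full-rank n x p design matrix will do
X <- matrix(rnorm(200 * 4), nrow = 200)

# diagonal of the hat matrix
h <- rowSums((X %*% solve(crossprod(X))) * X)
sum(h)  # equals p = 4, up to floating-point error
```

Note that p counts every column of X, so an intercept column counts too.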

To get a sense of what the distribution of leverage statistics might look like, I did the following:

```
#libraries
library(MASS)

#generate a spherical multivariate normal sample (n = 10000, p = 3)
z <- mvrnorm(10000, mu = c(0, 0, 0), Sigma = diag(3))

#generate the leverage statistics (the diagonal of the hat matrix),
#computed row-wise so we never form the full 10000 x 10000 matrix
z.hat <- rowSums((z %*% solve(crossprod(z))) * z)

#graph it!
hist(z.hat)
```

I have no idea what this shows, but you could use this framework to figure out how the distribution of leverage statistics changes depending on the properties of the design matrix. Let's get back to the idea of trying to identify polygons based on the spatial arrangement of points…

It seems to me that you are interested in a process which generates data characterized by a clustering of values around some central point, with the magnitude of values declining as we get further away from the center. Let's think about this as if it were a topography with the magnitude of any given point represented as height. What you want to do is draw the boundaries of the polygon so that they correspond to the places where the topography begins to level off. In theory, this could be done by modeling height with respect to x and y and then finding the values of x and y where the partial derivatives of the resulting curve go to zero.
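Under that reading, a quadratic surface gives the level-off point in closed form: for m = b0 + b1*x + b2*y + b3*x^2 + b4*y^2, the partials vanish at x* = -b1/(2*b3) and y* = -b2/(2*b4). A rough sketch with the aphid data (centered coordinates are my addition; whether a quadratic surface is the right model is, of course, the open question):

```
# aphid data, with centered coordinates
x <- c(-118.8682, -118.8687, -118.8683, -118.8685, -118.8688,
       -118.8681, -118.8686, -118.8685, -118.8684, -118.8686)
y <- c(46.1734, 46.1737, 46.1738, 46.1732, 46.1735,
       46.1732, 46.1733, 46.1734, 46.1737, 46.1736)
m <- c(10, 3, 0, 0, 0, 0, 4, 9, 2, 1)
xc <- x - mean(x)
yc <- y - mean(y)

# fit the quadratic surface and solve grad = 0 for the stationary point
fit <- lm(m ~ xc + yc + I(xc^2) + I(yc^2))
b <- coef(fit)
x.star <- mean(x) - b["xc"] / (2 * b["I(xc^2)"])
y.star <- mean(y) - b["yc"] / (2 * b["I(yc^2)"])
c(x.star, y.star)
```

Whether that stationary point is a peak rather than a saddle or trough depends on the signs of the quadratic coefficients, which is worth checking before drawing any boundaries from it.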