About Matt Moehr

I am a graduate student in sociology at the University of Wisconsin. The elevator speech of my research goes like this: "I'm trying to understand neighborhoods by analyzing the geography of photographs uploaded to the internet." On my more ambitious days, I sometimes say that I'm developing an alternative to census tracts for measuring neighborhood-level processes. But most days I'm just looking at pictures of cats. My current project has an equal number of methodological problems -- e.g. selection bias on the internet, the dreaded MAUP, and coding meaning in photographs -- and theoretical quandaries -- e.g. why does it seem that the Chicago School of sociology completely ignored Marx? In my free time I like to dabble in union activism, listen to live music, and make nachos. Usually in that order.

There have been repeated calls for “space” in many fields of social science (all links are behind paywalls, sorry):

  • Demography: (Voss 2007)
  • Sociology: (Gieryn 2000)
  • Epidemiology: for an early critical review (Jacquez 2000)
  • Geography: obviously geographers were into space before it was cool. A couple of pieces I like are Doreen Massey’s book, For Space and O’Sullivan (2006) for a review of GIS.
  • Anthropology: the proceedings of a conference including a piece by Clifford Geertz, Senses of Place (1996). Though what I’m writing here has less to do with the space/place debate.

These are nice papers about what the authors think should be new research agendas, but I think the social sciences need to stop calling for space and start “playing” with space. Let me explain…

This idea started when fellow Bad Hessian, Alex Hanna, suggested that I read a paper about spatio-temporal models of crime in Chicago. We are in the same writing group. Alex has suffered through many presentations of a paper I’m writing about crime in Chicago. Chicago? Crime? I mean these have to be related papers, right? So I gave it a quick read:

Seth R. Flaxman, Daniel B. Neill, Alex J. Smola. 2013. Correlates of homicide: New space/time interaction tests for spatiotemporal point processes. Heinz College working paper, available at: http://www.heinz.cmu.edu/faculty-and-research/research/research-details/index.aspx?rid=483

…and it’s a really great paper! Flaxman reviews three standard measures for spatial and temporal independence and then proposes a new measure that can simultaneously test for spatio-temporal dependence. The measures are validated against real crime data from Chicago. On the other hand, it’s also completely useless for my project. I mean, I stuck it in a footnote, but I can’t engage with it in a substantively meaningful way because my paper is about the Modifiable Areal Unit Problem and the good ol’ MAUP is fundamentally about polygons — not points. The MAUP occurs because a given set of points can be aggregated into any number of different polygon units, and the subsequent results of models, bivariate relationships, or even hot spot analysis might change based on the aggregation method.
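For readers who haven’t run into the MAUP before, here’s a toy illustration in R. The points and the variables a and b are simulated, so this is a sketch of the problem rather than anything from my project: the very same points, aggregated to two different grids, can give you two different cell-level correlations.

    ## MAUP in miniature: one set of points, two aggregation schemes,
    ## two (possibly quite different) cell-level correlations.
    set.seed(42)
    n   <- 1000
    pts <- data.frame(x = runif(n), y = runif(n),
                      a = rnorm(n), b = rnorm(n))
    cor_by_grid <- function(pts, k) {
      ## assign each point to a k x k grid cell, then correlate cell means
      pts$cell <- interaction(cut(pts$x, k), cut(pts$y, k), drop = TRUE)
      agg <- aggregate(cbind(a, b) ~ cell, data = pts, FUN = mean)
      cor(agg$a, agg$b)
    }
    cor_by_grid(pts, 4)    # aggregated to a 4 x 4 grid
    cor_by_grid(pts, 20)   # same points, 20 x 20 grid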

This means that Flaxman’s approach and my approach are not comparable because they each rest on different assumptions about how to measure distance, social interaction, and spatial dependence. They’re based on different spatial ontologies, if you will. But back to the main argument of this post: could we play around with the models in Flaxman, the models I’m making, plus some other models in order to test some of the implications of our ideas of space? Here are some hypothetical hypotheses….

Isotropy. Isotropy means that effects are the same in every direction. For example, weather models often take into account anisotropy because of prevailing wind direction. As Flaxman mentions at the end of the paper, alternative distance measures like Manhattan distance could be used. I would take it a step further and suggest that distance could be measured across a trend surface, which might control for higher crime rates on the south side of Chicago and in the near-west suburbs. Likewise, spatial regression models of polygon data can use polynomial terms to approximate trend surfaces. Do the additional controls for anisotropy improve model fit? Or change parameter estimates?
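A minimal sketch of what I mean by polynomial trend-surface terms, assuming a hypothetical data frame tracts with centroid coordinates x and y, an outcome crime, and a covariate poverty (all names here are mine, for illustration):

    ## Baseline model vs. one that adds a second-order trend surface
    ## in the tract centroids to soak up directional trends.
    m0 <- lm(crime ~ poverty, data = tracts)
    m1 <- lm(crime ~ poverty + x + y + I(x^2) + I(y^2) + x:y, data = tracts)
    anova(m0, m1)   # did allowing for a trend surface improve fit?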

Spatial discontinuities. A neighborhood model posits — albeit implicitly and in a sort of wishy-washy way — that there could be two locations that are very close as the crow flies, but are subject to dramatically different forces because they are in different polygons. These sharp breaks might really exist, e.g. “the bad side of the tracks”, red-lining, TIF funding, empowerment zones, rivers, gated suburbs. Or they might not. Point process models usually assume that space is continuous, i.e. that there are no discontinuities. Playing around with alternative models might give us evidence one way or another.

Effect decay. In spatial regression models like I’m using, it’s pretty normal to operationalize spatial effects for contiguous polygons and then set the effect to zero for all higher-order neighbors. As in the Flaxman paper, most point models use some sort of kernel function to create effect estimates between points within a given bandwidth. These are both pretty arbitrary choices that make spatial effects too “circular”. For example, think of the economic geographies of interstate exchanges in middle America. You’ll see fast food, big box retail, gas stations, car dealerships, hotels, etc. at almost every interchange. Certainly there is a spatial pattern here, but it’s not circular and it’s not (exponentially, geometrically, or linearly) decaying across distance. Comparisons between our standard models — where decay is constrained to follow parametric forms — and semi-parametric “hot spot” analyses might tell us if our models of spatial effects are too far away from reality.
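To make the arbitrariness concrete, here’s a sketch of the two choices using the spdep package in R. Both tracts (a polygon layer with a crime variable) and xy (a matrix of tract centroids, in projected meters) are hypothetical stand-ins, not real data:

    library(spdep)
    ## Choice 1: spatial effects only for first-order contiguous polygons
    w_contig <- nb2listw(poly2nb(tracts), style = "W")
    ## Choice 2: spatial effects for every neighbor within a 2 km band
    w_band <- nb2listw(dnearneigh(xy, d1 = 0, d2 = 2000),
                       style = "W", zero.policy = TRUE)
    ## Same variable, two definitions of "neighbor", possibly two answers:
    moran.test(tracts$crime, w_contig)
    moran.test(tracts$crime, w_band, zero.policy = TRUE)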

Ok. Those sound like valid research questions, so why not just do that research and publish some results? As I see it, spatial work in social sciences usually boils down to two main types of writing. First, there are the papers that aren’t terribly interested in the substantive research areas, and are more about developing statistical models or testing a bunch of different models with the same data. Here are some examples of that type:

  • Dormann et al. (2007) undertake a herculean task by explicating and producing R code for no less than 13 different spatial models.
  • Hubbard et al. (2010) compare GEE to mixed models of neighborhood health outcomes.
  • Tita and Greenbaum (2009) compare a spatial versus a spatio-social network as weighting matrices in spatial regression.

The problem with this approach is that the data is often old, simplified data from well-known example datasets. Or worse yet, it is simulated data with none of the usual problems of missing data, measurement error, and outliers. At best, these papers use oversimplified models. For example, there aren’t any control variables for crime even when there is a giant body of literature about the socio-cultural correlates of spatial crime patterns (Flaxman and I are both guilty of this).

The second type of research would be just the opposite: interested in the substantive conclusions and disinterested in the vagaries of spatial models. They might compare hierarchical or logistic regressions to the spatial regressions, but very rarely go in depth about all the possible ways of operationalizing the spatial processes they’re studying. And when you think about it, you can’t blame them, because journal editors like to see logical arguments for the model assumptions used in a paper – not an admission that we don’t know anything about the process under study and a bunch of different models all with slightly different operationalizations of the spatial process. But here’s the thing: we don’t actually know very much about the spatial processes at work! And we have absolutely no evidence that the spatial processes for, say, crime are also useful in other domains like educational outcomes, voting behavior, factory siting, human pathogens, or communication networks.

Thus, we don’t need more social science papers that do spatial models. We need (many) more social science papers that do multiple, incongruent spatial models on the same substantively rich datasets. I think it’s only by modeling, for example, crime as an isotropic point process, as a social network with spatial distance between nodes, and as a series of discrete neighborhood polygons that we can start to grasp whether one set of assumptions about space is more/less accurate and more/less useful. In case you couldn’t tell, I’m a big fan of George Box’s famous quote. This is the slightly longer version:

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” (Box & Draper 1987, 74)

Good luck, and go play!

[Update, 2013-07-22: I changed the citation to the Flaxman paper, as it is now a working paper in his department at Carnegie Mellon University.]

Does anyone know a statistical test for telling me if I have an outlier within a set of spatial point data? It seems like someone should have invented said method in the 1960s and I just can’t find it through my googling. But I do read a fair bit of GIS and geostatistics literature, and I’ve never seen it. (Or, gasp, someone tried to do it and concluded it was intractable…) Guess I’ll have to make my own.

So here’s my situation: I have some point data – just normal latitude and longitude coordinates – with an associated covariate. Let’s call the covariate m for now. I want to draw a polygon around my points and argue that the resulting shape can be defined as the boundary of a neighborhood. Except I’m worried that there are really high, or really low, values of m near the border of the neighborhood, and thus my resulting polygons are potentially skewed toward/away from these “spatial outliers.”

As an illustration, imagine that a potato farmer wants to spray her field for aphids, but only wants to spray the affected areas. Logically, she decides to randomly sample 10 locations within her field; draw a polygon around the locations where she finds aphids; and then spray all the area inside the polygon (and none of the area outside it). You could check out UC-Davis’s excellent website for potato pest management to confirm that this is not exactly the correct method, but it’s a reasonable approximation (although their recommendation on sampling is radically different).

Our potato farmer might observe something like this. The labels are the aphid counts at each location. I’ll explain the diamond symbols in a moment….

[Figure: map of the sampled potato field; labels are aphid counts, diamond symbols explained below]

Remember that the “polygon of infestation” is drawn according to a presence/absence dichotomy, so the aphid count is the covariate in this situation. (And, yes, it has already occurred to me that everything I’m writing here might only apply to this special case where the polygon is based on a dichotomized version of an underlying count variable, m. But that additional complication is for future blogage….)

Here’s the data for those that want to play along at home:
 i          x        y   m
 1  -118.8682  46.1734  10
 2  -118.8687  46.1737   3
 3  -118.8683  46.1738   0
 4  -118.8685  46.1732   0
 5  -118.8688  46.1735   0
 6  -118.8681  46.1732   0
 7  -118.8686  46.1733   4
 8  -118.8685  46.1734   9
 9  -118.8684  46.1737   2
10  -118.8686  46.1736   1
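
If you want to actually play along, here’s how I’d pull that table into R and draw the farmer’s polygon. I’m reading “draw a polygon around the locations where she finds aphids” as a convex hull, which is my assumption, not necessarily the only choice:

    ## Read the table above and draw the spray polygon
    potato <- read.table(header = TRUE, text = "
     i          x        y   m
     1  -118.8682  46.1734  10
     2  -118.8687  46.1737   3
     3  -118.8683  46.1738   0
     4  -118.8685  46.1732   0
     5  -118.8688  46.1735   0
     6  -118.8681  46.1732   0
     7  -118.8686  46.1733   4
     8  -118.8685  46.1734   9
     9  -118.8684  46.1737   2
    10  -118.8686  46.1736   1")
    hit  <- potato[potato$m > 0, ]         # locations with aphids
    hull <- hit[chull(hit$x, hit$y), ]     # vertices of the spray polygon
    plot(potato$x, potato$y, pch = 16, xlab = "longitude", ylab = "latitude")
    polygon(hull$x, hull$y, border = "darkgreen")
    text(potato$x, potato$y, labels = potato$m, pos = 3)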

I am trying to calculate the amount by which any given point might be considered an outlier: an outlier in terms of the distribution of the covariate, a spatial outlier, or both. To me, this sounds like a perfect situation to calculate leverages via the hat matrix. You may remember from an intro regression course – or maybe you don’t, because sadly, regression is usually not taught from a matrix algebra point of view – that the hat matrix is an n × n square matrix (where n is the number of observations) that puts a hat on the observed Y variable:

y-hat = H * y

More importantly for my purposes, the diagonal elements of any hat matrix (usually denoted h_ii) indicate the amount of leverage that observation i has on y-hat. And even better for my purposes, I don’t need a Y variable to calculate a hat matrix, because it’s composed entirely of the design matrix and a few transformations thereof:

H = X * ( t(X) * X )^{-1} * t(X)

where X is the n × p design matrix of n observations and p independent variables; t(X) is the transpose of X; and ^{-1} denotes the usual matrix inverse.
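
In R, that computation looks like the following. I’m assuming an intercept column in X, and the exact hat values will shift if you set the design matrix up differently:

    ## Leverages for the potato data (using 'potato' from above)
    X <- cbind(1, potato$x, potato$y, potato$m)   # intercept, lon, lat, count
    H <- X %*% solve(t(X) %*% X) %*% t(X)         # the hat matrix
    h <- diag(H)                                  # h_ii, one leverage per point
    round(h, 6)
    ## base R's hat(X, intercept = FALSE) returns the same diagonal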

My design matrix, X, has the latitude, longitude, and aphid counts from my potato example. When I calculate a hat matrix for it, the resulting values are indicators of leverage — both spatially and on the covariate. Now look back up there at that map. See those diamond shapes? They’re proportionally sized based on the value of the diagonal of the hat matrix (h_ii) for each observation. And what do we conclude from this little example?

By looking at the raw aphid counts, our potato farmer may have been tempted to enlarge the spraying zone around the points with nine and 10 aphids — they seem like rather high values and they’re both kind of near the edge of the polygon. However, the most “outlierly” observation in her sample was the spot with four aphids located at the southwest corner of the polygon. It has a hat value of 0.792488, a good bit larger than the location with 10 aphids, which had a hat value of 0.615336.

At this point, a good geostatistician could probably come up with a measure of significance to go along with my hat values, but I’m not a geostatistician – good or otherwise. I just Monte Carlo-ed the values a bit and concluded: given this arrangement of sample points *with aphids,* about 11% of the time we would see a hat value equal to or above 0.792488. If we use the standard alpha level of .05 found in most social science publications, our potato farmer would fail to reject the null hypothesis that the observed aphid counts were drawn from a random distribution. I.e. there aren’t any outliers – beyond what we would expect from randomness – so she should trust the polygon as a good boundary of the zone of infestation. (Note my emphasis of “with aphids” in the conclusion. I could have Monte Carlo-ed the points with zero counts too, but chose not to because, laziness. Not sure if that changes the conclusions…)
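For anyone who wants to replicate, here’s one plausible reconstruction of that Monte Carlo step (my sketch, not the original code): shuffle the counts among the aphid-bearing locations, recompute the leverages, and see how often the maximum matches or beats the observed one.

    ## Permutation test for the maximum leverage
    hat_diag <- function(d) {
      X <- cbind(1, d$x, d$y, d$m)
      diag(X %*% solve(t(X) %*% X) %*% t(X))
    }
    obs <- max(hat_diag(potato))     # ~0.792488 if this matches the post's X
    nz  <- potato$m > 0
    sim <- replicate(1000, {
      d <- potato
      d$m[nz] <- sample(d$m[nz])     # shuffle counts among aphid points only
      max(hat_diag(d))
    })
    mean(sim >= obs)                 # proportion of draws at or above observed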

So? Two things: 1) I would love to find out that someone else invented a better method for detecting spatial outliers in point pattern data; and 2) hat matrices are really useful.

And because one dataset is never enough, I downloaded a version of John Snow’s cholera data that Robin Wilson digitized from the original maps. Same procedure here, except color indicates the number of deaths at each location and the dot circumference indicates the hat values.

[Figure: John Snow’s cholera points; color indicates the number of deaths at each address, circle size indicates hat values]

That point in the middle had 15 deaths when most of the addresses had one or two deaths; this created a hat value of 0.28009. Given 1000 Monte Carlo trials, less than 2% of the draws showed a hat value higher than 0.28009. So even though this point looks to be quite close to the middle of the points, it is likely to be a spatial outlier – above and beyond what we would expect given randomness.

Comments? Suggestions for improvements? Pictures of John Snow wearing dapper hats?


Google is back in the news for collecting WiFi data.  As it turns out, the Europeans are really touchy about Google Street View and their private data.  This story started back in 2010, when Google admitted that they were collecting public WiFi information with the same vehicles that drive around the world taking pictures for their Street View and Google Maps applications.  Seemed like a good idea, but multiple European privacy agencies got all bent out of shape.

At first, I was sort of on Google’s side on this one.  It would be cool to have a map of WiFi density.  If you read through that blog post from Google, though, you’ll notice that they only meant to collect public information — like the WiFi network name and its broadcasting channel — but “mistakenly” collected “samples” of payload data. Huh? I.e. they collected samples of websites that were being visited at unsecured WiFi access points like coffee shops (and if a website had poorly implemented its security, they may have collected your personal information, but you can’t really blame Google for that one). That’s creepy.  Google claimed they had deleted all the payload data, but Google maintains a worldwide system of redundant storage servers, and it turns out they didn’t get it all deleted.

So I’m not on Google’s side anymore.  They may be making a good faith effort to make this right, or they may be running a test program to identify which coffee shops slant towards Facebook or Google Plus usage.  Such a program wouldn’t be evil, per se, but it would be highly unethical. The whole episode brought to mind an article from four years ago in which a geographer used basically the same procedure to measure WiFi density in and around Salt Lake City:

Torrens, P. M. 2008. Wi-Fi geographies. Annals of the Association of American Geographers 98:59-84.

For those of you without access to academic libraries, you can get a pretty good flavor of his research from this website, and here’s the punchline:

[Figure: map of WiFi density in Salt Lake City, from Torrens (2008)]

Torrens briefly addressed the issue of private/public space and legalities of collecting his data:

Most computer networks use IP to disassemble and reconstitute data as they are conveyed across networks and routers. Wi-Fi beacon frames essentially advertise the presence of the access point to clients in the surrounding environment and ensure that it is visible (in spectrum space) to many devices. Because they do not actually carry any substantive data from users of the network (their queries to a search engine, for example), it is legal to capture beacon frames. (p. 66)

Two questions: 1) What is the line between data that is legal to collect and data that is illegal? And are researchers obligated to follow international standards or their home nation’s laws? 2) Has a human subjects review board ever considered this issue?  For example, Madison, Wisconsin, has a downtown wireless network that sells subscription service. It seems like a valid research question in communications geography to figure out which of their access points have traffic, during which times of day, the distribution of laptops and smart phones, etc. Could I sample payload data if I just used it to collect the presence/absence of users and cross-my-heart promised to delete the raw data?

Looking around the University of Wisconsin’s IRB website I couldn’t find any memos about collecting ambient wireless signals. And their summary of exempt research might imply that WiFi data collection would be exempt based on its “public” nature, but it’s less clear if it is truly de-identified because Google and Torrens were collecting MAC addresses and SSIDs.  True, that’s not like storing a person’s name, but IRB standards generally hold that street addresses are identifying data.  The relevant guidelines from the Wisconsin policy on exempt research:

Research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects.

Hm, I don’t know.