This week, the Global Data on Events, Location, and Tone (GDELT) dataset went public. The architect of the project is Kalev Leetaru, a researcher in library and information sciences, and the project owes much to the work of Phil Schrodt.

The scale of this project is nothing short of groundbreaking. It includes 200 million dyadic events from 1979 to 2012. Each event profiles target and source actors (not only states, but also substate actors), the type of event, drawn from the Schrodt-specified CAMEO project, and, for many of the events, even the longitude and latitude. The events are drawn from several different news sources, including the AP, AFP, Reuters, and Xinhua, and are computer-coded with Schrodt’s TABARI system.

To give you a sense of how much this improves on the granularity of what we once had, the last large project of this sort that wasn’t in the domain of a national security organization was King and Lowe’s 10 million dyadic events dataset. Furthermore, the dataset will be updated daily. And to put a cherry on the top, as Jay Ulfelder pointed out, it was funded by the National Science Foundation.

For my own purposes, I’m planning on using these data to extract protest event counts. Social movement scholars have typically relied on handcoding newspaper archives to count particular protest events, which is time-consuming and also susceptible to selection and description bias (Earl et al. 2004 have a good review of this). This dataset has the potential to take some of the time out of this; the jury is still out on how well it accounts for the biases, though.

For what it’s worth, though, it looks like it does a pretty bang-up job with some of the Egypt data. Here’s a simple plot I did across time for CAMEO codes related to protest with some Egyptian entity as the source actor. The counts are rather low until January 2011, then stay steadier throughout the year, peaking again in November 2011, during the Mohamed Mahmoud clashes.
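For anyone who wants to try a similar count, here is a minimal sketch in Python of the filtering step. It is not the code behind the plot. It assumes the events sit in a tab-delimited file with columns named `SQLDATE`, `Actor1CountryCode`, and `EventRootCode`, and that "codes related to protest" means CAMEO root code 14; the rows in `SAMPLE` are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Toy rows standing in for a GDELT extract. The values are invented for
# illustration; the column names are an assumption about the file layout.
SAMPLE = """SQLDATE\tActor1CountryCode\tEventRootCode
20110125\tEGY\t14
20110125\tEGY\t14
20110128\tEGY\t14
20110128\tUSA\t14
20110203\tEGY\t04
20111119\tEGY\t14
"""


def monthly_protest_counts(handle, country="EGY", root_code="14"):
    """Count events per YYYYMM where the source actor's country and the
    CAMEO root code match the requested values."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Actor1CountryCode"] == country and row["EventRootCode"] == root_code:
            counts[row["SQLDATE"][:6]] += 1  # first six digits are YYYYMM
    return dict(sorted(counts.items()))


counts = monthly_protest_counts(io.StringIO(SAMPLE))
print(counts)
```

To run it on a real extract, pass an open file handle instead of the `io.StringIO` wrapper, after checking the column names against the file you actually have.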


These data have a lot of potential for political sociology, where computer-coded event data haven’t really made much of an appearance. Considering the granularity of the data, and that they account for many substate actors, social movement scholars would be remiss not to start digging in.

A few other resources on GDELT:
Leetaru and Schrodt’s 2013 ISA paper
Jay Yonamine’s (one of Schrodt’s students) paper on predicting levels of violence in Afghanistan

Last week, Alyssa got the boot and Jinkx kept her place. And I totally called it with my first model that accounted for the proportional hazards assumption. I think the model is having a little more success as the season plods on.

Before I get to the predictions for episode 10, there are two really interesting prospects that may either give this model some more predictive power or become very interesting projects in their own right.


Last week, Alaska took it home with her dangerous performance, while Ivy Winters was sent home after going up against Alyssa Edwards. This is sad on many fronts. First, I love me some Ivy Winters. Second, Jinkx had revealed that she had a crush on Ivy, and the relationship that may have flourished between the two would have been too cute. But lastly, and possibly the worst, both of my models from last week had Ivy on top. Ugh.

What went wrong? Well, this certainly wasn’t Ivy’s challenge. But it’s high time that I started interrogating the models a little further.

A few weeks ago, Twitter announced that they were releasing a client for their Streaming API. It’s open-source! Get it here:

This is pretty great news, for a few reasons:

  1. The Streaming API relies on a persistent connection, so doing all that messy authentication and making sure you’re not going to drop any information is simplified and will comport with Twitter specs.
  2. Twitter is deprecating v1 of their APIs, including Streaming and RESTful. They haven’t made any dramatic changes in the Streaming API but it still means changing libraries or expecting someone who is maintaining your library of choice to update it.
  3. It’s all being developed and actively maintained in-house by Twitter. The maintainers, @steven and @kevino (apparently one of the perks of working at Twitter is getting an awesome username), are especially responsive with bug fixes and pull requests.
  4. There’s a plugin for the Twitter4j library, if you want to implement listeners that do any background data handling or parsing for particular pieces of data (deletes vs. stall_warnings). I haven’t tried this yet but it looks promising.

The downside? It’s in Java. While this used to be a nice insult when I was hacking around in CS 180 and Java was at version 1.4.2, Java has gotten much faster since then. The addition of projects like Apache Maven has made development with dependencies and handling classpaths much easier. But then, you still have to know at least a little Java to get the thing up and running.

I’ve been using this as my primary gardenhose collection device for a few weeks now with only a handful of issues, as bugs are surfacing in development but being squashed soon after.

Wow, last week’s Drag Race post made the rounds in the stats and Drag Race circles. It was cross-posted to Jezebel and has been getting some pretty high-profile links. A little birdy told me that Ms. Ru herself has read it. I think I can die a happy man knowing that RuPaul has visited Bad Hessian.

Anyhow, last week I tried to count Coco out. I was reading her like the latest AJS. The library is open. But her response to me was simple — girl, please:

(Also this happened. Wig under a wig.)

(both of these gifs courtesy of f%^@yeahdragrace)

Can that win safeguard Coco from getting eliminated? Let’s look at the numbers after the jump.


If you follow me on Twitter, you know that I’m a big fan of RuPaul’s Drag Race. The transformation, the glamour, the sheer eleganza extravaganza is something my life needs to interrupt the monotony of grad school. I was able to catch up on nearly four seasons in a little less than a month, and I’ve been watching the current (fifth) season religiously every Monday at Plan B, the gay bar across from my house.

I don’t know if this occurs with other reality shows (this is the first I’ve been taken with), but there is some element of prediction involved in knowing who will come out as the winner. A drag queen we spoke with at Plan B suggested that the length of time each queen appears in the season preview is an indicator, while Homoviper’s “index” is largely based on a more qualitative, hermeneutic analysis. I figured, hey, we could probably build a statistical model to know which factors are the most determinative in winning the competition.

The quote above comes from Firebaugh and Gibbs’s “User’s Guide to Ratio Variables” (1985: 718). I first ran across this article a couple of years ago but only just got around to reading it this past week. This article, along with a couple of companion pieces (Firebaugh and Gibbs 1986; Firebaugh 1988), helped to redefine what was, at that point, a nearly century-old debate dating back to an 1897 article by Karl Pearson on ratio variables and the problem of spurious correlation. The gist of “Pearson’s Paradox” is that “two ratios can be correlated even when their components are not—for example, X/Z and Y/Z can be correlated even when X, Y, and Z are not” (Firebaugh 1988: 524).* This basic fact became a point of contention among methodologists interested in, among other things, the best approach to controlling for population size when analyzing aggregate data in which the magnitude of a given outcome is at least partially driven by the size of the underlying units. While the debate itself is pretty interesting, the thing I liked best about the Firebaugh and Gibbs piece is the way in which the authors managed to clear away a significant amount of methodological underbrush using simple math.
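Pearson’s Paradox is easy to see in a quick simulation. The sketch below is in Python and uses only the standard library; the uniform distributions, sample size, and seed are arbitrary assumptions chosen for illustration, not anything taken from Firebaugh and Gibbs. It draws X, Y, and Z independently, then compares the correlation of the components with the correlation of the ratios X/Z and Y/Z.

```python
import random
from math import sqrt


def corr(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


random.seed(42)  # arbitrary seed, only for reproducibility
n = 20000

# Three mutually independent, strictly positive variables.
x = [random.uniform(1, 2) for _ in range(n)]
y = [random.uniform(1, 2) for _ in range(n)]
z = [random.uniform(1, 2) for _ in range(n)]

r_components = corr(x, y)
r_ratios = corr([a / c for a, c in zip(x, z)],
                [b / c for b, c in zip(y, z)])

print(f"corr(X, Y)     = {r_components:.3f}")
print(f"corr(X/Z, Y/Z) = {r_ratios:.3f}")
```

The first correlation should hover around zero, while the second should come out clearly positive, purely because both ratios share the denominator Z.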

Following Firebaugh and Gibbs (1986: 103), let’s start with a component-based model in which y is a continuous outcome, x is the predictor of interest, z is a control representing the size of the population, and \eta represents a random disturbance:

    \[ y = \beta_0 + \beta_1{x} + \beta_2{z} + \eta. \]

If we then divide everything through by z we end up with an equivalent ratio-based model:

    \[ \frac{y}{z} = \beta_0{\left(\frac{1}{z}\right)} + \beta_1{\left(\frac{x}{z}\right)} + \beta_2 + \varepsilon, \]

where \varepsilon = \eta/z. On its face, the equivalence of these two expressions seems obvious. Yet prior to the work of Firebaugh and Gibbs, much of the fight was over the difference between the component-based model described above and the following:

    \[ \frac{y}{z} = \beta^*_1{\left(\frac{x}{z}\right)} + \beta^*_2 + \varepsilon^*. \]

Simply put, the fight was driven by an attempt to adjudicate between fundamentally non-comparable models, hence the reference in the title to wasted journal space, confused readers, and solutions to phantom problems.

What Firebaugh and Gibbs ultimately show is that when we compare the component method to the equivalent ratio method (i.e. when we make the correct comparison), we find alternative estimators for the same basic model. To the extent that \sigma^2—the variance of \eta—is proportional to z^2 (i.e. to the extent that the variance of the error term is characterized by a particular form of population-related heteroscedasticity), the ratio method actually provides more efficient estimates of the parameters of interest than the corresponding component method (see Firebaugh and Gibbs 1986).** So where we once saw a potential problem, we now see a potential solution.
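The equivalence between the ratio method and a weighted version of the component method can be checked numerically. Here is a minimal sketch in Python with NumPy; the simulated data, the seed, and the "true" coefficient values are assumptions invented for illustration. It fits the ratio model by ordinary least squares and the component model by weighted least squares with weights 1/z^2, then compares the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 500

# Simulated data (all values are made up for illustration).
z = rng.uniform(10, 100, n)          # population size
x = rng.uniform(0, 1, n) * z         # predictor that scales with z
eta = rng.normal(0, 1, n) * z        # error with sd proportional to z
y = 5 + 2 * x + 0.5 * z + eta        # component model

# Component design matrix: columns for beta_0, beta_1, beta_2.
X_comp = np.column_stack([np.ones(n), x, z])

# Ratio method: divide everything by z, then run OLS.
# The columns become 1/z, x/z, and 1, in the same coefficient order.
X_ratio = X_comp / z[:, None]
b_ratio = np.linalg.lstsq(X_ratio, y / z, rcond=None)[0]

# Component method by WLS with weights 1/z^2: solve (X'WX) b = X'Wy.
w = 1.0 / z**2
b_wls = np.linalg.solve(X_comp.T @ (X_comp * w[:, None]), X_comp.T @ (w * y))

print("ratio OLS:    ", b_ratio)
print("component WLS:", b_wls)
```

The two coefficient vectors should match up to floating-point error, which is the point: same model, different estimator.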

Even if you don’t care about ratio variables, I think that the original piece, subsequent follow ups, and exchanges with critics (namely Bradshaw and Radbill 1987) are well worth the read. This is a great example of someone thinking through the problem of model specification, as well as the implications of the often overlooked distinction between specification and estimation. There is also a serious discussion of the relationship between theory and method. More specifically, Firebaugh and Gibbs go to great lengths to emphasize that, by definition, our theoretical interests cannot help us decide between mathematically equivalent expressions. The trick, of course, is recognizing equivalent expressions when you see them.

* Firebaugh (1988: 524-526) shows that Pearson’s Paradox is a byproduct of the fact that correlation coefficients do not account for the value of the y-intercept. Pearson’s Paradox does not extend to the case of regression in which the intercept is explicitly taken into account.

** Nerdy readers may recognize the ratio method for what it is: a weighted least squares model.

Does anyone know a statistical test for telling me if I have an outlier within a set of spatial point data? It seems like someone should have invented said method in the 1960s and I just can’t find it through my googling. But I do read a fair bit of GIS and geostatistics literature, and I’ve never seen it. (Or, gasp, someone tried to do it and concluded it was intractable…) Guess I’ll have to make my own.

So here’s my situation: I have some point data – just normal latitude and longitude coordinates – with an associated covariate. Let’s call the covariate m for now. I want to draw a polygon around my points and argue that the resulting shape can be defined as the boundary of a neighborhood. Except I’m worried that there are really high, or really low, values of m near the border of the neighborhood, and thus my resulting polygons are potentially skewed toward/away from these “spatial outliers.”

As an illustration, imagine that a potato farmer wants to spray her field for aphids, but only wants to spray the affected areas. Logically, she decides to randomly sample 10 locations within her field, draw a polygon around the locations where she finds aphids, and then spray all the area inside the polygon (and none of the area outside the polygon). You could check out UC-Davis’s excellent website for potato pest management to confirm that this is not exactly the correct method, but it’s a reasonable approximation (although their recommendation on sampling is radically different).

Our potato farmer might observe something like this. The labels are the aphid counts at each location. I’ll explain the diamond symbols in a moment….


Remember that the “polygon of infestation” is drawn according to a presence/absence dichotomy, so the aphid count is the covariate in this situation. (And, yes, it has already occurred to me that everything I’m writing here might only apply to this special case where the polygon is based on a dichotomized version of an underlying count variable, m. But that additional complication is for future blogage….)

Here’s the data for those that want to play along at home:
i         x       y  m
1 -118.8682 46.1734 10
2 -118.8687 46.1737 3
3 -118.8683 46.1738 0
4 -118.8685 46.1732 0
5 -118.8688 46.1735 0
6 -118.8681 46.1732 0
7 -118.8686 46.1733 4
8 -118.8685 46.1734 9
9 -118.8684 46.1737 2
10 -118.8686 46.1736 1

I am trying to calculate the degree to which any given point might be considered an outlier: an outlier in terms of the distribution of the covariate, a spatial outlier, or both. To me, this sounds like a perfect situation to calculate leverages via the hat matrix. You may remember from an intro regression course – or maybe you don’t, because sadly, regression is usually not taught from a matrix algebra point of view – that the hat matrix is an n X n square matrix, with one row and one column per observation, which puts a hat on the observed Y variable:

y-hat = H * y

More importantly for my purposes, the diagonal elements of any hat matrix (usually denoted h_ii) indicate the amount of leverage that observation i has on y-hat. And even better for my purposes, I don’t need a Y variable to calculate a hat matrix because it’s composed entirely of the design matrix and a few transformations thereof:

H = X * ( t(X) * X )^{-1} * t(X)

where X is the n X p design matrix of n observations and p independent variables, t(X) is the transpose of X, and ( )^{-1} denotes the usual matrix inverse.

My design matrix, X, has the latitude, longitude, and aphid counts from my potato example. When I calculate a hat matrix for it, the resulting values are indicators of leverage — both spatially and on the covariate. Now look back up there at that map. See those diamond shapes? They’re proportionally sized based on the value of the diagonal of the hat matrix (h_ii) for each observation. And what do we conclude from this little example?

By looking at the raw aphid counts, our potato farmer may have been tempted to enlarge the spraying zone around the points with nine and 10 aphids, since they seem like rather high values and they’re both kind of near the edge of the polygon. However, the most “outlierly” observation in her sample was the spot with four aphids located at the southwest corner of the polygon. It has a hat value of 0.792488, a good bit larger than that of the location with 10 aphids, which had a hat value of 0.615336.
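For those playing along at home, the leverage calculation can be sketched in Python with NumPy. This is a sketch under stated assumptions, not a reproduction of the numbers above: the post does not say whether the design matrix included an intercept, so the code assumes one, and it standardizes the columns before a QR decomposition purely for numerical stability (with an intercept present, that rescaling does not change the hat values). The helper name `hat_values` is mine.

```python
import numpy as np

# Columns: longitude (x), latitude (y), aphid count (m), from the table above.
data = np.array([
    [-118.8682, 46.1734, 10],
    [-118.8687, 46.1737, 3],
    [-118.8683, 46.1738, 0],
    [-118.8685, 46.1732, 0],
    [-118.8688, 46.1735, 0],
    [-118.8681, 46.1732, 0],
    [-118.8686, 46.1733, 4],
    [-118.8685, 46.1734, 9],
    [-118.8684, 46.1737, 2],
    [-118.8686, 46.1736, 1],
])


def hat_values(features):
    """Diagonal of H = X (X'X)^-1 X', where X is an intercept plus features."""
    # Assumption: an intercept column is included in the design matrix.
    centered = features - features.mean(axis=0)
    scaled = centered / centered.std(axis=0)
    X = np.column_stack([np.ones(len(features)), scaled])
    # With a reduced QR decomposition X = QR, the hat matrix is H = Q Q',
    # so each h_ii is the squared length of row i of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)


h = hat_values(data)
for i, (row, h_ii) in enumerate(zip(data, h), start=1):
    print(f"point {i:2d}  m = {int(row[2]):2d}  h_ii = {h_ii:.4f}")
```

A handy sanity check: the hat values always sum to the number of columns in the design matrix, and each one falls between 0 and 1.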

At this point, a good geostatistician could probably come up with a measure of significance to go along with my hat values, but I’m not a geostatistician – good or otherwise. I just Monte Carlo-ed the values a bit and concluded that, given this arrangement of sample points *with aphids,* about 11% of the time we would see a hat value equal to or above 0.792488. If we use the standard alpha level of .05 found in most social science publications, our potato farmer would fail to reject the null hypothesis that the observed aphid counts were drawn from a random distribution. I.e., there aren’t any outliers – beyond what we would expect from randomness – so she should trust the polygon as a good boundary of the zone of infestation. (Note my emphasis of “with aphids” in the conclusion. I could have Monte Carlo-ed the points with zero counts, but chose not to because, laziness. Not sure if that changes the conclusions…)
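The Monte Carlo step is not spelled out above, so here is one plausible reading of it as a minimal Python/NumPy sketch: hold the locations of the six points with aphids fixed, shuffle the counts among them, and record how often the largest hat value meets or exceeds the observed one. The permutation scheme, the intercept in the design matrix, the number of draws, and the seed are all assumptions, so this will not necessarily reproduce the 11% figure.

```python
import numpy as np

# The six sampled locations with aphids (longitude, latitude) and their counts.
coords = np.array([
    [-118.8682, 46.1734],
    [-118.8687, 46.1737],
    [-118.8686, 46.1733],
    [-118.8685, 46.1734],
    [-118.8684, 46.1737],
    [-118.8686, 46.1736],
])
aphids = np.array([10, 3, 4, 9, 2, 1], dtype=float)


def max_hat_value(coords, m):
    """Largest diagonal element of the hat matrix for [1, lon, lat, m]."""
    features = np.column_stack([coords, m])
    centered = features - features.mean(axis=0)
    scaled = centered / centered.std(axis=0)   # rescaling for stability only
    X = np.column_stack([np.ones(len(m)), scaled])  # assumed intercept
    Q, _ = np.linalg.qr(X)                     # H = Q Q'
    return np.sum(Q**2, axis=1).max()


rng = np.random.default_rng(2013)  # arbitrary seed
n_draws = 1000                     # assumed number of trials

observed = max_hat_value(coords, aphids)
# Null model (an assumption): counts are exchangeable across locations.
draws = np.array([
    max_hat_value(coords, rng.permutation(aphids)) for _ in range(n_draws)
])
p_value = np.mean(draws >= observed)

print(f"observed max hat value: {observed:.4f}")
print(f"share of draws at or above it: {p_value:.3f}")
```

Other null models are just as defensible, for example drawing counts from a fitted Poisson distribution, and they could easily give a different answer.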

So? Two things: 1) I would love to find out that someone else invented a better method for detecting spatial outliers in point pattern data; and 2) hat matrices are really useful.

And because one dataset is never enough, I downloaded a version of John Snow’s cholera data that Robin Wilson digitized from the original maps. Same procedure here, except color indicates the number of deaths at each location and the dot circumference indicates the hat values.


That point in the middle had 15 deaths when most of the addresses had one or two deaths; this created a hat value of .28009. Given 1000 Monte Carlo trials, less than 2% of the draws showed a hat value higher than .28009. So even though this point looks to be quite close to the middle of the points, it is likely to be a spatial outlier – above and beyond what we would expect given randomness.

Comments? Suggestions for improvements? Pictures of John Snow wearing dapper hats?


Neal Caren of UNC Sociology has put out a call to forecast the number of NRA members in June 2013 using NRA publication subscriber data. Given that Neal’s interests align with those of several Bad Hessian contributors and per my previous post on predictions in sociology, I took a stab using the forecast package for R. You can find the code I used here and my forecast can be found on scatterplot here.