For my dissertation, I’ve been working on a way to generate new protest event data using principles from natural language processing and machine learning. In the process, I’ve been assessing other datasets to see how well they have captured protest events.

I’ve mused before on assessing GDELT (currently under reorganized management) for protest events. One step in doing this has been to compare it to the Dynamics of Collective Action dataset (hereafter DoCA), a remarkable undertaking supervised by some leading names in social movements research (Soule, McCarthy, Olzak, and McAdam), wherein their team hand-coded 35 years of the New York Times for protest events. Each event record includes not only when and where the event took place (what GDELT includes) but over 90 other variables, including a qualitative description of the event, the protesters’ claims, their target, the form of protest, and the groups initiating it.

Pam Oliver, Chaeyoon Lim, and I compared the two datasets by looking at a simple monthly time series of event counts and also did a qualitative comparison of a specific month.


The figure above compares monthly US protest counts for the months in which GDELT overlaps with DoCA. As the graph indicates, GDELT generally finds more protest events than DoCA, which is not surprising given the greater number of news sources it draws upon. More importantly for protest event analysis, the two series are essentially uncorrelated (r = -.11) across time. DoCA can be taken as the gold standard for coding events reported in the New York Times, but many concerns have been raised about the limitations of relying on a single newspaper for protest event analysis (e.g., Ortiz et al. 2005; Davenport 2010). It’s not clear whether the non-correspondence between GDELT and DoCA is due to the limitations of the New York Times as a source, the limitations of the GDELT search protocol for protest events, or both. But it does suggest we should be cautious in interpreting how much protest either dataset captures until we know more about the biases of each.

To better understand the sources of the discrepancy between the two datasets, we arbitrarily selected one month, April 1995, for which GDELT had 65 protest events and DoCA had 51 (counting events either reported or occurring that month). We sought to match events across the two datasets using place and date information, combined with the qualitative event descriptors in DoCA and the source and target actor codes in GDELT. For this month, there appeared to be zero overlap between the two datasets: there were no exact date/place matches, and the few events sharing a place but differing in date were ruled out as possible matches once the DoCA descriptive fields were compared to the GDELT actor fields.
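The first pass of that matching procedure, exact agreement on date and place, can be sketched as follows. The field names and records here are hypothetical, not the actual DoCA or GDELT schemas:

```python
def exact_matches(events_a, events_b):
    """Return pairs of events that share the same (date, place) key."""
    # Index the second dataset by (date, place) for constant-time lookup.
    index = {}
    for ev in events_b:
        index.setdefault((ev["date"], ev["place"]), []).append(ev)
    pairs = []
    for ev in events_a:
        for other in index.get((ev["date"], ev["place"]), []):
            pairs.append((ev, other))
    return pairs

# Hypothetical records: same city, different dates -> no exact match,
# so these would fall through to the manual descriptor/actor comparison.
doca = [{"date": "1995-04-09", "place": "Washington, DC", "desc": "march"}]
gdelt = [{"date": "1995-04-12", "place": "Washington, DC", "actor": "GOV"}]

print(exact_matches(doca, gdelt))  # -> []
```

Candidates that survive this exact filter (or share a place with a nearby date) still have to be adjudicated by hand, comparing DoCA's descriptive fields against GDELT's actor codes, which is where the April 1995 near-matches were ruled out.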

There is enough information in DoCA to identify some sources of divergence. Many of the DoCA events were lawsuits or meetings, which we would not expect GDELT to capture. Both datasets also exhibited a strong location bias: 25 of the 51 DoCA events occurred in New York, Connecticut, or New Jersey, and only 5 in Washington, DC. Most of the New York-area events had a state or local emphasis that probably would not interest the international wire services that are GDELT’s main sources. By contrast, 22 of the 65 GDELT events occurred in Washington, DC, only 5 in New York or Connecticut, and 33 (about half) specified no location more detailed than “United States.” But even events that “should” be in GDELT didn’t seem to be. For example, DoCA listed a march of 50,000 people in Washington, DC around a national issue that had no possible match in GDELT. The few GDELT events that listed protesters or opposition as an actor type had no possible matches in DoCA; in most cases, the GDELT actor types were listed as governments, not protesters. The information in these GDELT records was usually too sparse to give any sense of what the events were, or to permit date-focused searches of news archives to identify the original event GDELT had classified. Jay Ulfelder, who uses GDELT to monitor atrocities, likewise reports a high rate of false positives and relies on human readers to clean its results; he also reports that GDELT misses events known from other sources.

This is just a first attempt to understand what GDELT is getting at when it reports protest events. It’s getting at something, but there needs to be a move toward more transparency (possibly a source trail, which I know is part of the daily update files) when it reports protest event hits. In particular, it’d be helpful to know more about what’s in the historical backfiles (those from 1979–2012).

  • Jay Ulfelder

    Thanks a lot for doing this and reporting on it, Alex. We need a lot more of these exercises to get a clearer sense of what GDELT is and is not, and to help guide the further development of that project and others like it.

    One small point of clarification about what I’ve been finding in our efforts to use GDELT to monitor atrocities: Our two-stage, human-in-the-loop screening isn’t about eliminating false positives as much as it’s about adapting and extending GDELT to do something it wasn’t designed to do. We’re trying to use GDELT to spot deliberate, lethal, political violence against noncombatant civilians, and that’s not an event type CAMEO was designed to recognize. So, most of the records we review and toss out aren’t false positives so much as related events that turn out not to meet our criteria. To be sure, there are also events that appear to be miscategorized, and we see quite a few errors in the actor codings, but the single-biggest issue for us is that we’re looking for a type of event that CAMEO and GDELT weren’t designed to code directly. Hopefully, future versions of these systems will either add what we’re looking for or be flexible enough so we can adapt it to that ourselves.

    • Thanks for clarifying, Jay. I knew you were using GDELT as a sort of prescreen before digging into the events proper. I also hope that future event ontologies will be sensitive enough to pick up these kinds of things. I know Phil Schrodt has said that CAMEO was designed for mediation, so gathering info on atrocities and protests is somewhat out of its element.

  • Brian Phillips

    This is very interesting. I’d be curious to know more about GDELT’s 300+ spike in 1987 or so.

  • Pingback: Two Rubes Walk into a Bar, Order Event Data (Part 2) | Will Opines

  • Pingback: No More Fountains of Youth/Pots o’ Gold: Conceptualization and Events Data (Part 1) | Will Opines

  • arnimar

    I’d love to hear more thoughts generally on what kinds of provenance information would be helpful as a ‘source trail’.

    There is clearly growing interest in keeping track of data history for larger data management systems, but to the best of my knowledge the field is still largely untouched when it comes to simple compiled datasets (in CSV and similar formats).

    • Here I am referencing the source newspaper (or blog, or wherever) the coded event came from.

  • Pingback: Searching for Trends: Analyzing Patterns in Conflict Violence Data | Political Violence @ a Glance

  • Pingback: The Current State of Protest Event Data | Mobilizing Ideas

  • Pingback: Event Data in Popular Media | Anthony A. Boyles

  • Pingback: Distilling Event Data from News Articles |

  • Pingback: Some reflections on the availability, reliability, and replicability of protest data – Philippe Joly