This is a guest post by Neal Caren. He is an Associate Professor of Sociology at the University of North Carolina, Chapel Hill. He studies social movements and the media.

Folks like Jay Ulfelder and Erin Simpson have already pointed out the flaws in Mona Chalabi’s recent stories that used GDELT to count and map the number of kidnappings in Nigeria. I don’t have much to add, except to point out that hints to some of the problems with using the data to count events were in the dataset all along.

In the first story, “Kidnapping of Girls in Nigeria Is Part of a Worsening Problem,” Chalabi writes:

The recent mass abduction of schoolgirls took place April 15; the database records 151 kidnappings on that day and 215 the next.

To investigate the source of this claim, I downloaded the daily GDELT files for those days and pulled all the kidnappings (CAMEO Code 181) that mentioned Nigeria. GDELT provides the story URLs. Each different GDELT event is assocaited with a URL, although one article can produce more than one GDELT event.

I’ve listed the URLs below. Some of the links are dead, and I haven’t looked at all of the stories yet, but, as far as I can tell, every single story that is about a specific kidnapping is about the same event. You can get a sense of this by just look at the words in the URLS for just those two days. For example, 89 of the URLs contain the word “schoolgirl” and 32 contain Boko Haram. It looks like instead of 366 different kidnappings, there were many, many stories about one kidnapping.

Something very strange is happening with the way the stories are parsed and then aggregated. I suspect that this is because when reports differ on any detail, each report is counted as a different event. Events are coded on 57 attributes each of which has multiple possible values and it appears that events are only considered duplicates when they match all on attributes. Given the vagueness of events and variation in reporting style, a well-covered, evolving event like the Boko Haram kidnapping is likely to covered in multiple ways with varying degrees of specificity, leading to hundreds of “events” from a single incident.

Plotting these “events” on a map only magnifies the errors–there are 41 different unique latitudes/longitudes pairs listed to described the same abduction.

At a minimum, GDELT should stop calling itself an “event” database and call itself a “report” database. People still need to be very careful about using the data, but defaulting to writing that there were 366 reports about kidnapping in Nigeria over these two days is much more accurate than saying there were 366 kidnappings.

In case you were wondering, GDELT lists 296 abductions associated with Nigeria that happened yesterday (May 14th, 2014) in 42 different locations. Almost all of the articles are about the Boko Haram school girl kidnappings, and the rest are entirely miscoded, like the Heritage blog post about how the IRS is targeting the Tea Party.

Continue reading

The Databasement

This weekend, I made it out to Penn State to participate in the GDELT hackathon, sponsored by the Big Data Social Science IGERT and held in the punnily-named Databasement. The hackathon brought together a lot of different groups — political scientists, industry contractors, computer and information scientists, geographers, and — of course — sociologists (I was one of two).

GDELT, as you may remember, a political events database with nearly  225 million events from 1979 to the present. Hackathon attendees had interests ranging from optimizing and normalizing the database, predicting violent conflict, and improving event data in general.

Continue reading