This weekend, I made it out to Penn State to participate in the GDELT hackathon, sponsored by the Big Data Social Science IGERT and held in the punnily-named Databasement. The hackathon brought together a lot of different groups — political scientists, industry contractors, computer and information scientists, geographers, and — of course — sociologists (I was one of two).
GDELT, as you may remember, is a political events database with nearly 225 million events from 1979 to the present. Hackathon attendees had interests ranging from optimizing and normalizing the database to predicting violent conflict and improving event data in general.
My team (Carnegie Mellon CS PhD student Brendan O’Connor, industry data scientist Liz Merkhofer, Penn State Big Data Social Science IGERT fellow Muhammed Idris, and myself) was interested in how to improve on the current CAMEO verb dictionaries, which GDELT is currently built on. We used an unsupervised machine learning method developed by Brendan and Harvard PhD student Brandon Stewart to cluster verbs. Within each cluster, we located the CAMEO codes used most frequently by verbs already in the current dictionary, then mapped the new verbs in that cluster to the most prevalent code. If this isn’t exactly clear, I wouldn’t worry: it was a somewhat blunt and heavy-handed way to approach this. You can find our code for this on GitHub.
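For the curious, the mapping step described above boils down to something like a majority vote within each cluster. This is only a rough sketch, not our actual hackathon code: the function name `propose_codes`, the toy clusters, and the tiny verb dictionary are all made up for illustration, and the codes are just CAMEO root categories.

```python
from collections import Counter

def propose_codes(clusters, verb_to_cameo):
    """Propose a CAMEO code for each uncoded verb, using the most common
    code among the already-coded verbs in the same cluster."""
    proposals = {}
    for verbs in clusters.values():
        # Codes of the verbs in this cluster that the dictionary already covers.
        known = [verb_to_cameo[v] for v in verbs if v in verb_to_cameo]
        if not known:
            continue  # nothing to vote with, so skip this cluster
        top_code, _ = Counter(known).most_common(1)[0]
        for v in verbs:
            if v not in verb_to_cameo:
                proposals[v] = top_code
    return proposals

# Hypothetical toy inputs: cluster id -> verbs, and a tiny existing dictionary.
clusters = {
    0: ["protest", "demonstrate", "condemn", "march", "picket"],
    1: ["meet", "visit", "host"],
}
verb_to_cameo = {"protest": "14", "demonstrate": "14", "condemn": "11",
                 "meet": "04", "visit": "04"}

proposals = propose_codes(clusters, verb_to_cameo)
print(proposals)
```

A plain majority vote like this ignores how confident the clustering is and how close the vote was, which is part of why I call the approach blunt.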
A team led by Patrick Brandt worked on normalizing GDELT. GDELT has a huge jump in sources after about the year 2000, which means that seeing more events after that point is a data artifact rather than actual signal. They worked out a way to normalize protest data, and it looks pretty great.
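I don't have the details of the team's method, so the following is not their approach. It is just a minimal sketch of one common way to deal with this kind of artifact: divide the count of events of interest by the total number of events recorded in the same period. The function name `normalize_counts` and all the numbers are invented toy values, not GDELT figures.

```python
def normalize_counts(event_counts, total_counts):
    """Return each period's events of interest as a share of all events
    recorded in that period (assumed normalization, for illustration only)."""
    shares = {}
    for period, total in total_counts.items():
        if total > 0:  # skip periods with no recorded events at all
            shares[period] = event_counts.get(period, 0) / total
    return shares

# Invented toy numbers: raw protest counts grow tenfold, but so does coverage.
protest_counts = {1995: 50, 2005: 500}
total_counts = {1995: 1000, 2005: 10000}

shares = normalize_counts(protest_counts, total_counts)
print(shares)
```

In this toy case the raw counts suggest a tenfold rise in protest, while the shares are flat, which is exactly the kind of artifact normalization is meant to expose.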
Travis Pinney and friends worked on an optimized database for common GDELT queries. Using only some open-source magic and an Amazon Web Services server, they were able to retrieve slices and aggregates of the dataset in under 10 seconds. We were all pretty impressed by this. He’s thrown the code up on GitHub.
Other projects included an R package (called gdelt tools) and an effort to build aggregates of the dataset into some common political conflict models.
I thought it was a really successful event. It had all the social trappings and networking of a traditional social science conference but all the ad hoc-ness and flexibility of a hack-oriented event. I met a lot of people I had only known through Twitter or other electronic means, including the main GDELT grad John Beieler, and Phil Schrodt, the architect of GDELT’s predecessors. The hackathon format allowed folks with common research interests but wildly divergent backgrounds to get together and work towards a common research goal without having to establish formal partnerships. I hope (cough cough) this format can be brought to other parts of the social sciences.