This is a guest post by Jen Schradie. Jen is a doctoral candidate in the Department of Sociology at the University of California-Berkeley and the Berkeley Center for New Media. She has a master’s degree in sociology from UC Berkeley and a MPA from the Harvard Kennedy School. Using both statistical methods and qualitative fieldwork, her research is at the intersection of social media, social movements and social class. Her broad research agenda is to interrogate digital democracy claims in light of societal and structural differences. Before academia, she directed six documentary films on social movements confronting corporate power. You can find her at www.schradie.com or @schradie on Twitter.

Five years ago, Chris Anderson, editor-in-chief of Wired Magazine, wrote a provocative article entitled, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (2008). He argued that hypothesis testing is no longer necessary with google’s petabytes of data, which provides all of the answers to how society works. Correlation now “supercedes” causation:

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

An easy strawman, Anderson’s piece generated a host of articles in academic journals decrying his claim. The overall consensus, to no surprise, was that the scientific method – i.e. hypothesis testing – is far from over. Most argued as Pigliucci (2009:534) articulated,

But, if we stop looking for models and hypotheses, are we still really doing science? Science, unlike advertising, is not about finding patterns—although that is certainly part of the process—it is about finding explanations for those patterns.

Other analysts focused on the debate around “correlation is not causation.” Some critiqued Anderson in that correlation can lead you in the wrong direction with spurious noise.  Others implicitly pointed to what Box (1976) articulated so well pre-Big Data – that science is an iterative process in which correlation is useful in that it can trigger research which uses hypothesis testing.

Continue reading

datagothamThis is a guest post by Sean J. Taylor, a PhD student in Information Systems at NYU’s Stern School of Business.

Last Thursday and Friday I attended the 2nd annual DataGotham conference in New York City. Alex Hanna asked me to write about my experience there for the benefit of those who were unable to attend, so here’s my take on the event.

Thursday evening was a social event in a really sweet rooftop space in Tribeca with an open bar and great food (a dangerous combination for this still-grad-student). Though I spent a lot of the time catching up with old friends, I would describe the evening as “hanging out on Twitter, but in person.” I met no fewer than a dozen people I had only previously known online. I am continually delighted at how awesomeness on Twitter is a reliable indicator of awesomeness in-person. Events like DataGotham are often worth it for this reason alone.

Continue reading

This is a guest post by John Beieler, originally posted at http://johnbeieler.org/blog/2013/04/12/gdelt/

I made the remark on Twitter that it seemed like GDELT week due to a Foreign Policy piece about the dataset, Phil and Kalev’s paper for the ISA 2013 meeting, and a host of blog posts about the data. So, in the spirit of GDELT week, I thought I would throw my hat into the ring. But instead of taking the approach of lauding the new age that is approaching for political and social research due to the monstrous scale of the data now available, I thought I would write a little about the issues that come along with dealing with such massive data.

Dealing with GDELT

As someone who has spent the better part of the past 8 months dealing with the GDELT dataset, including writing a little about working with the data, I feel that I have a somewhat unique perspective. The long and the short of my experience is: working with data on this scale is hard. This may strike some as obvious, especially given the cottage industry that has sprung up around Hadoop and and other services for processing data. GDELT is 200+ million events spread across several years. Each year of the reduced data is in a separate file and contains information about many, many different actors. This is part of what makes the data so intriguing and useful, but the data is also unlike data such as the ever-popular MID data in political science that is easily managed in a program like Stata or R. The data requires subsetting, massaging, and aggregating; having so much data can, at some points, become overwhelming. What states do I want to look at? What type of actors? What type of actions? What about substate actors? Oh, what about the dyadic interactions? These questions and more quickly come to the fore when dealing with data on this scale. So while the GDELT data offers an avenue to answer some existing questions, it also brings with it many potential problems.

Careful Research

So, that all sounds kind of depressing. We have this new, cool dataset that could be tremendously useful, but it also presents many hurdles. What, then, should we as social science researchers do about it? My answer is careful theorizing and thinking about the processes under examination. This might be a “well, duh” moment to those in the social sciences, but I think it is worth saying when there are some heralding “The End of Theory”. This type of large-scale data does not reduce theory and the scientific method to irrelevance. Instead, theory is elevated to a position of higher importance. What states do I want to look at? What type of actions? Well, what does the theory say? As Hilary Mason noted in a tweet:

Data tells you whether to use A or B. Science tells you what A and B should be in the first place.

Put into more social-scientific language, data tells us the relationship between A and B, while science tells us what A and B should be and what type of observations should be used. The data under examination in a given study should be driven by careful consideration of the processes of interest. This idea should not, however, be construed as a rejection of “big data” in the social sciences. I personally believe the exact opposite; give me as many features, measures, and observations as possible and let algorithms sort out what is important. Instead, I think the social sciences, and science in general, is about asking interesting questions of the data that will often require more finesse than taking an “ANALYZE ALL THE DATA” approach. Thus, while datasets like GDELT provide new opportunities, they are not opportunities to relax and let the data do the talking. If anything, big data generating processes will require more work on the part of the researcher than previous data sources.

John Beieler is a Ph.D. student in the Department of Political Science at Pennsylvania State University. Additionally, he is a trainee in the NSF Big Data Social Science IGERT program for 2013-2015. His substantive research focuses on international conflict and instances of political violence such as terrorism and substate violence. He also has interests in big data, machine learning, event forecasting, and social network analysis. He aims to bring these substantive and methodological interests together in order to further research in international relations and enable greater predictive accuracy for events of interest. 


Signers to White House secession petitions by county. Color based on proportion of residents signing, with darker colors showing higher levels of secession support. Current as of 1am on Thurday, November 15th. Click here for an interactive version.

Since Election Day, more than 60 petitions have been posted on the White House’s website requesting that states be allowed to withdraw from the United States and create their own government. As of November 13, 2012, the following states had active petitions: Alabama, Alaska, Arizona, Arkansas, California, Colorado, Delaware, Florida, Georgia, Idaho, Illinois, Indiana, Kansas, Kentucky, Louisiana, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Virginia, Virginias, West Virginia, Wisconsin, and Wyoming.

While petitions are focused on particular states, signers can be from anywhere. In order to show where support for these secession was the strongest, a graduate seminar on collecting and analyzing and data from the web in the UNC Sociology Department downloaded the names and cities of each of the petition signers from the White House website, geocoded each of the locations, and plotted the results.

In total, we collected data on 764,387 signatures. Of these, we identified 270,610 unique combinations of names and places, suggesting that a large number of people were signing more than one petition. Approximately 90%, or 244,001, of these individuals provided valid city locations that we could locate with a US county.

The above graphic shows the distribution of these petition signers across the US. Colors are based proportion of people in each county who signed.

We also looked at the distribution of petition signers by gender. While petition signers did not list their gender, we attempted to match first names with Social Security data on the relative frequency of names by sex. Of the 242,823 respondents with gendered names, 62% had male names and 38% had female names. This 24 point gender gap is twice the size of the gender gap for voters in the 2012 Presidential election.

Tools: Python, Yahoo Geocoding API, and Pete Skomoroch’s remix of Nathan Yau’s county thematic map script.

Neal Caren, Ali Eshraghi, Sarah Gaby, Brandon Gorman, Michael Good, Jonathan Horowitz, Ali Kadivar, Rachel Ramsay, Charles Seguin, and Didem Turkoglu.

Note: This is a repost from the original at http://www.unc.edu/~ncaren/secessionists/ the map at the original site is interactive, but I’ve had trouble adding that functionality here. We reposted to make this Bad Hessians Official. Thanks to the Bad Hessians for having us on as guests.

Greetings, everyone.  We are delighted to have been invited to author our first Bad Hessians guest post.  We are a couple of graduate students in the sociology department at University of North Carolina – Brandon Gorman and Charles Seguin.  Our post is about a project we began last year after we noticed that, during the Arab Spring, between January 25th and February 11th 2011, western media completely shifted from describing Hosni Mubarak as a “key US ally” to an “entrenched dictator.”  This made us wonder – what structures US media attention to foreign leaders?

Continue reading