This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at He’s blogged at Bad Hessian before here.

WordPress Stats - Visitors vs. Views
WordPress Stats – Visitors vs. Views

For those of you with WordPress blogs and have the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about this chart, other than you usually don’t see bar charts with the bars shown superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).

Continue reading

Sadly, we haven’t posted in a while. My own excuse is that I’ve been working a lot on a dissertation chapter. I’m presenting this work at the Young Scholars in Social Movements conference at Notre Dame at the beginning of May and have just finished a rather rough draft of that chapter. The abstract:

Scholars and policy makers recognize the need for better and timelier data about contentious collective action, both the peaceful protests that are understood as part of democracy and the violent events that are threats to it. News media provide the only consistent source of information available outside government intelligence agencies and are thus the focus of all scholarly efforts to improve collective action data. Human coding of news sources is time-consuming and thus can never be timely and is necessarily limited to a small number of sources, a small time interval, or a limited set of protest “issues” as captured by particular keywords. There have been a number of attempts to address this need through machine coding of electronic versions of news media, but approaches so far remain less than optimal. The goal of this paper is to outline the steps needed build, test and validate an open-source system for coding protest events from any electronically available news source using advances from natural language processing and machine learning. Such a system should have the effect of increasing the speed and reducing the labor costs associated with identifying and coding collective actions in news sources, thus increasing the timeliness of protest data and reducing biases due to excessive reliance on too few news sources. The system will also be open, available for replication, and extendable by future social movement researchers, and social and computational scientists.

You can find the chapter at SSRN.

This is very much a work still in progress. There are some tasks which I know immediately need to be done — improving evaluation for the closed-ended coding task, incorporating the open-ended coding, and clarifying the methods. From those of you that do event data work, I would love your feedback. Also if you can think of a witty, Googleable name for the system, I’d love to hear that too.

This is a guest post by Charles Seguin. He is a PhD student in sociology at the University of North Carolina at Chapel Hill.

Sociologists and historians have shown us that national public discourse on lynching underwent a fairly profound transformation during the periods from roughly 1880-1925. My dissertation studies the sources and consequences of this transformation, but in this blog post I’ll just try to sketch some of the contours of this transformation. In my dissertation I use machine learning methods to analyze this discursive transformation, however after reading several hundred lynching articles to train the machine learning algorithms, I think I have a pretty good understanding of key words and phrases that mark the changes in lynching discourse. In this blog post then, I’ll be using basic keyword, bigram (word pair), and trigram searches to illustrate some of the changes in lynching discourse.

Continue reading

This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at

A few months ago I passed the 10-year point in my analytics/predictive modeling career. While ‘Big Data’ and ‘Data Science’ have only become buzzwords in recent years, hitting the limit on computing resources has been something that has plagued me throughout my career. I’ve seen this problem manifest itself in many ways, from having analysts get assigned multiple computers for daily work, to continuously scraping together budget for more processors on a remote SAS server and spending millions on large enterprise databases just to get processing of data below a 24-hour window.

Luckily, advances in open source software & cloud computing have driven down the cost of data processing & analysis immensely. Using IPython Notebook along with Amazon EC2, you can now procure a 32-core, 60GB RAM virtual machine for roughly $0.27/hr (using a spot instance). This tutorial will show you how to setup a cluster instance at Amazon, install Python, setup IPython as a public notebook server and access this remote cluster via your local web browser.

To get started with this tutorial, you need to have an Amazon Web Services account. I also assume that you already have basic experience interacting with computers via the command line and know about IPython. Basically, that you are the average Bad Hessian reader…

Continue reading

This is a guest post by Karissa McKelvey. She has a BA in Computer Science and Political Science from Indiana University. After graduating, she worked as a research assistant at the Center for Complex Networks and Systems Research at Indiana University on an NSF grant to analyze and visualize the relationship between social media expressions and political events. She is an active contributor to open source projects and continues to publish in computer supported cooperative work and computational social science venues. She currently works as a Software Engineer at Continuum Analytics.

Imagine you are a graduate student of some social or behavioral science (not hard, I assume). You want to collect some data: say I’m going to study the fluctuation of value of products over time on Craiglist, or ping the Sunlight Foundation’s open government data, or use the GDELT to study violent political events. There are a variety of tools I may end up using for my workflow:

  1. Retrieving the data: Python, BeautifulSoup
  2. Storing the data: CSV, Json, MySQL, MongoDB, bash
  3. Retrieving this stored data: SQL, Hive, Hadoop, Python, Java
  4. Manipulating the data: Python, CSV, R
  5. Running regressions, simulations: R, Python, STATA, Java
  6. Presenting the data: R, Excel, Powerpoint, Word, LaTeX

My workflow for doing research now requires a variety of tools, some of which I might have never used before. The number of tools I use seems to scale with the amount of work I try to accomplish. When I encounter a problem in my analysis, or can’t reproduce some regression or simulation I ran, what happened? Where did it break?

Should it really be this difficult? Should I really have to learn 10 different tools to do data analysis on large datasets? We can look at the Big Data problem in a similar light as surveys and regression models. The largest and most fundamental part of the equation is just that this stuff is new – high-priority and well thoughout workflows have yet to be fully developed and stablized.

What if I told you that you could do all of this with the fantastically large number of open source packages in Python? In your web browser, on your iPad?

Continue reading

Over the weekend I led a workshop on basic Twitter processing using Hadoop Streaming (or at least simulating Hadoop Streaming). I created three modules for it.

The first is an introduction of MapReduce that calculates word counts for text. The second is a (very) basic sentiment analysis of political tweets, and the last one is a network analysis of political tweets.

All the code for these workshops is on the site. What other kinds of analysis can/should be done with Twitter data?

Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.

I’m drawing on some of my previous “tworkshops” that are meant to bring people from zero knowledge, to knowing how to move around basic analysis of Twitter data with potential for parallel processing in systems like Hadoop MapReduce.

Let’s start with the basics of what the data look like and how to access it.

Continue reading

Learning to use software always entails some startup cost. I recently had an exchange with one of my colleagues who is relatively new to social network analysis. He asked about my thoughts on a certain network analysis program and mentioned that “it’s easy to get lost with so many [network analysis] programs out there.” His impression is completely understandable. Social network analysis has become immensely popular in recent years. The rise in its popularity has especially been witnessed among gifted people capable of writing good software. Indeed, one Wikipedia list broadly describes about 70 social network analysis programs. Each of these programs have their strengths and weaknesses with regards to its contributions to the field. Given the wealth of options, which programs are worth the time investment to learn?

If you’re new to network analysis then I’d highly recommend learning the packages in R, perhaps supplemented by Pajek and/or Python packages. Here’s why:

Continue reading

Greetings, everyone.  We are delighted to have been invited to author our first Bad Hessians guest post.  We are a couple of graduate students in the sociology department at University of North Carolina – Brandon Gorman and Charles Seguin.  Our post is about a project we began last year after we noticed that, during the Arab Spring, between January 25th and February 11th 2011, western media completely shifted from describing Hosni Mubarak as a “key US ally” to an “entrenched dictator.”  This made us wonder – what structures US media attention to foreign leaders?

Continue reading

Most of the spatial data I work with begins its life as a shapefile. While there are a number of tools available for dealing with shapefiles in R, it is often easier to work in dedicated geographic information system (GIS) software such as ArcMap which is now almost exclusively oriented towards Python-based scripting. With a little help from Alex, I’ve managed to get my head wrapped around Python. The problem is that I now find myself running multiple scripts in multiple languages. When I’m writing code for personal consumption this isn’t really a problem. What usually happens is that I end up running the scripts in the wrong order and I have to start over. When it comes to providing to code to others, however, I am wary of anything that might lead to unintended errors. Consequently, I began looking into ways into which I could better integrate the Python-based scripts I use to work with geographic data with the R-based scripts I use to handle data analysis.

Perhaps the most elegant solution is to use something like RPy. I started down this road while working at home on a MacBook Pro only to have it all fall apart when I got to work where I am on a Windows-based system which isn’t compatible with either rpy or rpy2. As it turns out, the system command in R served as a viable work-around. More specifically, instead of writing a single script in Python using some variant of RPy, I wrote a master script in which I use the system command to call a separate .py file which generates shapefiles that can then be read and analyzed in R. This is basically a modification of a trick I’ve used in the past for organizing .tex files in my dissertation and .do files in Stata. The key difference is that so long as the scripts in questions can executed via the command line, it is relatively easy to use R to organize processes working across multiple platforms. I’d be interested to hear what other solutions people have come up with for dealing with this type of problem.