Bad Hessian « Computational social science blog

Over the weekend I led a workshop on basic Twitter processing using Hadoop Streaming (or at least simulating Hadoop Streaming). I created three modules for it.

The first is an introduction to MapReduce that calculates word counts for text. The second is a (very) basic sentiment analysis of political tweets, and the last one is a network analysis of political tweets.
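For readers who have not seen the pattern, here is a minimal sketch of a word count in the Hadoop Streaming style, simulated locally in Python. It is not the workshop's own code: the function names `mapper`, `reducer`, and `word_count` are hypothetical, and a call to `sorted()` stands in for the shuffle-and-sort step that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per token, as a streaming mapper would print."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each run of identical keys; the input must already be sorted."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def word_count(lines):
    """Chain map, sort, and reduce locally (an assumption standing in for a real cluster)."""
    # sorted() plays the role of Hadoop's shuffle-and-sort phase
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    for word, total in sorted(word_count(sample).items()):
        print(f"{word}\t{total}")
```

In a real streaming job the mapper and reducer would be separate scripts reading from standard input and writing to standard output; keeping them as generators here just makes the data flow easy to follow in one file.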

All the code for these workshops is on the site. What other kinds of analysis can/should be done with Twitter data?

Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.

I’m drawing on some of my previous “tworkshops,” which are meant to take people from zero knowledge to being able to do basic analysis of Twitter data, with the potential for parallel processing in systems like Hadoop MapReduce.

Let’s start with the basics of what the data look like and how to access them.

Continue reading →

Following up on the string of posts about software for network analysis, I recently taught a workshop for PhD students in the social sciences here at Stanford on using Python for network analysis. My session was part of a three-day series of workshops introducing computational social science to students who are looking to get their feet wet. I’m posting a link (here) to the page on my website where you can download the materials I developed to teach the workshop, including commented scripts, sample datasets, and a few slides.

Some brief impressions: I’ve taught stats/methods for grad students before, but this was a different beast. Computational social science and network analysis are attractive areas for many grad students here, but without a ‘canon’ of some type to fall back on, it’s hard to know what to emphasize for students with little background. I ended up focusing more on basic data and control structures in Python, which I thought would be more useful for understanding the way the networkx package handles inputs and outputs. I’m not sure that was the most effective approach, though, at least in terms of conveying why Python is a good choice for network analysis. Next time, I think I’ll try to integrate more substantive examples.

Also, inspired by Ben’s last post: maybe we should put a few network analysis packages to a speed test? I get this question all the time, and I usually just refer to my own anecdotal evidence, but it’s probably worth pitting iGraph, networkx, etc. against one another across platforms in calculating, say, shortest paths in a relatively large network. More on this later…
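As a rough idea of what such a speed test could look like, here is a minimal timing harness using only the Python standard library. Everything in it is an assumption made for illustration: the helper names (`random_graph`, `bfs_distances`, `time_candidates`) are hypothetical, and the pure-Python breadth-first search is only a placeholder for the package-specific shortest-path calls one would actually want to compare.

```python
import random
import time
from collections import deque


def random_graph(n_nodes, n_edges, seed=0):
    """Build an undirected random graph as an adjacency dict (n_edges must be feasible)."""
    rng = random.Random(seed)
    adjacency = {node: set() for node in range(n_nodes)}
    while n_edges > 0:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v and v not in adjacency[u]:
            adjacency[u].add(v)
            adjacency[v].add(u)
            n_edges -= 1
    return adjacency


def bfs_distances(adjacency, source):
    """Unweighted shortest-path lengths from source via breadth-first search."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def time_candidates(candidates, adjacency, source, repeats=5):
    """Return the best wall-clock time in seconds for each named function."""
    results = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(adjacency, source)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


if __name__ == "__main__":
    graph = random_graph(2000, 10000)
    print(time_candidates({"pure-Python BFS": bfs_distances}, graph, source=0))
```

To compare real packages, each one would get its own entry in the `candidates` dict, wrapping that package's shortest-path routine. Converting the graph into each package's native format should happen outside the timed region so that only the path calculation is measured.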

P.S. It’s only taken me 3 months to write my first post!

Learning to use software always entails some startup cost. I recently had an exchange with one of my colleagues who is relatively new to social network analysis. He asked about my thoughts on a certain network analysis program and mentioned that “it’s easy to get lost with so many [network analysis] programs out there.” His impression is completely understandable. Social network analysis has become immensely popular in recent years. The rise in its popularity has especially been witnessed among gifted people capable of writing good software. Indeed, one Wikipedia list broadly describes about 70 social network analysis programs. Each of these programs has its strengths and weaknesses with regard to its contributions to the field. Given the wealth of options, which programs are worth the time investment to learn?

If you’re new to network analysis then I’d highly recommend learning the packages in R, perhaps supplemented by Pajek and/or Python packages. Here’s why:

Continue reading →

Greetings, everyone. We are delighted to have been invited to author our first Bad Hessians guest post. We are a couple of graduate students in the sociology department at the University of North Carolina – Brandon Gorman and Charles Seguin. Our post is about a project we began last year after we noticed that, during the Arab Spring, between January 25th and February 11th, 2011, Western media completely shifted from describing Hosni Mubarak as a “key US ally” to an “entrenched dictator.” This made us wonder – what structures US media attention to foreign leaders?

Continue reading →

I recently discovered Gary Weissman’s excellent post on Grey’s Anatomy Network of Sexual Relations and I felt inspired. For those who haven’t heard of the television show before, Grey’s Anatomy is a widely popular, award-winning prime-time medical drama airing on ABC which has received no shortage of critical acclaim. Meeting conventional medical drama expectations, the show quite regularly features members of its attractive cast “hooking up.” Or so I am told. In an effort to teach medical students some basic social network lessons, Weissman produced a network data set on the show’s sexual contacts between characters. Though I’m not particularly fond of the show and both sexual and fictional networks lie outside my research interests, Weissman’s post served as a remarkable demonstration of network analysis for pedagogical purposes.

Continue reading →

Working with right-to-left languages like Arabic in R can be a bit of a headache, especially when mixed with left-to-right languages (like English). Since my research involves a great deal of text analysis of Arabic news articles, I find myself with a lot of headaches. Most text analysis methods require some kind of normalization before diving into the actual analyses. Normalization includes things like removing punctuation, converting words to lowercase, stripping numbers out, and so on. This is essential for any kind of frequency-based analysis so that words such as don’t, Don’t, and dont are not counted as distinct words. After all, when dealing with human-generated text, typos and differences in presentation are bound to occur. Oftentimes, normalizing also includes stemming words so that words such as think, thinking, and thinks are all stemmed to “think,” as they all represent (basically) the same concept.
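The normalization steps described above can be sketched in a few lines. The post itself works in R, so this Python version is only an illustration under stated assumptions: the function names `normalize` and `word_frequencies` are hypothetical, punctuation and digits are identified by their Unicode general category (so the approach is not tied to English), and stemming is left out entirely.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase text and drop punctuation and digits, keeping letters from any script."""
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        # "P*" covers punctuation (including curly apostrophes), "N*" covers numbers
        if category.startswith("P") or category.startswith("N"):
            continue
        kept.append(ch)
    # Collapse any runs of whitespace left behind by the removed characters
    return " ".join("".join(kept).split())


def word_frequencies(text):
    """Count whitespace-separated tokens after normalization."""
    return Counter(normalize(text).split())


if __name__ == "__main__":
    print(word_frequencies("Don’t, don't and dont 123!"))
```

Because the filter relies on Unicode categories rather than an ASCII punctuation list, the same function should also strip Arabic punctuation and Arabic-Indic digits while leaving Arabic letters untouched. Lowercasing has no effect on scripts without case.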

Continue reading →

This was posted at R-bloggers last night, after the men’s 100 meter Olympic event was over. Markus Gesmann predicted Usain Bolt’s 9.63-second result to within 0.05 seconds. Even better, he did it using a simple log-linear model that didn’t control for any other factors.

Check the original article at R-bloggers, which talks more about the progression of faster running times and includes the R code used.

Don Knuth meets Charles Tilly

Python, Hadoop Streaming, and Twitter analysis

Collecting real-time Twitter data with the Streaming API

Python for network analysis

Seven Reasons to Use R for Social Network Analysis (and Three Reasons Against)

The distribution of US media attention to foreign leaders: 1950-2008

Lessons on exponential random graph modeling from Grey’s Anatomy hook-ups

Text normalization and Arabic in R

Amazing prediction of 100m men’s final within 0.05s