Benjamin Lind and I have spent the last week and a half teaching at the Social Network Analysis Summer School at HSE-St. Petersburg. We’ve had about 30 students from as far away as South Africa and Sweden, with all levels of skill and many different research interests, and have had the pleasure of teaching with some great instructors from around the world as well. If you want more detail, keep reading. You can read the backchannel chatter on the #SNASPb2013 hashtag.
I ran two labs on collecting network data from various Internet sources with Python. The first is a mashup of some of my prior workshops on collecting Twitter data via the API, and drawing network data through user mentions. The second shows how to retrieve network data by crawling blogs.
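To make "drawing network data through user mentions" concrete, here is a minimal sketch rather than the lab code itself. It assumes tweets are stored one JSON object per line and carry the standard `user` and `entities.user_mentions` fields; the function name `mention_edges` and the sample tweets are made up for illustration.

```python
import json
from collections import Counter


def mention_edges(lines):
    """Yield (author, mentioned_user) pairs from newline-delimited tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Skip messages that are not tweets (for example, delete notices).
        if "user" not in tweet:
            continue
        author = tweet["user"]["screen_name"].lower()
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            yield author, mention["screen_name"].lower()


# Hypothetical sample data standing in for collected tweets.
sample = [
    json.dumps({"user": {"screen_name": "alice"},
                "entities": {"user_mentions": [{"screen_name": "Bob"},
                                               {"screen_name": "carol"}]}}),
    json.dumps({"user": {"screen_name": "bob"},
                "entities": {"user_mentions": [{"screen_name": "alice"}]}}),
    json.dumps({"user": {"screen_name": "alice"},
                "entities": {"user_mentions": [{"screen_name": "bob"}]}}),
    json.dumps({"delete": {"status": {"id": 1, "user_id": 2}}}),
]

# Each distinct (source, target) pair becomes a directed edge; the count is its weight.
edge_weights = Counter(mention_edges(sample))
for (source, target), weight in sorted(edge_weights.items()):
    print(source, target, weight)
```

The resulting weighted edge list can be written out as CSV and loaded into whatever network package you prefer.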
Technology-wise, it was my first time using a cloud service (Amazon EC2) and IPython Notebooks for teaching purposes. A few observations on EC2 for teaching: the t1.micro instance type is not quite powerful enough to handle ~30 students parsing JSON or running scrapy, so you’ll have to move up to a larger instance. I found IPython Notebooks to be great, though: code highlighting and execution, LaTeX typesetting, and Markdown make them a winner in my book.
I also put the code for each lab on GitHub: hse-twitter and hse-scrapy. Would love any contributions to these small scripts, especially the scraping code.
Tonight is the airing of the final episode of RuPaul’s Drag Race, Season 5. They are doing what they did last season, which is to delay the final crowning until the reunion show. Apparently I wasn’t wrong last week when I said that they tape three different endings to the show, mostly to ward off Twitter leaks by fans in the audience. And apparently the queens themselves don’t know who won until everyone else does, according to Jinkx.
As Ru noted on the final-three episode and the “RuCap,” fans were encouraged to vote by tweeting and by reposting on Facebook. Although I can’t get Facebook data directly, I’m going to look at the Twitter data that I’ve collected. Before delving into the final predictions, I remembered that I have Twitter data from last year’s airing from the Twitter gardenhose. From that, we should be able to get a sense of who had the sway of public opinion on Twitter.
The graph below plots the last week of season 4, between the announcement that the queen would be crowned at the reunion, and the final reunion show. I chose to focus only on mentions of a queen’s Twitter handle, instead of using #TeamWhatever, because there weren’t many counts of those in the gardenhose. The first peak is the final contest show, and the second is the actual crowning.
The case here is rather clear cut — Sharon Needles leads everyone for nearly the whole time period. The raw counts of mentions show no contest there. I’m actually rather surprised that Phi Phi led Chad. Maybe there was another way they showed support for her?
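The handle-counting approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the code behind the graph: it assumes gardenhose tweets stored one JSON object per line with the standard `created_at` and `entities.user_mentions` fields, and the handles `queen_a` and `queen_b` are placeholders rather than real accounts.

```python
import json
from collections import Counter
from datetime import datetime

# Placeholder handles (lowercase); swap in the accounts you want to track.
HANDLES = {"queen_a", "queen_b"}

# Timestamp layout used in the created_at field of v1-era tweet JSON.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def daily_mention_counts(lines, handles):
    """Count tweets mentioning each handle, keyed by (day, handle)."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        if "created_at" not in tweet:
            continue  # not a tweet (for example, a delete notice)
        day = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT).date().isoformat()
        mentioned = {m["screen_name"].lower()
                     for m in tweet.get("entities", {}).get("user_mentions", [])}
        # A tweet counts once per tracked handle it mentions.
        for handle in mentioned & handles:
            counts[(day, handle)] += 1
    return counts


# Hypothetical sample tweets.
sample = [
    json.dumps({"created_at": "Mon Apr 29 02:15:00 +0000 2013",
                "entities": {"user_mentions": [{"screen_name": "Queen_A"}]}}),
    json.dumps({"created_at": "Mon Apr 29 03:40:00 +0000 2013",
                "entities": {"user_mentions": [{"screen_name": "queen_a"},
                                               {"screen_name": "queen_b"}]}}),
    json.dumps({"created_at": "Tue Apr 30 01:05:00 +0000 2013",
                "entities": {"user_mentions": [{"screen_name": "someone_else"}]}}),
]

counts = daily_mention_counts(sample, HANDLES)
for (day, handle), n in sorted(counts.items()):
    print(day, handle, n)
```

Binning by hour instead of by day only requires changing how `day` is derived from the parsed timestamp.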
(GIF via Dilettwat)
We’re down to the final episode. This one is for all the marbles. Wait, that’s not the best saying in this context. In any case, moving right along. In the top four episode, Detox was eliminated, but not before Roxxxy threw maybe ALL of the shade towards Jinkx (although, to Roxxxy’s credit, she says a lot of this was due to editing).
Jinkx, however, defended herself well by absolutely killing the lipsync. Probably one of the top three of the season, easy.
Getting down to the wire, it’s looking incredibly close. As it is, the model has ceased to tell us anything of value. Here are the rankings:
1 Alaska 0.6050052 1.6752789
2 Roxxxy Andrews 2.5749070 3.6076899
3 Jinkx Monsoon 3.4666713 3.2207345
But looking at the confidence intervals, all three estimates are statistically indistinguishable from zero. The remaining girls don’t have sufficient variation on the variables of interest to differentiate them from each other in terms of winning this thing.
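For readers who want to see the interval check spelled out, here is a rough sketch. It rests on a big assumption that the post does not state: that the two numeric columns in the rankings are a point estimate followed by its standard error. It also assumes an approximate 95% normal interval (estimate ± 1.96 × standard error), which may not be how the original intervals were computed.

```python
# (name, estimate, ASSUMED standard error), copied from the rankings above.
# The column meanings are an assumption; the post does not label them.
rankings = [
    ("Alaska", 0.6050052, 1.6752789),
    ("Roxxxy Andrews", 2.5749070, 3.6076899),
    ("Jinkx Monsoon", 3.4666713, 3.2207345),
]

Z = 1.96  # normal critical value for an approximate 95% interval


def confidence_interval(estimate, std_error, z=Z):
    """Return (low, high) for estimate +/- z * std_error."""
    margin = z * std_error
    return estimate - margin, estimate + margin


intervals = {name: confidence_interval(est, se) for name, est, se in rankings}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.2f}, {high:.2f}] contains zero: {low <= 0 <= high}")
```

Under that reading, every interval straddles zero, which is consistent with the estimates being indistinguishable from zero.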
So what’s a drag race forecaster to do? Well, the first thought that came to my mind was: MOAR DATA. And hunty, there’s one place where I’ve got data by the troves: Twitter.
A few weeks ago, Twitter announced that they were releasing a client for their Streaming API. It’s open-source! Get it here: https://github.com/twitter/hbc
This is pretty great news, for a few reasons:
- The Streaming API relies on a persistent connection, so handling all that messy authentication and making sure you don’t drop any data is simplified, and the client will comport with Twitter’s specs.
- Twitter is deprecating v1 of their APIs, including Streaming and RESTful. They haven’t made any dramatic changes in the Streaming API but it still means changing libraries or expecting someone who is maintaining your library of choice to update it.
- It’s all being developed and actively maintained in-house by Twitter. The maintainers, @steven and @kevino (apparently one of the perks of working at Twitter is getting an awesome username), are especially responsive with bug fixes and pull requests.
- There’s a plugin for the Twitter4j library, if you want to implement listeners that do any background data handling or parsing for particular pieces of data (deletes vs. stall_warnings). I haven’t tried this yet but it looks promising.
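To illustrate the kind of per-message handling mentioned in the last point, here is a small sketch in Python. It is not hbc or Twitter4j code (both are Java) and does not use their APIs. It only assumes the stream delivers one JSON message per line, with delete notices under a `delete` key, stall warnings under a `warning` key, and limit notices under a `limit` key; the function name and sample lines are made up.

```python
import json


def classify_message(raw_line):
    """Return a coarse label for one line read from the stream."""
    line = raw_line.strip()
    if not line:
        return "keep_alive"  # blank lines keep the connection open
    message = json.loads(line)
    if "delete" in message:
        return "delete"
    if "warning" in message:
        return "stall_warning"
    if "limit" in message:
        return "limit"
    if "text" in message and "user" in message:
        return "tweet"
    return "other"


# Hypothetical lines as they might arrive over the connection.
stream_lines = [
    '{"text": "hello", "user": {"screen_name": "alice"}}',
    '{"delete": {"status": {"id": 1, "user_id": 2}}}',
    '{"warning": {"code": "FALLING_BEHIND", "percent_full": 60}}',
    '',
]

labels = [classify_message(line) for line in stream_lines]
print(labels)
```

A listener-style design would route each label to its own handler, for example writing tweets to disk while only logging warnings.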
The downside? It’s in Java. While this used to be a nice insult when I was hacking around in CS 180 and Java was at version 1.4.2, Java has gotten much faster since then. The addition of projects like Apache Maven has made development with dependencies and handling classpaths much easier. But then, you still have to know at least a little Java to get the thing up and running.
I’ve been using this as my primary gardenhose collection device for a few weeks now with only a handful of issues, as bugs are surfacing in development but being squashed soon after.
Over the weekend I led a workshop on basic Twitter processing using Hadoop Streaming (or at least simulating Hadoop Streaming). I created three modules for it.
The first is an introduction to MapReduce that calculates word counts for text. The second is a (very) basic sentiment analysis of political tweets, and the last one is a network analysis of political tweets.
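As a rough idea of what the word-count module involves, here is a minimal sketch that simulates the map, sort, and reduce steps in a single Python process. It is not the workshop code: the function names are made up, and a real Hadoop Streaming job would split the mapper and reducer into separate scripts that read from standard input and write tab-separated key/value lines.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts for each word.

    Expects pairs sorted by word, which is what the shuffle/sort phase
    provides between map and reduce.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def simulate(lines):
    """Run map, sort, and reduce locally, standing in for a real job."""
    return dict(reducer(sorted(mapper(lines))))


word_counts = simulate(["the queen wins", "the crown goes to the queen"])
print(word_counts)
```

The `sorted` call plays the role that a shell `sort` plays when you pipe a mapper script into a reducer script to test locally.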
All the code for these workshops is on the site. What other kinds of analysis can/should be done with Twitter data?
Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.
I’m drawing on some of my previous “tworkshops,” which are meant to take people from zero knowledge to being comfortable with basic analysis of Twitter data, with the potential for parallel processing in systems like Hadoop MapReduce.
Let’s start with the basics of what the data look like and how to access them.