Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.

I’m drawing on some of my previous “tworkshops,” which are meant to take people from zero knowledge to being able to do basic analysis of Twitter data, with the potential for parallel processing in systems like Hadoop MapReduce.

Let’s start with the basics of what the data look like and how to access it.

Accessing the Twitter API

The way that researchers and others get large, publicly available Twitter datasets is through the Twitter API. API stands for Application Programming Interface, and many services that want to build a developer community around their product release one. Facebook has an API that is somewhat restrictive, while Klout has an API that lets you automatically look up Klout scores and all their different facets.

The Twitter API has two different flavors: RESTful and Streaming. The RESTful API is useful for getting things like lists of followers and those who follow a particular user, and is what most Twitter clients are built off of. We are not going to deal with the RESTful API right now, but you can find more information on it here: https://dev.twitter.com/docs/api. Right now we are going to focus on the Streaming API (more info here: https://dev.twitter.com/docs/streaming-api). The Streaming API works by making a request for a specific type of data — filtered by keyword, user, geographic area, or a random sample — and then keeping the connection open as long as there are no errors in the connection.

For my own purposes, I’ve been using the tweepy package to access the Streaming API. I’ve incorporated two changes in my own fork that have worked well for me on both Linux and OSX systems: https://github.com/raynach/tweepy

Understanding Twitter Data

Once you’ve connected to the Twitter API, whether via the RESTful API or the Streaming API, you’re going to start getting a bunch of data back.  The data you get back will be encoded in JSON, or JavaScript Object Notation. JSON is a way to encode complicated information in a platform-independent way.  It could be considered the lingua franca of information exchange on the Internet.  When you click a snazzy Web 2.0 button on Facebook or Amazon and the page produces a lightbox (a box that hovers above a page without leaving the page you’re on now), there was probably some JSON involved.

JSON is a simple and elegant way to encode complex data structures. When a tweet comes back from the API, this is what it looks like (with a little bit of beautifying):

{
    "contributors": null, 
    "truncated": false, 
    "text": "TeeMinus24's Shirt of the Day is Palpatine/Vader '12. Support the Sith. Change you can't stop. http://t.co/wFh1cCep", 
    "in_reply_to_status_id": null, 
    "id": 175090352598945794, 
    "entities": {
        "user_mentions": [], 
        "hashtags": [], 
        "urls": [
            {
                "indices": [
                    95, 
                    115
                ], 
                "url": "http://t.co/wFh1cCep", 
                "expanded_url": "http://fb.me/1isEdQJSq", 
                "display_url": "fb.me/1isEdQJSq"
            }
        ]
    }, 
    "retweeted": false, 
    "coordinates": null, 
    "source": "Facebook", 
    "in_reply_to_screen_name": null, 
    "id_str": "175090352598945794", 
    "retweet_count": 0, 
    "in_reply_to_user_id": null, 
    "favorited": false, 
    "user": {
        "follow_request_sent": null, 
        "profile_use_background_image": true, 
        "default_profile_image": false, 
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme14/bg.gif", 
        "verified": false, 
        "profile_image_url_https": "https://si0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg", 
        "profile_sidebar_fill_color": "efefef", 
        "is_translator": false, 
        "id": 281077639, 
        "profile_text_color": "333333", 
        "followers_count": 43, 
        "protected": false, 
        "location": "", 
        "profile_background_color": "131516", 
        "id_str": "281077639", 
        "utc_offset": -18000, 
        "statuses_count": 461, 
        "description": "We are a limited edition t-shirt company. We make tees that are designed for the fan; movies, television shows, video games, sci-fi, web, and tech. We have it!", 
        "friends_count": 52, 
        "profile_link_color": "009999", 
        "profile_image_url": "http://a0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg", 
        "notifications": null, 
        "show_all_inline_media": false, 
        "geo_enabled": false, 
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme14/bg.gif", 
        "screen_name": "TeeMinus24", 
        "lang": "en", 
        "profile_background_tile": true, 
        "favourites_count": 0, 
        "name": "Vincent Genovese", 
        "url": "http://www.teeminus24.com", 
        "created_at": "Tue Apr 12 15:48:23 +0000 2011", 
        "contributors_enabled": false, 
        "time_zone": "Eastern Time (US & Canada)", 
        "profile_sidebar_border_color": "eeeeee", 
        "default_profile": false, 
        "following": null, 
        "listed_count": 1
    }, 
    "geo": null, 
    "in_reply_to_user_id_str": null, 
    "possibly_sensitive": false, 
    "created_at": "Thu Mar 01 05:29:27 +0000 2012", 
    "possibly_sensitive_editable": true, 
    "in_reply_to_status_id_str": null, 
    "place": null
}

Let’s move our focus now to the actual elements of the tweet. Most of the keys, that is, the words on the left of the colons, are self-explanatory. The most important ones are “text”, “entities”, and “user”. “text” is the text of the tweet; “entities” contains the user mentions, hashtags, and URLs used in the tweet, separated out for easy access; and “user” contains a wealth of information on the user, from the URL of their profile image to the date they joined Twitter.
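Since each tweet arrives as a JSON string, Python’s json module will turn it into a nested dictionary that you can pull those fields out of. Here is a minimal sketch, assuming tweet_json holds a single tweet as a string (like the example above):

import json

tweet = json.loads(tweet_json)   ## tweet_json is one tweet as a JSON string

print tweet['text']                    ## the text of the tweet
print tweet['user']['screen_name']     ## who posted it
print tweet['user']['created_at']      ## when the user joined Twitter

## "entities" breaks out mentions, hashtags, and URLs for easy access
for url in tweet['entities']['urls']:
    print url['expanded_url']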

Now that you see what data come with a tweet, you can imagine the interesting analyses that become possible once you have collected a whole lot of them.

A Disclaimer on Collecting Tweets

Unfortunately, you do not have carte blanche to share the tweets you collect. Twitter’s API Terms of Service (https://dev.twitter.com/terms/api-terms) restrict publicly releasing datasets of raw tweets. This is unfortunate for collaboration when colleagues have collected unique datasets.  However, you can share derivative analyses of tweets, such as content analysis and aggregate statistics.
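For instance, an aggregate statistic like a hashtag count can be shared even though the raw tweets cannot. Here is a minimal sketch, assuming a hypothetical file tweets.json that contains one collected tweet per line (the format the listener below writes out):

import json
from collections import Counter

hashtag_counts = Counter()

## tweets.json is a hypothetical file with one JSON-encoded tweet per line
with open('tweets.json') as infile:
    for line in infile:
        tweet = json.loads(line)
        for hashtag in tweet['entities']['hashtags']:
            hashtag_counts[hashtag['text'].lower()] += 1

## the ten most common hashtags, an aggregate statistic you can share
for hashtag, count in hashtag_counts.most_common(10):
    print hashtag, count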

Collecting Data

Let’s get to it. The first step is to get a copy of tweepy (either by checking out the repository or just downloading it) and install it.

The next thing to do is to create an instance of a tweepy StreamListener to handle the incoming data. The way that I have mine set up is that I start a new file for every 20,000 tweets, tagged with a prefix and a timestamp. I also keep another file open for the list of status IDs that have been deleted, which are handled differently than other tweet data. I call this file slistener.py.

from tweepy import API, StreamListener
import json, time, sys

class SListener(StreamListener):

    def __init__(self, api = None, fprefix = 'streamer'):
        self.api = api or API()
        self.counter = 0
        self.fprefix = fprefix
        self.output  = open(fprefix + '.' 
                            + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
        self.delout  = open('delete.txt', 'a')

    def on_data(self, data):
        ## Statuses contain the in_reply_to_status field; deletion notices,
        ## limit notices, and warnings arrive as their own message types.
        if 'in_reply_to_status' in data:
            self.on_status(data)
        elif 'delete' in data:
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:
            warning = json.loads(data)['warning']
            print warning['message']
            return False

    def on_status(self, status):
        self.output.write(status + "\n")

        self.counter += 1

        if self.counter >= 20000:
            self.output.close()
            self.output = open(self.fprefix + '.' 
                               + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
            self.counter = 0

        return

    def on_delete(self, status_id, user_id):
        self.delout.write( str(status_id) + "\n")
        return

    def on_limit(self, track):
        ## track is the number of tweets missed because of rate limiting
        sys.stderr.write(str(track) + "\n")
        return

    def on_error(self, status_code):
        sys.stderr.write('Error: ' + str(status_code) + "\n")
        return False

    def on_timeout(self):
        sys.stderr.write("Timeout, sleeping for 60 seconds...\n")
        time.sleep(60)
        return 

Next, we need the script that does the collecting itself. I call this file streaming.py. You can collect on users, keywords, or specific locations defined by bounding boxes. The API documentation has more information on this. For now, let’s just track some popular keywords — obama and romney (keywords are case-insensitive).

from slistener import SListener
import time, tweepy, sys

## authentication
username = '' ## put a valid Twitter username here
password = '' ## put a valid Twitter password here
auth     = tweepy.auth.BasicAuthHandler(username, password)
api      = tweepy.API(auth)

def main():
    track = ['obama', 'romney']
 
    listen = SListener(api, 'myprefix')
    stream = tweepy.Stream(auth, listen)

    print "Streaming started..."

    try: 
        stream.filter(track = track)
    except:
        print "error!"
        stream.disconnect()

if __name__ == '__main__':
    main()

Given the volume of tweets on the US election right now, you’re bound to be gathering a bunch of data. Hope you’ve got some disk space.