Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.

I’m drawing on some of my previous “tworkshops” that are meant to bring people from zero knowledge to being able to do basic analysis of Twitter data, with potential for parallel processing in systems like Hadoop MapReduce.

Let’s start with the basics of what the data look like and how to access it.

Accessing the Twitter API

The way that researchers and others who want large, publicly available Twitter datasets get them is through its API. API stands for Application Programming Interface, and many services that want to build a developer community around their product release one. Facebook has an API that is somewhat restrictive, while Klout has an API that lets you automatically look up Klout scores and all their different facets.

The Twitter API has two different flavors: RESTful and Streaming. The RESTful API is useful for getting things like lists of a user’s followers and the people they follow, and is what most Twitter clients are built on. We are not going to deal with the RESTful API right now, but you can find more information on it here: https://dev.twitter.com/docs/api. Right now we are going to focus on the Streaming API (more info here: https://dev.twitter.com/docs/streaming-api). The Streaming API works by making a request for a specific type of data — filtered by keyword, user, geographic area, or a random sample — and then keeping the connection open as long as there are no errors.

For my own purposes, I’ve been using the tweepy package to access the Streaming API. I’ve incorporated two changes in my own fork that have worked well for me on both Linux and OSX systems: https://github.com/raynach/tweepy

Understanding Twitter Data

Once you’ve connected to the Twitter API, whether via the RESTful API or the Streaming API, you’re going to start getting a bunch of data back. The data will be encoded in JSON, or JavaScript Object Notation. JSON is a way to encode complicated information in a platform-independent way. It could be considered the lingua franca of information exchange on the Internet. When you click a snazzy Web 2.0 button on Facebook or Amazon and the page produces a lightbox (a box that hovers above a page without leaving the page you’re on), there was probably some JSON involved.
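
To make that concrete, here’s a minimal sketch of encoding and decoding a bit of JSON with Python’s built-in json module:

import json

## a Python dictionary round-trips through a JSON string
encoded = json.dumps({'text': 'hello world', 'retweet_count': 2})
decoded = json.loads(encoded)
print decoded['text']    ## prints: hello world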

JSON is a rather simple and elegant way to encode complex data structures. When a tweet comes back from the API, this is what it looks like (with a little beautifying):

{
    "contributors": null, 
    "truncated": false, 
    "text": "TeeMinus24's Shirt of the Day is Palpatine/Vader '12. Support the Sith. Change you can't stop. http://t.co/wFh1cCep", 
    "in_reply_to_status_id": null, 
    "id": 175090352598945794, 
    "entities": {
        "user_mentions": [], 
        "hashtags": [], 
        "urls": [
            {
                "indices": [
                    95, 
                    115
                ], 
                "url": "http://t.co/wFh1cCep", 
                "expanded_url": "http://fb.me/1isEdQJSq", 
                "display_url": "fb.me/1isEdQJSq"
            }
        ]
    }, 
    "retweeted": false, 
    "coordinates": null, 
    "source": "<a href="\&quot;http://www.facebook.com/twitter\&quot;" rel="\&quot;nofollow\&quot;">Facebook</a>", 
    "in_reply_to_screen_name": null, 
    "id_str": "175090352598945794", 
    "retweet_count": 0, 
    "in_reply_to_user_id": null, 
    "favorited": false, 
    "user": {
        "follow_request_sent": null, 
        "profile_use_background_image": true, 
        "default_profile_image": false, 
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme14/bg.gif", 
        "verified": false, 
        "profile_image_url_https": "https://si0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg", 
        "profile_sidebar_fill_color": "efefef", 
        "is_translator": false, 
        "id": 281077639, 
        "profile_text_color": "333333", 
        "followers_count": 43, 
        "protected": false, 
        "location": "", 
        "profile_background_color": "131516", 
        "id_str": "281077639", 
        "utc_offset": -18000, 
        "statuses_count": 461, 
        "description": "We are a limited edition t-shirt company. We make tees that are designed for the fan; movies, television shows, video games, sci-fi, web, and tech. We have it!", 
        "friends_count": 52, 
        "profile_link_color": "009999", 
        "profile_image_url": "http://a0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg", 
        "notifications": null, 
        "show_all_inline_media": false, 
        "geo_enabled": false, 
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme14/bg.gif", 
        "screen_name": "TeeMinus24", 
        "lang": "en", 
        "profile_background_tile": true, 
        "favourites_count": 0, 
        "name": "Vincent Genovese", 
        "url": "http://www.teeminus24.com", 
        "created_at": "Tue Apr 12 15:48:23 +0000 2011", 
        "contributors_enabled": false, 
        "time_zone": "Eastern Time (US &amp; Canada)", 
        "profile_sidebar_border_color": "eeeeee", 
        "default_profile": false, 
        "following": null, 
        "listed_count": 1
    }, 
    "geo": null, 
    "in_reply_to_user_id_str": null, 
    "possibly_sensitive": false, 
    "created_at": "Thu Mar 01 05:29:27 +0000 2012", 
    "possibly_sensitive_editable": true, 
    "in_reply_to_status_id_str": null, 
    "place": null
}

Let’s move our focus now to the actual elements of the tweet. Most of the keys, that is, the words on the left of the colon, are self-explanatory. The most important ones are “text”, “entities”, and “user”. “text” is the text of the tweet; “entities” contains the user mentions, hashtags, and links used in the tweet, separated out for easy access; and “user” contains a lot of information on the user, from the URL of their profile image to the date they joined Twitter.
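
For instance, once a raw tweet like the one above has been parsed with json.loads, those keys can be pulled out directly. A quick sketch, assuming the raw JSON string from the stream is in a variable named data:

import json

tweet = json.loads(data)                  ## data holds one raw tweet
print tweet['text']                       ## the text of the tweet
print tweet['user']['screen_name']        ## e.g. "TeeMinus24"
for url in tweet['entities']['urls']:
    print url['expanded_url']             ## e.g. "http://fb.me/1isEdQJSq"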

Now that you see what data you get with a tweet, you can envision interesting types of analysis that can emerge by analyzing a whole lot of them.

A Disclaimer on Collecting Tweets

Unfortunately, you do not have carte blanche to share the tweets you collect. Twitter restricts publicly releasing datasets according to its API Terms of Service (https://dev.twitter.com/terms/api-terms). This is unfortunate for collaboration when colleagues have collected unique datasets. However, you can share derivative analyses of tweets, such as content analysis and aggregate statistics.
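
Hashtag frequencies are one example of an aggregate statistic you can share. Here’s a minimal sketch of computing them, assuming tweets is a list of parsed tweet dictionaries like the one above:

from collections import Counter

hashtag_counts = Counter()
for tweet in tweets:
    for tag in tweet['entities']['hashtags']:
        hashtag_counts[tag['text'].lower()] += 1

print hashtag_counts.most_common(10)    ## the ten most frequent hashtags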

Collecting Data

Let’s get to it. The first step is to get a copy of tweepy (either by checking out the repository or just downloading it) and install it.
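
One way to do that, assuming you have git installed, is to clone the fork mentioned above and run its setup script:

git clone https://github.com/raynach/tweepy
cd tweepy
python setup.py install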

The next thing to do is to create an instance of a tweepy StreamListener to handle the incoming data. The way that I have mine set up is that I start a new file for every 20,000 tweets, tagged with a prefix and a timestamp. I also keep another file open for the list of status IDs that have been deleted, which are handled differently than other tweet data. I call this file slistener.py.

from tweepy import API, StreamListener
import json, time, sys

class SListener(StreamListener):

    def __init__(self, api = None, fprefix = 'streamer'):
        self.api = api or API()
        self.counter = 0
        self.fprefix = fprefix
        ## one output file per batch of tweets, named prefix.timestamp.json
        self.output  = open(fprefix + '.' 
                            + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
        ## deleted status IDs are logged to a separate file
        self.delout  = open('delete.txt', 'a')

    def on_data(self, data):
        ## dispatch on the kind of message the stream sent us
        if  'in_reply_to_status' in data:
            self.on_status(data)
        elif 'delete' in data:
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:
            warning = json.loads(data)['warning']
            print warning['message']
            return False

    def on_status(self, status):
        ## write the raw JSON out, one tweet per line
        self.output.write(status + "\n")

        self.counter += 1

        ## roll over to a new output file every 20,000 tweets,
        ## in the same directory as the file opened in __init__
        if self.counter >= 20000:
            self.output.close()
            self.output = open(self.fprefix + '.' 
                               + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
            self.counter = 0

        return

    def on_delete(self, status_id, user_id):
        ## log the ID of the deleted status
        self.delout.write( str(status_id) + "\n")
        return

    def on_limit(self, track):
        ## 'track' is the number of tweets the stream dropped to rate limiting
        sys.stderr.write(str(track) + "\n")
        return

    def on_error(self, status_code):
        sys.stderr.write('Error: ' + str(status_code) + "\n")
        ## returning False disconnects the stream
        return False

    def on_timeout(self):
        sys.stderr.write("Timeout, sleeping for 60 seconds...\n")
        time.sleep(60)
        return 

Next, we need the script that does the collecting itself. I call this file streaming.py. You can collect on users, keywords, or specific locations defined by bounding boxes. The API documentation has more information on this. For now, let’s just track some popular keywords — obama and romney (keywords are case-insensitive).

from slistener import SListener
import time, tweepy, sys

## authentication -- create an app on the Twitter developer site to get these
## (basic username/password auth is deprecated for the streaming API)
consumer_key        = '' ## put your app's consumer key here
consumer_secret     = '' ## put your app's consumer secret here
access_token        = '' ## put your access token here
access_token_secret = '' ## put your access token secret here

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api  = tweepy.API(auth)

def main():
    track = ['obama', 'romney']
 
    listen = SListener(api, 'myprefix')
    stream = tweepy.Stream(auth, listen)

    print "Streaming started..."

    try: 
        stream.filter(track = track)
    except Exception as e:
        print "error: %s" % e
        stream.disconnect()

if __name__ == '__main__':
    main()
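
Tracking keywords is only one mode of collection. As a rough sketch of the others, using the auth and listener from above (the user ID below is a placeholder, not a real account):

stream = tweepy.Stream(auth, listen)

## pick one of the following; each call blocks while the stream is open
stream.filter(follow = ['12345'])    ## specific users, as numeric user ID strings
stream.filter(locations = [-74.0, 40.0, -73.0, 41.0])    ## bounding box: SW lon, SW lat, NE lon, NE lat (roughly NYC)
stream.sample()    ## the small random sample of all public tweets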

Given the volume of tweets on the US election right now, you’re bound to be gathering a bunch of data. Hope you’ve got some disk space.
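
Once the files start piling up, reading them back for analysis is straightforward, since each line of each file is one JSON-encoded tweet. A minimal sketch, assuming the 'myprefix' file prefix used in streaming.py above:

import json, glob

tweets = []
for fname in glob.glob('myprefix.*.json'):
    for line in open(fname):
        line = line.strip()
        if line:    ## skip any blank lines
            tweets.append(json.loads(line))

print "Read %d tweets." % len(tweets)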

  • basic auth has been deprecated for a while now – I’m pretty sure you have to do OAuth instead for the streaming API. Straightforward, but means a slight change to your code in streaming.py

    • Good call, Toby. I’ll update the post a little later with how to do this with OAuth.

      • gigih_septianto

        Hi Alex, I got a 401 error. Does it have anything to do with OAuth? Have you already published the version that does this with OAuth? I'd really appreciate it; it would be so helpful. Thanks!

        • A few others have posted the OAuth instructions in the comments.

  • ghrossman

    Does this allow you to get API RTs? (As compared to old-school manual RT)

    • According to the Twitter API, if you’re tracking by user, you get: tweets
      created by the user, tweets which were retweeted by the user, replies to any
      tweet created by the user, retweets of any tweet created by the user, and
      “manual” replies to the user created without using Twitter’s “reply” button.

      So yes.

      • Logesh

        The code is working fine, but I'm not able to convert the JSON to XML or import it into MySQL; it throws an error.

    • I’m also going to write a post (hopefully tomorrow) which suggests transitioning away from the tweepy package to the Twitter-created “hosebird” package, which is written in Java.

      https://github.com/twitter/hbc

  • andra

    This is so helpful for my final paper research. Thank you so much!!!

  • Brad

    Super helpful, thanks. Any chance you could do the update for OAuth?

    • Ahh, yeah. I will try to get to this soon and post an update.

      • Till

        With tweepy, it's dead simple: you need to create an app at the Twitter developer site and change the code to:

        consumer_key = "whatever it is"
        consumer_secret = "ditto"
        access_token = "ditto"
        access_token_secret = "ditto"

        # OAuth process, using the keys and tokens
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        api = tweepy.API(auth)

        BTW: How do you handle disconnects and the like? Currently, I'm restarting my process with supervisord, but that's more or less the brutal version of it 🙂

        • Nice! Thanks for posting.

          I handle them in a rather shameful way — putting the whole thing in a bash loop.

  • Shakya D Ganguly

    Hi, do you think I could use this code in my project, with appropriate credit to you? Many thanks.

  • B

    Hi,

    Thanks for the tutorial! Can you explain how I can adjust the code to track something other than words (e.g., locations, users)? Would it be possible to just track the latest tweets too?

    • B

      Btw I tried to change track to locations, but I'm getting an error when running

      locations = ["-180.-90.180.90"]
      listen = SListener(api, 'myprefix')
      stream = tweepy.Stream(auth, listen)

      print "Streaming started..."

      try:
          stream.filter(locations=locations)

      • You need to make that locations list contain four numeric values (SW longitude, SW latitude, NE longitude, NE latitude):

        locations = [-180, -90, 180, 90]

    • Sure, if you wanted to do users, you could do:

      users = ['alexhanna', 'barackobama']
      stream.filter(follow = users)

      You could get a random sample with stream.sample()

  • anastasiasf

    Hi, I found your code while trying to figure out Twitter OAuth and how to collect tweets. However, it seems that I can't create the .json file. I get the following error: No such file or directory: '../data/test.20131211-130817.json' (or whatever the timestamp is). Is there something specific that I can do to fix this? Thanks!

    • Hi, you probably need to create a “data” directory.

      So if you’re running this file in /twitter/bin/streaming.py, you need to create a /twitter/data/ directory.

    • anastasiasf

      Actually, it was my error! However, I have one question. I want to track all users that post links (so they have 'http' in the tweet). Is that possible to achieve with your code?

      • You could probably add something to the listener to check for a link in the tweet data structure.

        So instead of

        if 'in_reply_to_status' in data:
            self.on_status(data)

        you could write

        tweet = json.loads(data)
        if len(tweet['entities']['urls']) > 0:
            self.on_status(data)

  • Sri

    Hey Alex!! Can we avoid the Twitter API and still mine the information we want? For example, without using the Streaming API, can I get the tweets from a particular location?

    • You probably have to use one of the Twitter APIs. Why do you want to avoid using them?

  • Charly Carrillo

    Hi, I'd like to get just the tweets from a specific user. I'm trying to add the user name in the track variable but it didn't work. How can I achieve this? Thanks!

  • lsk26

    Excellent tutorial! Many thanks for this. I am collecting some location-based data at the moment. Actually, I only need some of the information from the whole tweet (i.e. timestamp, user ID, and location). What would be a way to modify the script so only those attributes are saved (rather than everything), to save disk space?

    • Sure. You could change what information you actually return in the listener. So instead of just writing the whole JSON bit, you could do something like:

      def on_status(self, status):
          self.output.write("\t".join([ status['created_at'], status['user']['id'], status['geo']['coordinates'] ]) + "\n")

          self.counter += 1

          if self.counter >= 20000:
              self.output.close()
              self.output = open(self.fprefix + '.'
                                 + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')
              self.counter = 0

          return

      Except there are multiple places where Twitter reports geolocation, I think? So replace status['geo']['coordinates'] with something which will be more accurate.

      • Juan Valladolid

        Hi Alex, thank you for the tutorial.

        When modifying the output.write to a specific feature I get the error: "string indices must be integers, not str".

        Any ideas? Thanks !

        • Sounds like you have to cast an integer to string. Can you post your code?

          • Juan Valladolid

            Well, I would like to get only the text and geo coordinates features. Any idea? Thanks!

            def on_status(self, status):
                self.output.write("\t".join([ status['text'], status['geo']['coordinates'] ]) + "\n")

                self.counter += 1

                if self.counter >= 20000:

          • Okay, so you probably want to use

            self.output.write("\t".join([ status['text'], str(status['geo']['coordinates']) ]) + "\n")

          • Juan Valladolid

            I thought that could work, but still the same error..

            def on_status(self, status):
            --->    self.output.write("\t".join([ status['text'], str(status['geo']['coordinates']) ]) + "\n")
                    self.counter += 1

            TypeError: string indices must be integers, not str

          • Oh, I read that wrong. I’m not sure — are you sure the variable status is a dictionary and not a string?

  • Barry

    How do you limit your tweet searches to the USA only? I know the US WOE ID is 23424977.
    Any snippet of code showing how to do this would be great.

    • Hi Barry, you can use bounding boxes to check for location — https://dev.twitter.com/docs/platform-objects/places

      • Barry

        Is that example Python code using the tweepy library?

        • You’ll have to do some bounding box math. Here’s some code to check for this: http://stackoverflow.com/questions/18295825/determine-if-point-is-within-bounding-box

          • Barry

            I see; bounding boxes would work for the mainland USA. I was just wondering if there's a better way than bounding boxes to get the whole USA. You may be able to bound the mainland USA except Hawaii and Alaska, which would be fine. I was wondering if one could just use the US WOE ID, but I wasn't sure how that can be done using tweepy. If not, then I'll do some bounding box math.

          • Not that I know of. You can do some string matching in user-defined descriptions and may catch un-geolocated tweets but that presents its own set of issues.

  • Nidhi

    Thanks for the post. Is it possible to extract data for a previous time period, like a sample of the last 3 months? When I try to pull data it gives me only data for a single day.

    • This only gives real-time data. You can’t get historical data with this method.

  • Nancy Aisosa

    I get a syntax error with

    elif 'delete' in data:

  • vikrant

    Hey, I want to analyse tweets about soccer. Where can I get them, and how can I create an offline real-time system?

  • Ahlem

    Hi Alex and hi everyone,

    I am a beginner in Python and MySQL. I have a project where I want to stream tweets from a specific country. As a first step, I want to store my text file containing tweets in a database.

    I have no problem with the user table, but with the location table I have a type error, I think with geo and coordinates.

    This is my code:

    import json
    import codecs
    import MySQLdb
    import _mysql

    db = MySQLdb.connect(host='127.0.0.1', user="root", passwd="mysql", db="collection", charset='utf8', use_unicode=True)
    cur = db.cursor()

    sql_request = 'insert ignore into tweetlocation (Tid,Ttext,Uid,geo_enabled,geo,coordinates) Values (%s,%s,%s,%s,%s,%s)'
    #sql_request = 'insert into aaa (followers_count,friends_count) Values (%s,%s)'
    print 'hello'
    f = codecs.open("test22 ok.txt", 'r', 'utf-8')

    cpt = 0
    for line in f:
    if len(line) 100000:
    break
    db.commit()
    cur.close()
    db.close()
    f.close()

    Please tell me what is wrong with my code?

  • Jenny Gnil

    Is there a possibility to search for certain hashtags and save only tweets containing those hashtags?

    • There’s a field in the Twitter documentation which encodes for language, I believe. And you could check if the string matched Obama using str.find or whatever. And you can change the self.counter >= 20000 line to exit after reaching 300.

      • Jenny Gnil

        Thanks! Can I restrict the output so that I only get the tweet text with the user name?
        I tried this but I get an error (it's the error of the main class: print "error!")
        My code:

        def on_data(self, data):

            if 'in_reply_to_status' in data:
                self.on_status(data)
            elif 'delete' in data:
                delete = json.loads(data)['delete']['status']
                if self.on_delete(delete['id'], delete['user_id']) is False:
                    return False
            elif 'limit' in data:
                if self.on_limit(json.loads(data)['limit']['track']) is False:
                    return False
            elif 'warning' in data:
                warning = json.loads(data)['warnings']
                print warning['message']
                return false

        def on_status(self, status):
            dictionary = json.loads(status)
            text = dictionary['text']
            user_name = dictionary['user']['name']
            self.output.write(text + ", " + user_name + "\n")

        • Hrm — maybe you should try removing the try / except in the main class and see what the actual error is.

          • Jenny Gnil

            It's the same error (the error from the main class), and I tried to use an additional method for the output but I got the same error. Do I have to return something so that the method on_status will be closed, or do you know another reason for the error?

          • Not sure — I’d have to see the whole script.

  • Hello Alex, you have posted a very nice tutorial.
    I wanted to know how to track a word in the last 5 minutes of tweets.

  • lsk26

    Hi Alex,
    I am running into some issues with the SListener. It used to work super smoothly, but currently I am getting some errors whenever the self.counter limit is reached (so technically a new JSON file should be created). I keep getting an "error!" message. Do you have any idea what this could be about? I kept your scripts largely unchanged, so I do not expect it is related to my changes in the code. Would be super grateful for your response! Many thanks

    • Hrm, not sure. You could edit the error handling to make it more informative and go from there.

      • I've been running into an error at the same point. Editing the error handling in streaming.py main(), I'm getting an IOError, and a further edit reveals "no such file or directory".

  • MONIKA BANSAL

    I have created a small program for streaming Twitter data in a specified date range. Earlier it was running perfectly, but now it's not creating any file and it shows count 0.

    import tweepy
    import csv

    access_token = "3922189213-dojmvufY0yVqdMt8BJEm4dXefP3BhQVhkD"
    access_token_secret = "MGUrD5y4bTPxtgbcP96lsSOv202XFivVJCQqaMj"
    consumer_key = "CQGnx5DY5DgdNRnb74Xgk"
    consumer_secret = "5otBreM8LDVnKTnJEtCc1ISMFrpp7V8mi8vGRKrX2P6"

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    # Open/Create a file to append data
    textFile = open('fetched_tweets_baaghi4.txt', 'a')

    #Use csv Writer
    #csvWriter = csv.writer(csvFile)

    count = 0

    for tweet in tweepy.Cursor(api.search, q="#Baaghi", lang="en",
                               since_id="2016-03-16", until="2016-04-15").items():
        print (tweet.created_at, ascii(tweet.text))
        count = count + 1
        #csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])
        textFile.write(ascii(tweet.text) + '\n')

    print(count)

    Earlier the data was getting streamed and the file was also getting created, but now it shows count zero, there is no data in the file, and the program stops without any error.
    I tried the whole process on another PC too, but got no result.
    I urgently need help; what could be the problem?

    • I’d check to see how far back you can go with tweepy with regards to dates. I think if it’s too far back you have to get hold of the historic data instead.

      • MONIKA BANSAL

        I changed the dates but have the same issues. I tried doing some debugging and it's not going inside the for loop. What could be a possible reason for this when the same code was running earlier?

        • MONIKA BANSAL

          Tried a few more dates; problem solved.
          Thanks a lot 🙂

  • Mangnier

    Excellent tutorial.
    I just have one question about the track parameter: when I receive a tweet with the keyword inside, my stream is disconnected. Why?

    Can you help me?

  • Andrew Emil

    Hello,
    is there a way to find whether a specific user is mentioned in other tweets or not?
    Thanks

    • Yeah, you could just use the username as the keyword.