Over the weekend I led a workshop on basic Twitter processing using Hadoop Streaming (or at least simulating Hadoop Streaming). I created three modules for it.
The first is an introduction of MapReduce that calculates word counts for text. The second is a (very) basic sentiment analysis of political tweets, and the last one is a network analysis of political tweets.
All the code for these workshops is on the site. What other kinds of analysis can/should be done with Twitter data?
Working with right-to-left languages like Arabic in R can be a bit of a headache, especially when mixed with left-to-right languages (like English). Since my research involves a great deal of text analysis of Arabic news articles, I find myself with a lot of headaches. Most text analysis methods require some kind of normalization before diving into the actual analyses. Normalization includes things like removing punctuation, converting words to lowercase, stripping numbers out, and so on. This is essential for any kind of frequency-based analysis so that words such as don’t, Don’t, and dont are not considered unique words. After all, when dealing with human-generated text, typos and differences in presentation are bound to occur. Often times, normalizing also includes stemming words so that words such as think, thinking, and thinks are all stemmed to “think” as they all represent (basically) the same concept.