Over the weekend I led a workshop on basic Twitter processing using Hadoop Streaming (or at least simulating Hadoop Streaming). I created three modules for it.
The first is an introduction to MapReduce that calculates word counts for text. The second is a (very) basic sentiment analysis of political tweets, and the last one is a network analysis of political tweets.
All the code for these workshops is on the site. What other kinds of analysis can/should be done with Twitter data?
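For readers who haven’t seen the pattern, here is a minimal sketch of what a streaming-style word count can look like in Python. This is not the workshop code itself: the function names `mapper`, `reducer`, and `simulate` are made up for illustration, and a plain local sort stands in for Hadoop’s shuffle step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word. Records must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def simulate(lines):
    """Run map, then a local sort (standing in for the shuffle), then reduce."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Two tiny made-up "documents", purely for illustration.
    print(simulate(["the cat", "The dog the"]))
```

With real Hadoop Streaming the mapper and reducer would be separate scripts that read from standard input and write to standard output; the sort in `simulate` only imitates what the framework does between the two phases.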
Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.
I’m drawing on some of my previous “tworkshops”, which are meant to bring people from zero knowledge to being able to find their way around basic analysis of Twitter data, with the potential for parallel processing in systems like Hadoop MapReduce.
Let’s start with the basics of what the data look like and how to access them.
This got posted at R-bloggers last night, after the men’s 100 meter Olympic event was over. Markus Gesmann predicted Usain Bolt’s 9.63 second result within 0.05 seconds. Even better, he did it using a simple log-linear model that didn’t control for any other factors.
Check the original article at R-bloggers, which talks more about the progression of faster running times and includes the R code used.
Going to be at ASA? Come hang out with the Bad Hessians!
Friday, August 17, 8 PM.
Euclid Hall Bar & Kitchen, 1317 14th Street
RSVP on the Facebooks
I briefly talked about GitHub, the code hosting site built around the Git version control system, in my last post on taking notes in Markdown. A few days ago John Norman wrote a post calling GitHub “the most important social network”. He says this because, by virtue of the discussion features built into the system, discussions can occur around code and changes can be incorporated rather easily. But the more intriguing part of his discussion, I think, lies in the potential to change the nature of knowledge production, not only for code:
Let me tell you about knowledge production: much of it is private. I have a PhD in English and wrote a dissertation on the interaction between literary and medical knowledge in the sixteenth and seventeenth centuries. My research notes and revisions were essentially private. My drafts were my property. In certain highly ceremonial performances, I might share my “work in progress” with an individual (a faculty advisor or an eminent scholar or a friend who could provide feedback), or with a study group interested in the project, or from the lectern at a conference. But for the most part, sharing to the entire world happened at the moment of final “production,” when the artifact was safely ensconced in the library or computer, and indexed by domain experts. This pattern is much the same in the social sciences and the sciences (the sciences are circulating more papers in pre-publication form, but the door is closed to full access to the laboratory).
This is actually a very intriguing prospect for me. Is there the potential to share and think through research notes in the actual process of writing them up? Does the same kind of system hold promise for writing articles and research reports? And are scholars willing to show that much of their Goffmanian “back stage” to public audiences?
As a token of my commitment to this experiment, here are my own notes for the prelim exam I’m studying for: http://github.com/raynach/comparative-historical. I have a number of apprehensions about doing this, but I am very curious about the degree to which we can bring the collaboration of open-source code projects to other domains of knowledge production.
What other projects could social scientists use version control systems for?
We’re really excited to launch a new portion of the site today, which we are calling Ask the Bad Hessians. The name is a bit of a misnomer: it’s actually a crowdsourced site in which anyone can ask (and answer) the questions posted there. If you are familiar with Stack Overflow, the software we’re using is a clone of that. When you post a question, anybody can reply with an answer to it. Answers are voted “up” or “down” by other users, and the original asker can pick the answer s/he deems correct.
Stack Overflow is where I go for a bunch of my own programming questions, and from my conversations with Adam and Trey I know they do as well. We hope this can be as useful a resource for social scientists. Feel free to ask questions about Stata, surveys, R, LaTeX, data cleaning, etc. Someone’s gotta have an answer, right?
I’m taking a preliminary exam in about a month, so I’m very much embedded in the classics of comparative-historical sociology, as well as more recent revisionist works. As such I haven’t been able to get my hands dirty with my usual nerdery.
But I do tend to have a somewhat unorthodox approach to taking notes, partially inspired by my avoidance of anything formatted in Word, partially rooted in my love of emacs (sorry vi fans). I wanted a lightweight syntax (read: plain text) for keeping notes that wouldn’t get outdated quickly and wasn’t just some terrible hack that I threw together but wouldn’t understand down the line.
Enter Markdown. Markdown is a simple syntax that is stored in plain text and converts to valid HTML. It allows you to create ordered and unordered lists, define bold and italic words, headings and subheadings, and other nifty features. What I really like about Markdown is that it makes it very easy to take outlines written in any text editor and turn them into attractive, easy-to-read webpages. It’s flexible enough to work on any computer and quick enough to use in lecture without having to fuss with formatting and idiosyncratic word processor errors.
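To give a feel for what that conversion involves, here is a toy sketch in Python that handles only a tiny slice of the syntax: `#` headings, `-` list items, and `**bold**` / `*italic*` spans. It is not how any real Markdown tool is implemented; `to_html` and `inline` are hypothetical names, the sample notes are made up, and it does no HTML escaping, so treat it strictly as an illustration.

```python
import re


def inline(text):
    """Convert **bold** and *italic* spans; bold first so ** is not read as two *."""
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def to_html(markdown_text):
    """Translate a very small subset of Markdown into HTML, line by line."""
    out = []
    in_list = False
    for line in markdown_text.splitlines():
        line = line.rstrip()
        if line.startswith("- "):
            # Open an unordered list on the first item of a run.
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append("<li>" + inline(line[2:]) + "</li>")
            continue
        if in_list:
            # Any non-item line closes the current list.
            out.append("</ul>")
            in_list = False
        heading = re.match(r"(#{1,6}) (.*)", line)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{inline(heading.group(2))}</h{level}>")
        elif line:
            out.append("<p>" + inline(line) + "</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


if __name__ == "__main__":
    # Made-up lecture notes, purely for illustration.
    notes = "# Notes\n- **Skocpol**: states\n- *Tilly*"
    print(to_html(notes))
```

In practice you would reach for an existing Markdown converter rather than rolling your own; the point is only that the plain-text notes stay readable on their own while mapping cleanly onto HTML.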