R users know it can be finicky in its requirements and opaque in its error messages. The beginning R user often then happily discovers that a mailing list for dealing with R problems with a large and active user base, R-help, has existed since 1997. Then, the beginning R user wades into the waters, asks a question, and is promptly torn to shreds for inadequate knowledge of statistics and/or the software, for wanting to do something silly, or for the gravest sin of violating the posting guidelines. The R user slinks away, tail between legs, and attempts to find another source of help. Or so the conventional wisdom goes. Late last year, someone on Twitter (I don’t remember who, let me know if it was you) asked if R-help was getting meaner. I decided to collect some evidence and find out.

Our findings are surprising, but I think I have some simple sociological explanations.
Continue reading

We’re in the Final Four now, the actual final four that matters (sorry sports forecasters).

Last week, Coco got the chop, which made sense statistically (she had a huge relative risk AND had been the first queen to have had to lipsync four times) and from a narrative standpoint — Alyssa got eliminated the week before so they didn’t really need to keep Coco around to continue all that drama.

So now the biggest question is who is getting kicked off this week and will leave us with our top three? Before I get to my predictions, I want to point readers to Dilettwat’s analysis, which, while uses no regressions, is still chock full of some interesting statistics about the top four and makes some predictions about who needs to do what in this episode to win.

For my own analysis, it’s looking very close here. Here are the numbers.

1  Jinkx Monsoon  1.2170057  1.5580338
2         Alaska  1.2509423  2.0045457
3 Roxxxy Andrews  3.2466063  3.3926072
4          Detox  5.5899580  1.5527694

Jinkx and Alaska are neck-and-neck in this model, and confidence intervals make pairwise comparisons rather hard to make here. But this order is the same as Homoviper’s Index.

Just to get a sense of how close this is, here is a plot that tracks the relative risks across the last few weeks.

predict-change-20130415

Last week, Coco’s relative risk was incredibly high, the highest it has been. This week, Detox has the only relative risk that is indistinguishable from zero, which makes me think she’s about to go. Which makes sense — she’s the only one in the group who has lipsynced more than once. So all I’m willing to say more confidently is that Detox goes home tonight.

On a totally different note, Jujubee came to Madison on Thursday and I got a chance to tell her about my forecasting efforts when we were taking some pictures…

Then I saw Nate Silver at the Midwest Political Science Association conference in Chicago. Unfortunately, there was not an opportunity for a Kate Silver/Nate Silver photo op. Maybe next time.

I’m scribbling this furiously because I had a busy weekend, but stay tuned for next week’s extra special analysis.

Last week, Alyssa got the boot and Jinkx kept her place. And I totally called it with my first model that accounted for the proportional hazards assumption. I think the model is having a little more success as the season plods on.

Before I get to the predictions for episode 10, there’s two really interesting prospects that may either give this model some more predictive power or become very interesting projects in their own right.

Continue reading

Last week, Alaska took it home with her dangerous performance, while Ivy Winters was sent home after going up against Alyssa Edwards. This is sad on many fronts. First, I love me some Ivy Winters. Second, Jinkx had revealed that she had a crush on Ivy, and the relationship that may have flourished between the two would have been too cute. But lastly, and possibly the worst, both of my models from last week had Ivy on top. Ugh.

What went wrong? Well, this certainly wasn’t Ivy’s challenge. But it’s high time that I started interrogating the models a little further.
Continue reading

Wow, last week’s Drag Race post made the rounds in the stats and Drag Race circles. It was cross-posted to Jezebel and has been getting some pretty high-profile links. A little birdy told me that Ms. Ru herself has read it. I think I can die a happy man knowing that RuPaul has visited Bad Hessian.

Anyhow, last week I tried to count Coco out. I was reading her like the latest AJS. The library is open. But her response to me was simple — girl, please:

(Also this happened. Wig under a wig.)


(both of these gifs courtesy of f%^@yeahdragrace)

Can that win safeguard Coco from getting eliminated? Let’s look at the numbers after the jump.
Continue reading

lipsync_header-ed

If you follow me on Twitter, you know that I’m a big fan of RuPaul’s Drag Race. The transformation, the glamour, the sheer eleganza extravanga is something my life needs to interrupt the monotony of grad school. I was able to catch up on nearly four seasons in a little less than a month, and I’ve been watching the current (fifth) season religiously every Monday at Plan B, the gay bar across from my house.

I don’t know if this occurs with other reality shows (this is the first I’ve been taken with), but there is some element of prediction involved in knowing who will come out as the winner. A drag queen we spoke with at Plan B suggested that the length of time each queen appears in the season preview is an indicator, while Homoviper’s “index” is largely based on a more qualitative, hermeneutic analysis. I figured, hey, we could probably build a statistical model to know which factors are the most determinative in winning the competition.
Continue reading

Over the weekend I led a workshop on basic Twitter processing using Hadoop Streaming (or at least simulating Hadoop Streaming). I created three modules for it.

The first is an introduction of MapReduce that calculates word counts for text. The second is a (very) basic sentiment analysis of political tweets, and the last one is a network analysis of political tweets.

All the code for these workshops is on the site. What other kinds of analysis can/should be done with Twitter data?

Learning to use software always entails some startup cost. I recently had an exchange with one of my colleagues who is relatively new to social network analysis. He asked about my thoughts on a certain network analysis program and mentioned that “it’s easy to get lost with so many [network analysis] programs out there.” His impression is completely understandable. Social network analysis has become immensely popular in recent years. The rise in its popularity has especially been witnessed among gifted people capable of writing good software. Indeed, one Wikipedia list broadly describes about 70 social network analysis programs. Each of these programs have their strengths and weaknesses with regards to its contributions to the field. Given the wealth of options, which programs are worth the time investment to learn?

If you’re new to network analysis then I’d highly recommend learning the packages in R, perhaps supplemented by Pajek and/or Python packages. Here’s why:

Continue reading

I recently discovered Gary Weissman’s excellent post on Grey’s Anatomy Network of Sexual Relations and I felt inspired.  For those who haven’t heard of the television show before, Grey’s Anatomy is a widely popular, award-winning prime-time medical drama airing on ABC which has received no shortage of critical acclaim.  Meeting conventional medical drama expectations, the show quite regularly features members of its attractive cast “hooking up.”  Or so I am told.  In an effort to teach medical students some basic social network lessons, Weissman produced a network data set on the show’s sexual contacts between characters.  Though I’m not particularly fond of the show and both sexual and fictional networks lie outside my research interests, Weissman’s post served as a remarkable demonstration of network analysis for pedagogical purposes.

Continue reading

Working with right-to-left languages like Arabic in R can be a bit of a headache, especially when mixed with left-to-right languages (like English). Since my research involves a great deal of text analysis of Arabic news articles, I find myself with a lot of headaches. Most text analysis methods require some kind of normalization before diving into the actual analyses. Normalization includes things like removing punctuation, converting words to lowercase, stripping numbers out, and so on. This is essential for any kind of frequency-based analysis so that words such as don’t, Don’t, and dont are not considered unique words. After all, when dealing with human-generated text, typos and differences in presentation are bound to occur. Often times, normalizing also includes stemming words so that words such as think, thinking, and thinks are all stemmed to “think” as they all represent (basically) the same concept.

Continue reading