If you follow me on Twitter, you know that I’m a big fan of RuPaul’s Drag Race. The transformation, the glamour, the sheer eleganza extravaganza is something my life needs to interrupt the monotony of grad school. I was able to catch up on nearly four seasons in a little less than a month, and I’ve been watching the current (fifth) season religiously every Monday at Plan B, the gay bar across from my house.

I don’t know if this occurs with other reality shows (this is the first I’ve been taken with), but part of the appeal is predicting who will come out as the winner. A drag queen we spoke with at Plan B suggested that the length of time each queen appears in the season preview is an indicator, while Homoviper’s “index” is largely based on a more qualitative, hermeneutic analysis. I figured, hey, we could probably build a statistical model to see which factors matter most in winning the competition.

If you’ve never seen the show, here’s the deal: anywhere from 9 to 13 contestants start the season, selected from a large pool of aspiring drag queens. Every episode, the queens participate in two challenges: a mini-challenge and a main challenge. The mini-challenge is usually some small-scale task (dressing up a doll in drag, lipsyncing to one of RuPaul’s songs). The winner(s) of the mini-challenge gets some sort of leg up in the main challenge. The main challenge usually has some performative and/or creative element (dancing, singing, making an outlandish costume out of mundane items), often combined with a runway performance. At the end of the show, one queen wins, while two are “up for elimination” and forced to “lipsync for their lives.” In dramatic fashion, the queens attempt to outdo one another by lipsyncing to a song they’ve prepared for the week. RuPaul then allows one queen to stay and forces one to “sashay away” (although note that there has been one “double elimination” in the current season when both queens performed poorly, and sometimes Ru lets both queens stay). The judges also note who came close to winning and who came close to being forced to lipsync.

The (Super)model (of the world)

Taking some inspiration from Brett Keller’s Hunger Games post, I figured that survival analysis (also called event history analysis) would be best suited for this task. Survival analysis models which factors most strongly affect the duration of “survival.” I’d never really delved into the world of survival analysis, so I spent the last month reading up with Box-Steffensmeier and Jones’s excellent book. Survival coefficients can be read as affecting the “hazard rate,” the rate at which units “fail,” usually denoted h(t) (hence the title of this post). As is most common nowadays, I used a Cox proportional hazards model: because it is semiparametric, you don’t have to make assumptions about the shape of the baseline hazard (unlike parametric models such as the Weibull or Gompertz).
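For reference, here is the textbook form of the model (standard notation, nothing specific to this dataset): the baseline hazard is left unspecified, and the covariates scale it multiplicatively.

```latex
h(t \mid \mathbf{x}_i) = h_0(t)\,\exp\left(\beta_1 x_{i1} + \cdots + \beta_k x_{ik}\right)
```

A positive coefficient raises the hazard (faster elimination), and exp(β_j) is the hazard ratio for a one-unit increase in the j-th covariate, which is how the exp(coef) columns in the output below should be read.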

As per standard practice for survival analysis, the dependent variable is “duration,” the length of time that the unit stays alive. For Drag Race, this is the number of episodes that the queen stays on the show. I’ve also chosen a set of covariates that may have predictive power. These covariates fall into two classes: time-invariant and time-variant. There may be others, but I’ll discuss those at the end of this article.

Time-invariant covariates

These are variables that we can measure before the season even starts. If they turn out to be significant, they might give us some insight into how the season will go before it even airs.

  1. Age – The queens who’ve participated on RuPaul’s Drag Race have been anywhere between 21 and 39 years old. Like any performance art, the quality of one’s drag usually gets better with age. However, you could also argue that being a better drag queen is highly contingent on having a youthful look and being able to muster a young demeanor. There’s been some animosity between younger and older queens (e.g. 37-year-old Coco Montrese calling the 21-year-old Serena ChaCha a “little boy”), but it’s not clear whether age has any impact on actually winning the competition.
  2. Plus size – Every season has had at least one “plus size” queen. These “big girls” have generally not gone far in the competition. The furthest any one has gotten has been Latrice Royale, who placed 4th in season 4. There’s also a healthy amount of “fat-phobia” from other queens. I’m throwing this in to see if there is any sort of plus size penalty.
  3. Puerto Rican – Just as with the big girls, each season has had at least one Puerto Rican queen. I chose to focus on Puerto Rican as a variable instead of race/ethnicity more generally because the Puerto Rican queens often struggle with language, particularly in challenges that demand linguistic nuance and humor. That said, while none of these queens has ever taken the crown, they often get very far: Nina Flowers placed 2nd in season 1 and Alexis Mateo placed 3rd in season 3.

Time-variant covariates

These variables change as the season goes on. They serve as indicators of the queen’s performance throughout the competition.

  1. Wins – Winning a main challenge is a pretty big deal. It generally earns the queens respect amongst the judges and the other queens. It makes a bit of sense that a queen with a consistent track record of winning challenges will generally stay on the show longer.
  2. Lipsyncs – “Lipsyncing for your life” is a generally harrowing affair. It is the last chance to impress RuPaul, so the queens better make it good. Having to lipsync is the opposite of a win — it looks bad to the judges and to the other queens, and shakes one’s own confidence.
  3. Highs and lows – Apart from actually winning challenges or having to lipsync, the judges will also note when a queen was near the top or the bottom of the pack but (un)fortunately was not good or bad enough to merit any more notice.

I should say a word about the actual “lipsync” variable. I’ve been going back and forth on whether to calculate that variable as the raw number of lipsyncs (excluding the final one, which decides the season winner) or to count only the lipsyncs that do not result in the contestant’s elimination. In the models below, I try both.
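The analysis itself is in R, but to make the two definitions concrete, here’s a sketch in Python of how the counts differ for a single queen’s episode history (the outcome codes and function name are mine, not the spreadsheet’s):

```python
def lipsync_counts(results):
    """Count a queen's lipsyncs two ways from her per-episode outcomes.

    Hypothetical outcome codes: 'WIN', 'HIGH', 'SAFE', 'LOW',
    'BTM' (lipsynced and stayed), 'ELIM' (lipsynced and sashayed away).
    """
    raw = sum(1 for r in results if r in ("BTM", "ELIM"))  # every lipsync
    survived = sum(1 for r in results if r == "BTM")       # excludes the fatal one
    return raw, survived

# A queen who survived one lipsync and was eliminated on her second:
print(lipsync_counts(["SAFE", "BTM", "HIGH", "ELIM"]))  # (2, 1)
```

The first count treats the eliminating lipsync like any other; the second is the “lipsyncs survived” variant that gets swapped in later.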

I compiled the data from the detailed Wikipedia charts for each season of the show and put them into a public Google doc. You can also access the Google doc as a comma-separated values (CSV) file for easy data wrangling. There’s one row per contestant per episode.

The main challenge (now don’t f*$k it up!)

First thing is to load the data. Thanks to the RCurl package, I can pull this straight from the Google Doc into R. I’ll also load the survival package and ggplot2 for some plots. I’m training the model on the first four seasons and saving the fifth season for out-of-sample prediction. (To make this post less noisy and more pretty, I’m putting most of the R code in this gist rather than posting it inline.)

Before estimating the Cox model, I want to start by looking at a few univariate analyses of potentially important predictors. The y-axis on these graphs is the survival rate, while the x-axis represents time measured in episodes. The closer the different strata are to each other, the more similar their effects on the survival rate. Because I like pretty ggplot2 graphs, I rigged up a function to convert the survfit object into something ggplot2 could use, with a heavy amount of help from this pdf from Ramon Saccilotto of the Basel Biometric Section.
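These curves are Kaplan-Meier (product-limit) estimates. For readers who haven’t met the estimator, here is a minimal sketch in plain Python (toy data, not the actual show data): at each episode where an elimination occurs, the survival estimate is multiplied by the fraction of at-risk queens who survive it.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function S(t).

    durations: episode in which each queen exited (or was censored)
    events: 1 if eliminated, 0 if censored (e.g. the season winner)
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)                       # n_t
        failed = sum(1 for d, e in zip(durations, events) if d == t and e)  # d_t
        surv *= 1 - failed / at_risk
        curve.append((t, surv))
    return curve

# Five toy queens: eliminated in episodes 2, 4, 4, and 6; one censored at 8.
# S(t) steps down at episodes 2, 4, and 6 (to 0.8, 0.4, 0.2, up to rounding).
print(kaplan_meier([2, 4, 4, 6, 8], [1, 1, 1, 1, 0]))
```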


First, let’s look at age. To make the graph a little cleaner, I’ve put age into three strata: less than 25 (coded 1), 25 through 30 (coded 2), and greater than 30 (coded 3). Each of these strata has roughly the same number of queens in it.

Kaplan-Meier age

We actually don’t see any major differences between the age strata. Their survival rates seem to decline together, with a slight advantage gained by the young’ins.


Turning to the time-variant covariates, let’s look first at the number of main challenges each queen won, and then the number of times that she lipsynced.


This image is pretty stark. If the queen has won four challenges, she is guaranteed to win (to date, only the great Sharon Needles has accomplished such a feat). Winning three challenges will take you a great deal of the way, but this all changes very late in the game. This graph shows the tense competition of season three, when all three finalists (Raja, Manila Luzon, and Alexis Mateo) had three wins under their… um, petticoats? by the final episodes. Below that, having two wins bodes better, but not much better, than having zero or one.


The last graph we look at is the number of times each queen has had to lipsync.

This one is also rather dramatic. With no lipsyncs, a queen’s survival chances don’t drop very dramatically. To date, there are only two queens, Nina Flowers and Tyra Sanchez, who have managed never to lipsync. Survival rates drop off precipitously after that, however. Contestants can almost guarantee that they will have to sashay away if it’s their third time in the bottom two.

Now for the models. I’m going to start by considering only the time-invariant covariates. If some stable pattern emerges there, then it may be possible to get some purchase on who the next drag superstar will be before the season begins.

coxph(formula = t.surv ~ (Age + PlusSize + PuertoRico), data = df)
  n= 314, number of events= 41 
               coef exp(coef) se(coef)      z Pr(>|z|)  
Age        -0.03656   0.96410  0.03262 -1.121   0.2623  
PlusSize    0.89174   2.43938  0.44215  2.017   0.0437 *
PuertoRico -0.29471   0.74475  0.44815 -0.658   0.5108  
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
           exp(coef) exp(-coef) lower .95 upper .95
Age           0.9641     1.0372    0.9044     1.028
PlusSize      2.4394     0.4099    1.0255     5.803
PuertoRico    0.7448     1.3427    0.3094     1.793
Concordance= 0.592  (se = 0.058 )
Rsquare= 0.014   (max possible= 0.542 )
Likelihood ratio test= 4.48  on 3 df,   p=0.2142
Wald test            = 4.79  on 3 df,   p=0.1876
Score (logrank) test = 4.96  on 3 df,   p=0.1746

This model does not hold up very well on the whole. The LR and Wald tests don’t let us reject the null hypothesis that all of the coefficients are equal to zero. The only variable that turns up as statistically significant is plus size: being a big girl increases a queen’s hazard rate by a factor of 2.44 on average, a 144% increase. The effects for age and the Puerto Rican variable are weak. Age goes in the direction we expected, but being Puerto Rican does not: it actually reduces the hazard rate.
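A quick note on reading those numbers: the coefficients live on the log-hazard scale, so exp(coef) is the multiplicative effect on the hazard rate and (exp(coef) − 1) × 100 is the percent change. A one-liner in Python, using the PlusSize coefficient from the output above:

```python
import math

def hazard_ratio(coef):
    """Convert a Cox log-hazard coefficient to (hazard ratio, % change)."""
    hr = math.exp(coef)
    return hr, (hr - 1) * 100

hr, pct = hazard_ratio(0.89174)   # PlusSize coefficient
print(round(hr, 2), round(pct))   # 2.44 144
```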

So that model isn’t terribly helpful. Let’s start to incorporate the time-variant covariates. First, I keep all the time-invariant covariates and add the time-variant ones. Note that, to account for temporal dependence between observations, I’ve included a cluster term, which produces robust standard errors, as Box-Steffensmeier and Jones advise.

coxph(formula = t.surv ~ (Age + PlusSize + PuertoRico + Wins + 
    Highs + Lows + Lipsyncs) + cluster(ID), data = df)
  n= 314, number of events= 41 
                coef exp(coef)  se(coef) robust se      z Pr(>|z|)    
Age        -0.005792  0.994225  0.044982  0.037835 -0.153    0.878    
PlusSize    0.186822  1.205413  0.487592  0.542285  0.345    0.730    
PuertoRico -0.465269  0.627966  0.615109  0.695995 -0.668    0.504    
Wins       -0.210350  0.810300  0.328441  0.369647 -0.569    0.569    
Highs       0.269315  1.309067  0.290462  0.339070  0.794    0.427    
Lows        0.100976  1.106250  0.456684  0.548226  0.184    0.854    
Lipsyncs    1.436659  4.206618  0.302593  0.319964  4.490 7.12e-06 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
           exp(coef) exp(-coef) lower .95 upper .95
Age           0.9942     1.0058    0.9232     1.071
PlusSize      1.2054     0.8296    0.4164     3.489
PuertoRico    0.6280     1.5924    0.1605     2.457
Wins          0.8103     1.2341    0.3926     1.672
Highs         1.3091     0.7639    0.6735     2.544
Lows          1.1062     0.9040    0.3777     3.240
Lipsyncs      4.2066     0.2377    2.2469     7.876
Concordance= 0.882  (se = 0.058 )
Rsquare= 0.17   (max possible= 0.542 )
Likelihood ratio test= 58.43  on 7 df,   p=3.1e-10
Wald test            = 104.6  on 7 df,   p=0
Score (logrank) test = 72.23  on 7 df,   p=5.226e-13,   Robust = 33.47  p=2.167e-05
  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

This model does much better than the first. The LR and Wald tests allow us to reject the null that all coefficients are zero. Lipsyncs come out as the most obvious predictor: the effect is large and highly statistically significant. On average, an additional lipsync increases a queen’s hazard rate by a factor of 4.21, a 321% increase! Lipsyncing carries a huge penalty, hunty. It doesn’t bode well for winning the crown.

But maybe we are overfitting. Lipsyncing is the only thing that can knock a queen out of the contest (except when the queen is Willam Belli, who was disqualified for breaching the show’s contract). It may be so closely correlated with the occurrence of the event that we’re losing more nuanced trends in the data. To address this, I calculated a variable counting the number of lipsyncs excluding those that knocked the queen out of the contest, and replaced the existing lipsync variable with it.

coxph(formula = t.surv ~ (Age + PlusSize + PuertoRico + Wins + 
    Highs + Lows + LipsyncWithoutOut) + cluster(ID), data = df)
  n= 314, number of events= 41 
                      coef exp(coef) se(coef) robust se      z Pr(>|z|)    
Age               -0.04716   0.95393  0.04104   0.04040 -1.167 0.243043    
PlusSize           0.10505   1.11076  0.49714   0.56345  0.186 0.852102    
PuertoRico        -0.07008   0.93232  0.48455   0.39676 -0.177 0.859791    
Wins              -1.13417   0.32169  0.28403   0.22578 -5.023 5.08e-07 ***
Highs             -0.63124   0.53193  0.21924   0.18725 -3.371 0.000749 ***
Lows              -1.17747   0.30806  0.37894   0.36645 -3.213 0.001313 ** 
LipsyncWithoutOut -0.07052   0.93191  0.26342   0.26588 -0.265 0.790835    
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
                  exp(coef) exp(-coef) lower .95 upper .95
Age                  0.9539     1.0483    0.8813    1.0325
PlusSize             1.1108     0.9003    0.3681    3.3514
PuertoRico           0.9323     1.0726    0.4284    2.0290
Wins                 0.3217     3.1086    0.2067    0.5008
Highs                0.5319     1.8799    0.3685    0.7678
Lows                 0.3081     3.2462    0.1502    0.6318
LipsyncWithoutOut    0.9319     1.0731    0.5534    1.5692
Concordance= 0.713  (se = 0.058 )
Rsquare= 0.093   (max possible= 0.542 )
Likelihood ratio test= 30.82  on 7 df,   p=6.712e-05
Wald test            = 41.27  on 7 df,   p=7.188e-07
Score (logrank) test = 32.46  on 7 df,   p=3.333e-05,   Robust = 24.62  p=0.0008875
  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

With this model, we lose some explained variance (R-squared goes from 0.17 to 0.093), but three variables (wins, highs, and lows) gain statistical significance. Wins and highs go in the expected direction: each additional win or high decreases the hazard rate by a factor of 3.11 or 1.88, respectively. But, oddly enough, each additional low also decreases the hazard rate, by a factor of 3.25! That doesn’t really seem right. What may be happening is that queens who accumulate “low” judging are still safe and aren’t getting into any real trouble.

So it may be that the original lipsync variable makes for a better model. A little toying with some k-fold-like validation, in which I train the model on the first three seasons and predict the fourth season (not shown; I’m saving it for next week), suggests that it does.

The future of drag!

This is a rough first attempt at quantifying the sheer sickening amount of glamour that RuPaul provides. There are three outstanding topics to address — prediction, repeated events, and where to get more data.

Using the predict.coxph() function included in the survival package, we can compute the relative risks of the remaining queens in season 5 (this StackOverflow answer has a nice explanation of what the function does; thanks to Trey for pointing it out). I’m using the second model, with the full lipsync count.

            Name   relRisk se.relRisk
1 Alyssa Edwards 0.2931656  0.1887553
2  Jinkx Monsoon 0.6226868  0.3119846
3 Roxxxy Andrews 0.6691839  0.6217982
4    Ivy Winters 0.6809134  0.6087346
5         Alaska 0.7060275  0.5789992
6          Detox 1.1475463  0.5733333
7  Coco Montrese 5.6221514  1.4578057

The model has Coco Montrese down, which makes sense: one more lipsync puts her at three, meaning she has nearly no chance of staying in the race. Detox is the only other contestant who has lipsynced this season, so she’s not looking great either unless she pulls out a win soon. Judging from the standard errors, Roxxxy, Ivy, and Alaska may bounce … around a bit. Alyssa and Jinkx look unusually strong. Homoviper’s latest projections look a bit different: Ivy, Jinkx, Alaska, Alyssa, Roxxxy, Detox, and finally Coco. In next week’s post, I’ll focus more on exploring prediction.
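For the curious, the “relative risk” that predict.coxph(type = "risk") reports is each queen’s hazard relative to a sample-average queen: exp of the dot product of the coefficients with her mean-centered covariates. A sketch in Python (the coefficient and covariate values below are made up for illustration, not taken from the fitted model):

```python
import math

def relative_risk(coefs, x, x_mean):
    """exp(beta . (x - xbar)): hazard relative to a sample-average unit,
    mirroring what predict.coxph(type = "risk") reports."""
    lp = sum(b * (xi - mi) for b, xi, mi in zip(coefs, x, x_mean))
    return math.exp(lp)

# Toy example with a single covariate (lipsyncs), coefficient ~1.44:
avg_queen = relative_risk([1.44], [0.5], [0.5])    # at the sample mean -> 1.0
risky_queen = relative_risk([1.44], [2.0], [0.5])  # two lipsyncs -> far riskier
print(avg_queen, risky_queen > avg_queen)  # 1.0 True
```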

Another concern is that of repeated events. Given that in each of the last two seasons Ru has brought back a queen who had previously been eliminated, there’s no reason to suspect that she won’t do it again. So far I’ve dealt with this by counting only a queen’s final exit, but that may be biasing the model in unexpected ways. One obvious alternative is to treat this as a case of repeated events. There are a number of methods for modeling repeated events, but I’m not terribly familiar with them.

Lastly, there are other sources of data that we could be pulling from. I mentioned the potential of preview appearances, which may indicate a number of things: the amount of airtime that each queen gets, how catty she is, and so on (though the editors at least need enough footage of each queen to show any of that). I did a rough count from this season’s trailer, and it roughly correlates with at least the top eight. A few other time-invariant covariates could include whether a queen is a pageant girl and her actual time in the drag industry (as opposed to age). Barring a solid measure of fierceness, this is what we’ve got to work with.

Anyhow, until next time — if you can’t love yourself, how the hell you gonna love somebody else. Can I get an amen?!

(Thanks a bunch to Martha and Trey for help putting together this post! The former claims 3% authorship on it.)

  • CCC

    Awesome post.

    All I have to contribute is OUT Mag’s very good history of World of Wonder (which produces DRAG RACE, and also “discovered” RuPaul):

  • SmallishButt

    There is something magical about survival analyses with N=314.

  • Interesting! But there are some additional Puerto Rican girls not accounted for in your data:
    * Rebecca Glasscock (born in PR, came to the US as adolescent, identified as Puerto Rican on show)
    * Carmen Carrera (Puerto Rican from NJ)
    * Roxxxy Andrews (Puerto Rican from Florida; she speaks in Spanish to Lineysha on the Snatch Game episode. I had to ask Michelle Visage on Twitter to figure this one out.)
    * Monica Beverly Hillz (Puerto Rican from Chicago; her mother speaks to her in Spanish on Untucked)
    I take it your definition of Puerto Rican is a person from Puerto Rico who came from the island either for the show or shortly before (like Nina Flowers, who was living in Denver Colorado) and who has linguistic challenges. In my work (Queer Ricans: Cultures and Sexualities in the Diaspora) I have advocated for a more expansive understanding of the category of Puerto Rican. I don’t know how this framework for Puerto Ricanness would affect your analysis.

    • Thanks Larry. Yeah, this was definitely a sticky topic to deal with in quantitative terms. The linguistic difficulty seems to be the thing remarked upon mostly by the judges so I included it in the analysis. There are surely myriad different ways that it is constructed and reproduced by both the judges and the other queens.

  • i landed here by coincidence and i understood maybe 1% of everything here, but this is still all kinds of awesome.

    but i wonder. doesnt every season with a “average sized” winner increase the chance of a plus sized queen winning the following season? doesnt every season with a “glamorous” winner increase the chance of a “comedy queen” winning the following season? assuming that rupaul and her producers dont want to come across as biased against certain body types/forms of drag/ethnicities etc., and assuming that the game is at least partly fixed, and not an entirely fair competition.

    can this kind of thing be factored in?

    • I see what you mean, and it’s a good question. I am assuming that the seasons are independent of each other — what happens in one season doesn’t affect what happens in the next season. I’m also assuming that the game isn’t “rigged” even though there’s a good chance that Ru and her crew want particular people to win at the outset.

      One way to factor in prior seasons would be to have some lagged “season winner” variable that corresponds to attributes of past winners and weighs more or less on current contestants. There have been no plus-size or Puerto Rican winners, though, so it might have to be some other, eccentric variable (goth for Ms. Needles, fashionista for Raja). But I feel like the current contestants don’t share enough of those attributes to merit constructing such variables (no “goth” queen exists in this season, although I think Alaska incurs a penalty because she is associated with Sharon Needles; you could call Ivy a fashionista, but in a different way than Raja, etc.).

  • Sweet! Do you plan on updating this with Season 5’s new data each week to see if you’re right?

    • Hopefully! I wanted to do a post next week that also incorporated some out-of-sample validation to see how well the model predicts one past season from other ones.

      • fantastic work. We’ve tried running the predictions after episode 7 (in which Coco won and Alaska and Jinkx each got a High). It seems that Detox has suddenly floated to the top position right before Ivy and Jinkx:
        1 Detox
        2 Ivy Winters
        3 Jinkx Monsoon
        4 Alaska
        5 Alyssa Edwards
        6 Roxxxy Andrews
        7 Coco Montrese
        I’m perplexed at this result as she has fewer highs and wins than Ivy and Jinkx (my two favourite queens this season). What could explain this?

        • That’s right. The model weighs lipsyncs most heavily, both in terms of effect size and statistical significance. Highs and wins actually have a relatively negligible effect, with larger standard errors.

          • Oh well I was thinking since Detox has already lipsync’ed once it should be to her disadvantage. So I checked the google spreadsheet and there seems to be an error in the data on line 343. Detox has lipsync’ed once so shouldn’t there be a 1 there in the Lipsyncs column? (or did i get it all wrong!)

          • Good catch! That changes the predictions to make much more sense. Ivy and Jinkx are in the lead now —

            Episode Name relRisk se.relRisk
            1 Ivy Winters 0.7532603 1.0817218
            2 Jinkx Monsoon 0.8151390 0.5921149
            3 Alaska 0.9242376 0.9452547
            4 Detox 1.1475463 0.5733333
            5 Alyssa Edwards 1.2332356 0.3091683
            6 Roxxxy Andrews 2.8150008 1.7322992
            7 Coco Montrese 4.5556308 0.8606941

  • Snuffles

    You’re the Nate Silver of RuPaul’s Drag Race!

    • Gaining this title was basically my goal all along.

      But really, I could only dream of having that level of precision.

      • Oscar Mario Soto Jr.

        Sooooo when will you renew models for Season 8? ^_^ Also, did the one All Stars season reflect these same models, or was everything wonky because of losing two queens a lipsync?

        • I stopped watching a while ago. Also, I didn’t use All-Stars data.

  • Pingback: What do the Hunger Games, RuPaul, and statistics have in common? | Chris Blattman

  • HomoViper

    Love this. Thanks for the shout out, hunty. xx

  • Brett

    Cox model, hehehe. (Good stuff!)

    • Took long enough for someone to make that joke!

  • viqule

    Very fun models. I noticed that your “Lows” indicator is one of the weakest terms for significance, especially in the earlier model. One possible explanation is that a “Low” on the wiki often doesn’t mean that the queen was remotely close to being sent to the bottom 2; instead it’s Ru giving her a slap on the wrist or telling her to bring out more fire instead of playing it safe. Examples of this include Alaska’s instances in the bottom and Jujubee’s “Low” rank in Snatch Game (although she was clearly safe from the judges’ critiques). Furthermore, once we get past Snatch Game, there is at times some ambiguity about who is high and who is low, and the wiki often goes by what viewers perceive from the edit. That in itself introduces measurement bias. I would suggest just removing the “Low” indicator and seeing whether it improves the model’s performance.

    • That’s a very good point. I ran the model again without Lows and it didn’t really change much. Coefficients were rather stable and didn’t change R-squared much.

  • Nilay Engin

    Great stuff; however, I had to mention this little detail: it’s actually “chassé away”.
    Chassé: a ballet movement consisting of one or more quick gliding steps with the same foot always leading.
    It’s French, so it’s written like this.

    • LinguisticsBA

      “sashay” is an English word describing an exaggerated motion of the hips while walking. Men must exaggerate this motion to walk more like women naturally do, hence the references in the world of female impersonation.

  • you get an A+ … see me after class and, um, wear the school’s uniform 😉

  • Pingback: How to get to the Project Runway finale | Let's Talk Data

  • You totally need to submit this to AJS…

  • Pingback: In Praise of Fun Projects | Dart-Throwing Chimp

  • Curiously Statistical

    Any correlation with screen time in the “confessional booth”? it would make sense that those eliminated early would get more booth time as a “consolation prize” since the final contestants would get the most actual exposure over time.

    • Yeah, that’s more work than I’m willing to put into this.

  • Pingback: Brett Keller – global health & development » Fun projects are fun

  • Pingback: Briefly | Stats Chat

  • Laura

    I love RuPaul, but ever hear of immortal time bias?

    • Now I have. I’m not sure how it applies to this model, though? Please explain.

  • Pingback: Screwing Up the National Anthem | Miles Ott

  • Jay

    I’m a statistics student and a big Drag Race fan, this has absolutely made my day!

  • Dejahthoris

    Put Max into the equation and see how it skews. He won the most challenges up until the day he was eliminated.

  • Harry Millward

    Alaska, Courtney Act, and Bianca Del Rio also never had to lipsync for their lives…

  • Pingback: The Library is Open |

  • Pingback: The Library is Open: Season 9 |

  • Pingback: Ggplot2 is 10 years old: The program that brought data visualization to the masses — Quartz