R users know it can be finicky in its requirements and opaque in its error messages. The beginning R user is often happy to discover that R-help, a mailing list for dealing with R problems with a large and active user base, has existed since 1997. Then the beginning R user wades into the waters, asks a question, and is promptly torn to shreds for inadequate knowledge of statistics and/or the software, for wanting to do something silly, or for the gravest sin of violating the posting guidelines. The R user slinks away, tail between legs, and attempts to find another source of help. Or so the conventional wisdom goes. Late last year, someone on Twitter (I don’t remember who; let me know if it was you) asked if R-help was getting meaner. I decided to collect some evidence and find out.

Our findings are surprising, but I think I have some simple sociological explanations.

I began by using Scrapy to download all of the emails sent to R-help between April 1997 (the earliest available archive) and December 2012. The December 2012 data are incomplete because the idea for this project, and the scraping, took place long before the analysis. This produced a total of 318,733 emails. As you can imagine, participation on the list seems to have mirrored the increasing adoption of R itself, growing from 102 posts in April 1997, to 2144 posts in April 2007, to 2995 posts in November 2012.
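The scraping itself was done with Scrapy, but for readers who would rather stay in R, a minimal sketch along these lines would fetch one month of the archive. This is an illustration, not the spider I actually used, and it assumes the standard Pipermail layout of the R-help archive at stat.ethz.ch; verify the URL pattern before relying on it.

```r
# Minimal R sketch (not the Scrapy spider actually used): fetch one month of
# the R-help archive, assuming the standard Pipermail layout at stat.ethz.ch.
base <- "https://stat.ethz.ch/pipermail/r-help/"
fname <- "2012-November.txt.gz"      # Pipermail's monthly gzipped text archive
tmp <- tempfile(fileext = ".txt.gz")
download.file(paste0(base, fname), tmp, mode = "wb")
msgs <- readLines(gzfile(tmp))
sum(grepl("^From ", msgs))           # crude message count: mbox-style separators
```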

Number of posts per month between 1997 and 2012.

Sentiment analysis is well known in natural language processing circles as being difficult to do well because of the complexity of language. Sarcasm (“Oh, your code is just great! I’m sure you spent hours on it!”), polysemy, and domain- and context-specific vocabularies all make fully automated sentiment analysis a thorny endeavor. However, hand-tagging 300,000+ emails in a fully supervised fashion is prohibitive, so I took a page from Gary King’s book and used the ReadMe package. Fellow Bad Hessian Alex Hanna has also used this package in his own research on social media in Egypt. ReadMe allows users to code a small portion of documents from a larger corpus into a set of mutually exclusive and exhaustive categories, and then estimates the proportion of the full corpus falling into each category. The approach differs somewhat from sentiment analysis procedures developed in computer science and information retrieval, which are based on the classification of individual documents. King points out in an accompanying paper [PDF] that social scientists are more often interested in proportions and summary statistics (such as average sentiment) and that optimizing for individual document classification can lead to bias in estimating these statistics. I believe that ReadMe is also the basis for the technology that Crimson Hexagon uses.
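For the curious, the ReadMe workflow is compact. The sketch below follows the package’s documented demo as I recall it, with a control file listing each document, its hand-coded category, and a training-set flag sitting alongside the text files in the working directory; treat the exact arguments as assumptions and check them against your installed version.

```r
# Hedged sketch of the ReadMe workflow (Hopkins & King). Assumes a control.txt
# plus one text file per email in the working directory.
library(ReadMe)
undergrad.results <- undergrad(sep = ",")            # ingest control file + docs
undergrad.preprocess <- preprocess(undergrad.results)
readme.results <- readme(undergrad.preprocess)
readme.results$est.CSMF   # estimated category proportions in the unlabeled set
```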

Next, my extremely helpful research assistant (and wife) Erin Gengo helped me code a training set of 1000 messages, mostly randomly sampled from the corpus of emails. I say “mostly randomly sampled” because I also made sure to include a subset of emails from a notoriously prickly R-help contributor to ensure we had some representative mean emails. We decided to code responses to questions rather than questions themselves, as it wasn’t immediately clear what a “mean question” would look like. We each read 500 messages and coded them into the following categories:

  • -2 Negative and unhelpful
  • -1 Negative but helpful
  • 0 No obvious valence, or a request for additional information
  • 1 Positive or helpful
  • 2 Not a response

Responses coded -2 do not answer the question: they simply tell the user to RTFM, point out that the user has violated the posting guidelines, or offer “?lm” as the only text when the question is about lm().

An example of a response coded -1 would be a response along the lines of “I’m not sure why you’d even want to do that, but here’s how you would” or “You’ve violated the posting guidelines, but I’ll help you this once.”

Responses coded 0 have no obvious positive or negative aspects or are follow-ups to questions asking for clarification or additional information. That is, they are not helpful in the sense of answering a question but are also not negative in sentiment.

Responses coded as a 1 are those responses that answer a question positively (or at least non-negatively). You can quibble with the fact that I have two categories for negative responses and only one category for positive responses, but the goal of the analysis was to determine if R-help was getting meaner over time, not to determine the overall levels of different sentiments.

Finally, responses coded as 2 are not responses to questions. This category is needed to make the set mutually exclusive and exhaustive. The vast majority of messages that we coded in this category were questions, although there were a few spam emails as well as job announcements.
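To make the coding concrete, here is a hedged illustration of how these codes might sit in the control file that ReadMe reads. The column layout follows the package documentation as I recall it, and the filenames are invented:

```r
# Hypothetical control.txt for ReadMe: TRUTH carries the -2..2 codes above
# for the hand-coded messages; TRAININGSET flags them. Column layout follows
# the package documentation as I recall it; filenames are invented.
writeLines(c(
  "FILENAME,TRUTH,TRAININGSET",
  "msg0001.txt,-2,1",
  "msg0002.txt,1,1",
  "msg0003.txt,0,0"    # unlabeled message: TRUTH is just a placeholder here
), "control.txt")
```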

Results

Proportions of emails in each category in the test set were estimated on a monthly basis. Much to my surprise, R-help appears to be getting less mean over time! The proportion of “negative and unhelpful” messages has fallen steadily over time, from a high of 0.20 in October of 1997 to a low of 0.015 in January of 2011. The sharpest decline in meanness seems to have occurred between roughly 2003 and 2008.

Proportion of R-help messages estimated as “negative and unhelpful.”

Even though R-help is getting less mean over time, that doesn’t mean it’s getting nicer or more helpful. Let’s take a look at the proportions for each category of message.

Overall estimated proportion of each coding category

These are interesting findings indeed. While the proportion of mean responses, both helpful and not, has fallen over time, so has the proportion of positive and helpful responses. At the same time, the proportion of non-responses and questions has really exploded, comprising more than half of the posts to the list. Simple logic reveals that if more than half of all posts are questions, many questions are going unanswered. I’ll have more thoughts on this in a minute.

R-help decreasing in meanness over time is surprising. In a discussion on Twitter, Drew Conway and I had the same thought: what if the composition of the user base is driving this change; i.e., what if the cadre of prickly users driving the meanness began to be outnumbered by kinder, gentler users? I plotted the “negative and unhelpful” proportions against the Shannon entropy of the user base over time. As the R-help (contributing) user base becomes more diverse, the proportion of negative and unhelpful responses decreases. Note that this is a measure of diversity, not of the overall size of the user base (though that is a consideration). The diversity is estimated on a monthly basis using the number of unique posters that month and the number of emails contributed by each. It certainly makes sense that as the R-help user base expanded into a more diverse population of generalists and newbies, the proportion of negative and unhelpful posts would fall.
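The entropy here is the usual Shannon measure over each month’s poster shares. A sketch (not the original code; the data frame layout is an assumption):

```r
# Hedged sketch: Shannon entropy of monthly poster contributions. Assumes a
# data frame `emails` with columns `month` and `sender` (names illustrative).
shannon <- function(x) {
  p <- table(x) / length(x)    # each poster's share of that month's emails
  -sum(p * log(p))
}
monthly_entropy <- tapply(emails$sender, emails$month, shannon)
```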

Proportion of negative posts vs. Shannon entropy of user base

Let’s return to the puzzle of falling meanness, somewhat stable helpfulness, and growing numbers of unanswered questions. I think Mancur Olson has something to offer here. R-help is essentially a public good. Anyone can free ride by reading the archives or asking a question with minimal effort. It is up to the users to overcome the collective action problem and contribute to this public good. As the size of the group grows, it becomes harder and harder to overcome the collective action problem and continue to produce the public good. Thankfully, as we’ve seen in a number of free and open source software communities, there is usually a core of interested, alert, and resourceful individuals willing to take on these costs and produce. Maintaining the quality of the public good requires individuals willing to sanction rule-breakers. This was accomplished in the early days by chiding people about the content of their posts or by generally being unpleasant enough to keep free-riders away. As the R user base grew, however, it became more and more costly to sanction those diluting the quality of the public good, and the marginal returns to sanctioning decreased. Thus, we get free-riders (question askers) leading to unanswered questions and a decline in sanctioning without a concomitant increase in the quality of the good.

Note: All of the code that I used for this project is available on GitHub. I did not make the messages publicly available because they are all available in the R-help archives, but I did post all of my scraping code. I also did not post the hand-coding of the training set because it is not really that useful without the test set and because the coding contains some non-random sampling of “mean” users.

  • http://twitter.com/emorisse Erich Morisse

    First time I’ve come across the use of Shannon entropy as a comparison, very nice. BTW – I found using conditional entropy more telling than entropy or relative entropy: http://www.howweknowus.com/2013/03/27/communication-method-scale-and-entropy/

  • http://twitter.com/mcdickenson Matt Dickenson

    This is cool. The only thing I don’t understand is why there is an uncertainty band around “number of posts per month” in the first figure.

    • http://twitter.com/treycausey Trey Causey

      Thanks Matt! Two reasons: It’s a loess rather than the raw data, and because I was lazy and used default settings.
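      In ggplot2 terms, it was probably something like the following (from memory, not the exact code; `posts` is a stand-in for the monthly counts):

      ```r
      # Roughly what produced that band: ggplot2's smoother defaults to a
      # loess fit with a standard-error ribbon (se = TRUE).
      library(ggplot2)
      ggplot(posts, aes(month, n)) +   # `posts` = monthly counts; names illustrative
        geom_point() +
        geom_smooth(method = "loess")  # the default ribbon is the band you see
      ```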

  • http://twitter.com/tslumley Thomas Lumley

    Nice. It’s a pity this sort of analysis can’t distinguish posts that are actually helpful (eg, accurate) from those that just look helpful but are wrong or misleading.

    • http://twitter.com/treycausey Trey Causey

      Thanks! Yes, a disadvantage of the method is that it does not produce individual classifications, although building a message-based classifier might be in my near future.

  • EN

    Very neat stuff! It’s also interesting that the average monthly posts are decreasing toward the end of the time sample. Maybe the collective action problems are overwhelming the R-help forum and people have begun to migrate to other help fora like Stackoverflow and CrossValidated? Perhaps these other fora, with their more sophisticated tools to rate contributors and answers, confirm results, etc., are better at overcoming the collective action problems associated with the huge growth a community like R users has experienced in the past several years? It would be cool if it were possible to compare “meanness” in these other places too…

  • http://www.andrewheiss.com/ Andrew

    Ooh. Awesome stuff.

    If I knew the SO schema better, I’d totally dump all the R answers and comments from http://data.stackexchange.com/ and run this same sentiment analysis with that data. Maybe there’s a connected trend in meanness with users migrating from R-help, or maybe, like you said, they’re nicer because of Olsonian dynamics.

  • http://www.facebook.com/solomon.messing Solomon Messing

    Hey Trey, good stuff. Why not remove the “Not a response” category from the analysis?

    • http://twitter.com/treycausey Trey Causey

      This has been suggested a few times on Twitter. I’m thinking about how to do it. The not a response category needs to be there to make the categories exhaustive and mutually exclusive but I could recalculate proportions excluding that category. Not sure how much that would bias my results.
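      Mechanically, the recalculation itself would be trivial; something like this sketch with made-up numbers:

      ```r
      # Sketch with made-up proportions: drop the "not a response" category
      # (coded 2) and renormalize the remaining proportions to sum to one.
      est <- c(`-2` = 0.04, `-1` = 0.06, `0` = 0.15, `1` = 0.25, `2` = 0.50)
      resp <- est[names(est) != "2"]
      resp / sum(resp)
      ```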

  • Laura

    maybe your results are due to the fact that the R-newbies just go somewhere else to ask their questions, where people are actually both nice and helpful…

    • Laura

      that being said, your analysis is really cool!

      • http://twitter.com/treycausey Trey Causey

        Thanks! A lot of people have suggested that Stack Overflow is nicer (which aligns with my own experiences) but that still doesn’t explain the growing proportion of non-responses.

  • JD Long

    Really interesting analysis. The level of pissyness on the R mailing list was a huge reason why I instigated so heavily to bootstrap the R tag on StackOverflow. I managed to do that and also troll Drew about the same time. That may be my lifetime contribution to open source software :)

    • http://twitter.com/treycausey Trey Causey

      Truly an accomplishment! :)

    • welshrarebit

      The pissyness on the R mailing list was the reason I didn’t use R. Now that StackOverflow has an R tag I use R for everything (having abandoned matlab) and ignore the mailing list completely.

  • http://twitter.com/HaphazardSoc Neal Caren

    This is cool. Very well done. One additional way to slice the time variable would be to look at cohort effects by making the X axis “Date poster first observed posting” Or just split folks by whether or not they were around before 2005 and then compute the proportion of comments over the last five years that were unhelpful for each group.

  • Ben Bolker

    It would be interesting to do a multi-level analysis as well (i.e. try to get average trends within and between respondents). I’m not sure I agree with it, but there is a point seriously raised by “negative unhelpful” respondents that what they are doing is sanctioning bad question-askers (unlike on StackOverflow, there is no more neutral way to ‘downvote’ bad questions …)

  • http://yihui.name/ Yihui Xie

    When I grade homework as a TA, I deduct 0.5pt for each plot with labels so tiny that I cannot read them. You lose 2 points for this post. I do not know how you managed to make all the axis labels consistently small. They do not seem to be from png().

    But the analysis is brilliant anyway! :)

    • http://treycausey.com Trey Causey

      You can click on the images for larger versions of the plots. I am debating between coding this reply as a -2 or a -1. :)
