Pablo Barberá, Dan Cervone, and I prepared a short course at New York University on Data Science and Social Science, sponsored by several institutes at NYU. The course was intended as an introduction to R and basic data science tasks, including data visualization, social network analysis, textual analysis, web scraping, and APIs. It was geared towards social scientists with little experience in R but some experience with other statistical packages.

You can download and tinker around with the materials on GitHub.


If you’re looking for a good outlet for some computationally-oriented social science work, check out the International Conference on Computational Social Science (IC^2S^2). (Disclaimer: I am on the program committee for this conference.) Last year, as the Computational Social Science Summit, the conference attracted 200 participants and had a very vibrant set of panels.

Abstracts are due on January 31, 2016. Hoping to see many of you there!


The graph above recently appeared as part of Scott Walker’s Twitter feed. Presumably, the idea is to suggest that under Walker’s leadership, Wisconsin has done better than the country as a whole when it comes to unemployment, though an alternative version of the ad makes it somewhat more personal, using the same basic figures to suggest that Walker—a Republican presidential candidate—is outperforming sitting Democratic president Barack Obama. In these ads, the Walker campaign repeatedly highlights the fact that the unemployment rate in Wisconsin is lower than the national average. Note, however, that the unemployment rate in Wisconsin was already lower than the national average when Walker took office. In other words, Walker inherited a good labor market. If we want to measure Walker’s effect on the Wisconsin economy, we need to look at changes in the unemployment rate over time.


A few weeks ago I helped organize and instruct a Software Carpentry workshop geared towards social scientists, with great help from folks at UW-Madison’s Advanced Computing Institute. Aside from a few tweaked examples (e.g. replacing an example using fake cochlear implant data with one using fake survey data), the curriculum was largely the same as the standard one. The Software Carpentry curriculum is designed to help researchers, mostly in STEM fields, write code for reproducibility and collaboration. There’s instruction in the Unix shell, a scripting language of your choice (we did Python), and collaboration with Git.

We had a good mix of folks at the workshop, ranging from those who had some familiarity with coding to those who had zero experience. There were a number of questions at the workshop about how folks could use these tools in their research, a lot of them coming from qualitative researchers.

I was curious about what other ways researchers who use qualitative methods could incorporate programming into their research routine. So I took to Facebook and Twitter.


In network analysis, blockmodels provide a simplified representation of a more complex relational structure. The basic idea is to assign each actor to a position and then depict the relationship between positions. In settings where relational dynamics are sufficiently routinized, the relationship between positions neatly summarizes the relationship between sets of actors. How do we go about assigning actors to positions? Early work on this problem focused in particular on the concept of structural equivalence. Formally speaking, a pair of actors is said to be structurally equivalent if they are tied to the same set of alters. Note that by this definition, a pair of actors can be structurally equivalent without being tied to one another. This idea is central to debates over the role of cohesion versus equivalence.
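To make the definition concrete, here is a minimal sketch in base R. The network is hypothetical toy data, and the helper `struct_equiv` is my own illustrative name rather than a function from sna or igraph. It also assumes one common convention, namely that the pair's ties to each other are ignored when comparing their ties to everyone else.

```r
# Toy directed network (hypothetical data): rows send ties, columns receive them
actors <- c("A", "B", "C", "D")
adj <- matrix(c(0, 0, 1, 1,
                0, 0, 1, 1,
                1, 1, 0, 0,
                1, 1, 0, 0),
              nrow = 4, byrow = TRUE, dimnames = list(actors, actors))

# Illustrative helper: two actors are structurally equivalent if they send
# ties to and receive ties from the same alters (their ties to each other
# are left out of the comparison)
struct_equiv <- function(adj, i, j) {
  others <- setdiff(rownames(adj), c(i, j))
  all(adj[i, others] == adj[j, others]) &&
    all(adj[others, i] == adj[others, j])
}

struct_equiv(adj, "A", "B")  # TRUE: A and B are tied to the same alters
struct_equiv(adj, "A", "C")  # FALSE
adj["A", "B"]                # 0: A and B are not tied to one another
```

Here A and B occupy the same position even though there is no tie between them, which is exactly the case that separates equivalence from cohesion.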

In practice, actors are almost never exactly structurally equivalent to one another. To get around this problem, we first measure the degree of structural equivalence between each pair of actors and then use these measures to look for groups of actors who are roughly comparable to one another. Structural equivalence can be measured in a number of different ways, with correlation and Euclidean distance emerging as popular options. Similarly, there are a number of methods for identifying groups of structurally equivalent actors. The equiv.clust routine included in the sna package in R, for example, relies on hierarchical cluster analysis (HCA). While the designation of positions is less cut and dried, one can use multidimensional scaling (MDS) in a similar manner. MDS and HCA can also be used in combination, with the former serving as a form of pre-processing. Either way, once clusters of structurally equivalent actors have been identified, we can construct a reduced graph depicting the relationship between the resulting groups.
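The steps above can be sketched in base R without any network packages. This is a rough illustration on hypothetical toy data, not a stand-in for equiv.clust. The helper `block_density` and the choice to describe each actor by the ties it sends and receives are assumptions I am making for the example.

```r
# Toy directed network (hypothetical data): A, B, C send ties to D, E, F,
# with one tie (C to F) missing so that equivalence is only approximate
actors <- c("A", "B", "C", "D", "E", "F")
adj <- matrix(0, 6, 6, dimnames = list(actors, actors))
adj[c("A", "B", "C"), c("D", "E", "F")] <- 1
adj["C", "F"] <- 0

# Describe each actor by the ties it sends (row) and receives (column)
profiles <- cbind(adj, t(adj))

# Two common measures of structural equivalence
eq_cor  <- cor(t(profiles))  # correlation between actor profiles
eq_dist <- dist(profiles)    # Euclidean distance between actor profiles

# Hierarchical clustering on the distances, cut into two positions
hc <- hclust(eq_dist, method = "complete")
blocks <- cutree(hc, k = 2)

# Illustrative helper: tie density within and between positions,
# ignoring the diagonal (self-ties)
block_density <- function(adj, blocks) {
  ids <- sort(unique(blocks))
  dens <- matrix(NA_real_, length(ids), length(ids), dimnames = list(ids, ids))
  off_diag <- row(adj) != col(adj)
  for (r in ids) {
    for (s in ids) {
      cells <- outer(blocks == r, blocks == s) & off_diag
      dens[as.character(r), as.character(s)] <- mean(adj[cells])
    }
  }
  dens
}

blocks                               # A, B, C in position 1; D, E, F in position 2
dens <- block_density(adj, blocks)
round(dens, 2)                       # the reduced graph as a density table
```

The density table is the reduced graph in matrix form: nearly all possible ties run from position 1 to position 2, and none run in any other direction.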

Yet the most prominent examples of blockmodeling were built not on HCA or MDS, but on an algorithm known as CONCOR. The algorithm takes its name from the simple trick on which it is based, namely the CONvergence of iterated CORrelations. We are all familiar with the idea of using correlation to measure the similarity between columns of a data matrix. As it turns out, you can also use correlation to measure the degree of similarity between the columns of the resulting correlation matrix. In other words, you can use correlation to measure the similarity of similarities. If you repeat this procedure over and over, you eventually end up with a matrix whose entries take on one of two values: 1 or -1. The final matrix can then be permuted to produce blocks of 1s and -1s, with each block representing a group of structurally equivalent actors. Dividing the original data accordingly, each of these groups can be further partitioned to produce a more fine-grained solution.
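The iteration itself takes only a few lines of base R. What follows is a bare-bones sketch of a single CONCOR split on hypothetical toy data, not the routine discussed later in this post. The function name `concor_split`, the stopping rule, and the decision to stack the matrix on top of its transpose are all assumptions made for illustration, and the sketch assumes that no column of the stacked matrix is constant.

```r
# Toy directed network (hypothetical data): A, B, C send ties to D, E, F,
# with the tie from C to F missing
actors <- c("A", "B", "C", "D", "E", "F")
adj <- matrix(0, 6, 6, dimnames = list(actors, actors))
adj[c("A", "B", "C"), c("D", "E", "F")] <- 1
adj["C", "F"] <- 0

# Illustrative helper: correlate the columns, then keep correlating the
# resulting correlation matrix until it stops changing
concor_split <- function(m, max_iter = 100, tol = 1e-8) {
  r <- cor(m)
  for (i in seq_len(max_iter)) {
    r_new <- cor(r)
    converged <- max(abs(r_new - r)) < tol
    r <- r_new
    if (converged) break
  }
  # Actors positively correlated with the first actor form one block
  list(r = r, blocks = ifelse(r[, 1] > 0, 1L, 2L))
}

# Stack ties received on top of ties sent so each column describes one actor
stacked <- rbind(adj, t(adj))
res <- concor_split(stacked)

round(res$r)   # entries are 1 or -1 once the iteration has converged
res$blocks     # A, B, C in one block; D, E, F in the other
```

To get a finer partition, you would subset the columns of the stacked matrix by block and run the same split again within each block.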

Insofar as CONCOR uses correlation both as a measure of structural equivalence and as a means of identifying groups of structurally equivalent actors, it is easy to forget that blockmodeling with CONCOR entails the same basic steps as blockmodeling with HCA. The logic behind the two procedures is identical. Indeed, Breiger, Boorman, and Arabie (1975) explicitly describe CONCOR as a hierarchical clustering algorithm. Note, however, that when it comes to measuring structural equivalence, CONCOR relies exclusively on the use of correlation, whereas HCA can be made to work with most common measures of (dis)similarity.

Since CONCOR wasn’t available as part of the sna or igraph libraries, I decided to put together my own CONCOR routine. It could probably still use a little work in terms of things like error checking, but there is enough there to replicate the wiring room example included in the piece by Breiger et al. Check it out! The program and sample data are available on my GitHub page. If you have devtools installed, you can download everything directly using R. At the moment, the concor_hca command is only set up to handle one-mode data, though this can be easily fixed. In an earlier version of the code, I included a second function for calculating tie densities, but I think it makes more sense to use concor_hca to generate a membership vector which can then be passed to the blockmodel command included as part of the sna library.




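# blockmodel() comes from the sna library
library(sna)

# Initial correlations between actors, computed on the stacked relations
m0 <- cor(do.call(rbind, bank_wiring))
round(m0, 2)

# Two successive splits (p = 2) yield a four-block solution
blks <- concor_hca(bank_wiring, p = 2)

# Fit a blockmodel with sna; this fails unless glabels are specified
blk_mod <- blockmodel(bank_wiring, blks$block,
     glabels = names(bank_wiring),
     plabels = rownames(bank_wiring[[1]]))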

The results are shown below. If you click on the image, you should be able to see all the labels.


[Note: I do realize that this event was nearly two months ago. I have no one to blame but the academic job market.]

On August 15 and 16, we held the first annual ASA Datathon at the D-Lab at Berkeley. Nearly 25 people from academia, industry, and government participated in the 24-hour hack session. The datathon focused on open city data and methods, and questions centered on issues such as gentrification, transit, and urban change.

Two of our sponsors kicked off the event with useful presentations on open city data and visualization tools. Mike Rosengarten from OpenGov presented OpenGov’s incredibly detailed and descriptive tools for exploring municipal revenues and budgets, and Matt Sundquist from Plotly showed off the platform’s interactive interface, which works across multiple programming environments.

Fueled by caffeine and great food, six teams hacked away through the night and presented their work on the 16th at the Hilton San Francisco. Our excellent panel of judges picked the three presentations that stood out the most:

Honorable mention: Spurious Correlations

The Spurious Correlations team developed a statistical definition for gentrification and attempted to define which zip codes had been gentrified by their definition. Curious about those doing the gentrifying, they asked if artists acted as “middle gentrifiers.” While this seemed to correlate in Minneapolis, it didn’t hold for San Francisco.

Second place: Team Vélo 

Team Vélo, as the name implies, was interested in bike thefts in San Francisco and crime in general. They used SFPD data to rate crime risk in each neighborhood and tried to understand which factors may be influencing crime rates, including racial diversity, income, and self-employment.

First place: Best Buddies Bus Brigade

Lastly, our first place winners asked “Does SF public transportation underserve those in low-income communities or without cars?” Using San Francisco transit data, they developed a visualization tool to investigate bus load and how this changes by location, conditional on things like car ownership.

You can check out all the presentations at the datathon’s GitHub page.

Laura Nelson, Laura Norén, and I want to give a special thanks to our sponsors: OpenGov, UC Berkeley Sociology, UW Madison Sociology, the D-Lab, SurveyGizmo, the Data Science Toolkit, Duke Network Analysis Center, Plotly, orgtheory, Fabio Rojas, Neal Caren, and Pam Oliver.

This is a guest post by Matt Sundquist. Matt studied philosophy at Harvard and is a Co-founder at Plotly. He previously worked for Facebook’s Privacy Team, has been a Fulbright Scholar in Argentina and a Student Fellow of the Harvard Law School Program on the Legal Profession, and wrote about the Supreme Court for

Emailing code, data, graphs, files, and folders around is painful (see below). Discussing all these different objects and translating between languages, versions, and file types makes it worse. We’re working on a project called Plotly aimed at solving this problem. The goal is to be a platform for delightful, web-based, language-agnostic plotting and collaboration. In this post, we’ll show how it works for ggplot2 and R.




A first Plotly ggplot2 plot


Let’s make a plot from the ggplot2 cheatsheet. You can copy and paste this code or sign up for Plotly and get your own key. It’s free, you own your data, and you control your privacy (the setup is quite like GitHub).


install.packages("devtools") # so we can install from GitHub
library(devtools)
install_github("ropensci/plotly") # plotly is part of the ropensci project
library(plotly)
library(ggplot2)
py <- plotly("RgraphingAPI", "ektgzomjbx")  # initiate plotly graph object

# simulate two groups of 1,500 draws each
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)
plot <- ggplot(xy, aes(xvar)) + geom_histogram()  # histogram of xvar
py$ggplotly()  # add this to your ggplot2 script to call plotly


By adding the final line of code, I get the same plot drawn in the browser, and it is also shown in an iframe below. If you re-make this plot, you’ll see that we’ve styled it in Plotly’s GUI. Beyond editing, sharing, and exporting, we can also add a fit. The plot is interactive and drawn with D3.js, a popular JavaScript visualization library. You can zoom by clicking and dragging, pan, and see text on the hover by mousing over the plot.



Here is how we added a fit and can edit the figure:




Your Rosetta Stone for translating figures

When you share a plot or add collaborators, you’re sharing an object that contains your data, plot, comments, revisions, and the code to re-make the plot from a few languages. The plot is also added to your profile. I like Wired writer Rhett Allain’s profile:

You can export the figure from the GUI, via an API call, or with a URL. You can also access and share the script to make the exact same plot in different languages, and embed the plot in an iframe, Notebook (see this plot in an IPython Notebook), or webpage like we’ve done for the above plot.

To add or edit data in the figure, we can upload or copy and paste data in the GUI, or append data using R.

Or call the figure in R:

py <- plotly("ggplot2examples", "3gazttckd7")
figure <- py$get_figure("MattSundquist", 1339)

And call the data:

The same routine works from other languages and with any plot. You can share figures and data between a GUI, Python, R, MATLAB, Julia, Excel, Dropbox, Google Drive, and SAS files.

Three final thoughts

  • Why did we build wrappers? Well, we originally set out to build our own syntax. You can use our syntax, which gives you access to the entirety of Plotly’s graphing library. However, we quickly heard from folks that it would be more convenient to be able to translate their figures to the web from libraries they were already using.
  • Thus, Plotly has APIs for R, Julia, Python, MATLAB, and Node.js; supports LaTeX; and has figure converters for sharing plots from ggplot2, matplotlib, and Igor Pro. You can also translate figures from Seaborn, prettyplotlib, and ggplot for Python, as shown in this IPython Notebook. Then, if you’d like, you can use our native syntax or the GUI to edit or make 3D graphs and streaming graphs.
  • We’ve tried to keep the graphing library flexible. So while Plotly doesn’t natively support network visualizations (see what we support below), you can make them with MATLAB and Julia, as Benjamin Lind recently demonstrated on this blog. The same is true with maps. If you hit a wall, have feedback, or have questions, let us know. We’re at feedback at plot dot ly and @plotlygraphs.

The past two years we’ve had our own Bad Hessian shindig, to much win and excitement. This year we’re going to leech off other events and call them our own.

The first will be the after party to the ASA Datathon. We don’t actually have a place for this yet, but judging will take place on Saturday, August 16, 6:30-8:30 PM in the Hilton Union Square, Fourth Floor, Rooms 3-4. So block out 8:30-onwards for Bad Hessian party times.

The second place you can catch us is with the rest of the sociology blog crowd at Trocadero Club, Sunday, August 17, at 5:30 PM.

If you haven’t had enough, you can probably catch many of us at ASA Karaoke 2014: Computational Karaoke in the Age of Big Data. Bonus points for singing the most “big data” of songs.

This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about data science and related technologies. He has blogged at Bad Hessian before.

WordPress Stats – Visitors vs. Views

For those of you who have WordPress blogs with the Jetpack Stats module installed, this chart will be intimately familiar. There’s nothing particularly special about it, other than that you don’t usually see bar charts with the bars superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).


This is a guest post by Monica Lee and Dan Silver. Monica is a Doctoral Candidate in Sociology and Harper Dissertation Fellow at the University of Chicago. Dan is an Assistant Professor of Sociology at the University of Toronto. He received his PhD from the Committee on Social Thought at the University of Chicago.

For the past few months, we’ve been doing some research on musical genres and musical unconventionality.  We’re presenting it at a conference soon and hope to get some initial feedback on the work.

This project is inspired by the Boss, rock legend Bruce Springsteen.  During his keynote speech at the 2012 South-by-Southwest Music Festival in Austin, TX, Springsteen reflected on the potentially changing role of genre classifications for musicians.  In Springsteen’s youth, “there wasn’t much music to play.  When I picked up the guitar, there was only ten years of Rock history to draw on.”  Now, “no one really hardly agrees on anything in pop anymore.”  That American popular music lacks a center is evident in a massive proliferation in genre classifications:

“There are so many sub–genres and fashions, two–tone, acid rock, alternative dance, alternative metal, alternative rock, art punk, art rock, avant garde metal, black metal, Christian metal, heavy metal, funk metal, bland metal, medieval metal, indie metal, melodic death metal, melodic black metal, metal core…psychedelic rock, punk rock, hip hop, rap rock, rap metal, Nintendo core [he goes on for quite a while]… Just add neo– and post– to everything I said, and mention them all again. Yeah, and rock & roll.”
