The graph above recently appeared as part of Scott Walker’s Twitter feed. Presumably, the idea is to suggest that under Walker’s leadership, Wisconsin has done better than the country as a whole when it comes to unemployment, though an alternative version of the ad makes it somewhat more personal, using the same basic figures to suggest that Walker—a Republican presidential candidate—is outperforming sitting Democratic president Barack Obama. In these ads, the Walker campaign repeatedly highlights the fact that the unemployment rate in Wisconsin is lower than the national average. Note, however, that the unemployment rate in Wisconsin was already lower than the national average when Walker took office. In other words, Walker inherited a good labor market. If we want to measure Walker’s effect on the Wisconsin economy, we need to look at changes in the unemployment rate over time.
A few weeks ago I helped organize and instruct a Software Carpentry workshop geared towards social scientists, with great help from folks at UW-Madison’s Advanced Computing Institute. Aside from tweaking a few examples (e.g. replacing an example using fake cochlear implant data with one using fake survey data), the curriculum was largely the same. The Software Carpentry curriculum is made to help researchers, mostly in STEM fields, write code for reproducibility and collaboration. There’s instruction in the Unix shell, a scripting language of your choice (we did Python), and collaboration with Git.
We had a good mix of folks at the workshop, ranging from those who had some familiarity with coding to those who had zero experience. There were a number of questions at the workshop about how folks could use these tools in their research, a lot of them coming from qualitative researchers.
I was curious about what other ways researchers who use qualitative methods could incorporate programming into their research routine. So I took to Facebook and Twitter.
In network analysis, blockmodels provide a simplified representation of a more complex relational structure. The basic idea is to assign each actor to a position and then depict the relationship between positions. In settings where relational dynamics are sufficiently routinized, the relationship between positions neatly summarizes the relationship between sets of actors. How do we go about assigning actors to positions? Early work on this problem focused in particular on the concept of structural equivalence. Formally speaking, a pair of actors is said to be structurally equivalent if they are tied to the same set of alters. Note that by this definition, a pair of actors can be structurally equivalent without being tied to one another. This idea is central to debates over the role of cohesion versus equivalence.
In practice, actors are almost never exactly structurally equivalent to one another. To get around this problem, we first measure the degree of structural equivalence between each pair of actors and then use these measures to look for groups of actors who are roughly comparable to one another. Structural equivalence can be measured in a number of different ways, with correlation and Euclidean distance emerging as popular options. Similarly, there are a number of methods for identifying groups of structurally equivalent actors. The equiv.clust routine included in the sna package in R, for example, relies on hierarchical cluster analysis (HCA). While the designation of positions is less cut and dry, one can use multidimensional scaling (MDS) in a similar manner. MDS and HCA can also be used in combination, with the former serving as a form of pre-processing. Either way, once clusters of structurally equivalent actors have been identified, we can construct a reduced graph depicting the relationship between the resulting groups.
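To make these steps concrete, here is a minimal sketch using only base R rather than equiv.clust. The toy network (two three-person cliques) and the object names are made up for illustration. It measures structural equivalence with Euclidean distance and with correlation, clusters the actors with HCA, and then computes tie densities for the reduced graph.

```r
# Toy one-mode network: two three-person cliques (illustrative data only)
adj <- matrix(0, 6, 6, dimnames = list(letters[1:6], letters[1:6]))
adj[1:3, 1:3] <- 1
adj[4:6, 4:6] <- 1
diag(adj) <- 0

# Two common measures of (dis)similarity between actors' tie profiles
d_euclid <- dist(adj, method = "euclidean")   # Euclidean distance between rows
d_cor <- as.dist(1 - cor(t(adj)))             # 1 minus the row correlation

# Hierarchical cluster analysis, cut into two positions
hc <- hclust(d_euclid, method = "complete")
positions <- cutree(hc, k = 2)

# Reduced graph: density of ties within and between positions
k <- max(positions)
dens <- matrix(NA_real_, k, k)
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    block <- adj[positions == i, positions == j, drop = FALSE]
    # self-ties are not possible, so drop the diagonal within a position
    n_possible <- if (i == j) length(block) - nrow(block) else length(block)
    dens[i, j] <- sum(block) / n_possible
  }
}
positions
dens
```

Swapping d_cor for d_euclid in the call to hclust gives the correlation-based version of the same analysis.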
Yet the most prominent examples of blockmodeling are built not on HCA or MDS, but on an algorithm known as CONCOR. The algorithm takes its name from the simple trick on which it is based, namely the CONvergence of iterated CORrelations. We are all familiar with the idea of using correlation to measure the similarity between columns of a data matrix. As it turns out, you can also use correlation to measure the degree of similarity between the columns of the resulting correlation matrix. In other words, you can use correlation to measure the similarity of similarities. If you repeat this procedure over and over, you eventually end up with a matrix whose entries take on one of two values: 1 or -1. The final matrix can then be permuted to produce blocks of 1s and -1s, with each block representing a group of structurally equivalent actors. Dividing the original data accordingly, each of these groups can be further partitioned to produce a more fine-grained solution.
Insofar as CONCOR uses correlation as both a measure of structural equivalence and a means of identifying groups of structurally equivalent actors, it is easy to forget that blockmodeling with CONCOR entails the same basic steps as blockmodeling with HCA. The logic behind the two procedures is identical. Indeed, Breiger, Boorman, and Arabie (1975) explicitly describe CONCOR as a hierarchical clustering algorithm. Note, however, that when it comes to measuring structural equivalence, CONCOR relies exclusively on the use of correlation, whereas HCA can be made to work with most common measures of (dis)similarity.
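The iteration itself takes only a few lines of base R. The sketch below is a bare-bones illustration of a single two-way split, not the code from any package. The function name concor_split, its arguments, and the toy two-clique network are all made up for this example.

```r
# Sketch of CONCOR's core step: correlate the columns again and again
# until every entry is (numerically) 1 or -1, then split on the sign.
concor_split <- function(m, max_iter = 100, tol = 1e-8) {
  r <- cor(m)                          # similarity between columns
  for (i in seq_len(max_iter)) {
    # stop once all entries are within tol of 1 or -1
    if (isTRUE(all(abs(abs(r) - 1) < tol))) break
    r <- cor(r)                        # similarity of similarities
  }
  # columns positively correlated with the first column form block 1
  ifelse(r[, 1] > 0, 1L, 2L)
}

# Toy network: two three-person cliques (illustrative data only)
adj <- matrix(0, 6, 6)
adj[1:3, 1:3] <- 1
adj[4:6, 4:6] <- 1
diag(adj) <- 0

blocks <- concor_split(adj)
blocks
```

Applying the same function to the rows and columns of each resulting block would give the finer partitions described above. Note that columns with no variation have undefined correlations, so real data may need extra checks that this sketch leaves out.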
Since CONCOR wasn’t available as part of the igraph libraries, I decided to put together my own CONCOR routine. It could probably still use a little work in terms of things like error checking, but there is enough there to replicate the wiring room example included in the piece by Breiger et al. Check it out! The program and sample data are available on my GitHub page. If you have devtools installed, you can download everything directly using R. At the moment, the concor_hca command is only set up to handle one-mode data, though this can be easily fixed. In an earlier version of the code, I included a second function for calculating tie densities, but I think it makes more sense to use concor_hca to generate a membership vector which can then be passed to the blockmodel command included as part of the sna package.
    #REPLICATE BREIGER ET AL. (1975)

    #INSTALL CONCOR
    devtools::install_github("aslez/concoR")

    #LIBRARIES
    library(concoR)
    library(sna)

    #LOAD DATA
    data(bank_wiring)
    bank_wiring

    #CHECK INITIAL CORRELATIONS (TABLE III)
    m0 <- cor(do.call(rbind, bank_wiring))
    round(m0, 2)

    #IDENTIFY BLOCKS USING A 4-BLOCK MODEL (TABLE IV)
    blks <- concor_hca(bank_wiring, p = 2)
    blks

    #CHECK FIT USING SNA (TABLE V)
    #code below fails unless glabels are specified
    blk_mod <- blockmodel(bank_wiring, blks$block,
                          glabels = names(bank_wiring),
                          plabels = rownames(bank_wiring[[1]]))
    blk_mod
    plot(blk_mod)
The results are shown below. If you click on the image, you should be able to see all the labels.
[Note: I do realize that this event was nearly two months ago. I have no one to blame but the academic job market.]
On August 15 and 16, we held the first annual ASA Datathon at the D-Lab at Berkeley. Nearly 25 people from academia, industry, and government participated in the 24-hour hack session. The datathon focused on open city data and methods, and the questions addressed issues such as gentrification, transit, and urban change.
Two of our sponsors kicked off the event by giving some useful presentations on open city data and visualization tools. Mike Rosengarten from OpenGov presented on OpenGov’s incredibly detailed and descriptive tools for exploring municipal revenues and budgets. And Matt Sundquist from plot.ly showed off the platform’s interactive interface which works across multiple programming environments.
Fueled by caffeine in various forms and great food, six teams hacked away through the night and presented their work on the 16th at the Hilton San Francisco. Our excellent panel of judges picked the three presentations that stood out the most:
Honorable mention: Spurious Correlations
The Spurious Correlations team developed a statistical definition for gentrification and attempted to define which zip codes had been gentrified by their definition. Curious about those doing the gentrifying, they asked if artists acted as “middle gentrifiers.” While this seemed to correlate in Minneapolis, it didn’t hold for San Francisco.
Second place: Team Vélo
Team Vélo, as the name implies, was interested in bike thefts in San Francisco and crime in general. They used SFPD data to rate crime risk in each neighborhood and tried to understand which factors may be influencing crime rates, including racial diversity, income, and self-employment.
First place: Best Buddies Bus Brigade
Lastly, our first place winners asked “Does SF public transportation underserve those in low-income communities or without cars?” Using San Francisco transit data, they developed a visualization tool to investigate bus load and how this changes by location, conditional on things like car ownership.
You can check out all the presentations at the datathon’s GitHub page.
Laura Nelson, Laura Norén, and I want to give a special thanks to our sponsors: OpenGov, UC Berkeley Sociology, UW Madison Sociology, the D-Lab, SurveyGizmo, the Data Science Toolkit, Duke Network Analysis Center, plot.ly, orgtheory, Fabio Rojas, Neal Caren, and Pam Oliver.
This is a guest post by Matt Sundquist. Matt studied philosophy at Harvard and is a Co-founder at Plotly. He previously worked for Facebook’s Privacy Team, has been a Fulbright Scholar in Argentina and a Student Fellow of the Harvard Law School Program on the Legal Profession, and wrote about the Supreme Court for SCOTUSblog.com.
Emailing code, data, graphs, files, and folders around is painful (see below). Discussing all these different objects and translating between languages, versions, and file types makes it worse. We’re working on a project called Plotly aimed at solving this problem. The goal is to be a platform for delightful, web-based, language-agnostic plotting and collaboration. In this post, we’ll show how it works for ggplot2 and R.
A first Plotly ggplot2 plot
Let’s make a plot from the ggplot2 cheatsheet. You can copy and paste this code or sign up for Plotly and get your own key. It’s free, you own your data, and you control your privacy (the setup is quite like GitHub).
    install.packages("devtools")  # so we can install from github
    library("devtools")
    install_github("ropensci/plotly")  # plotly is part of the ropensci project
    library(plotly)
    py <- plotly("RgraphingAPI", "ektgzomjbx")  # initiate plotly graph object

    library(ggplot2)
    library(gridExtra)

    set.seed(10005)
    xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
    yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
    zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
    xy <- data.frame(xvar, yvar, zvar)

    plot <- ggplot(xy, aes(xvar)) + geom_histogram()
    py$ggplotly()  # add this to your ggplot2 script to call plotly
Here is how we added a fit and can edit the figure:
Your Rosetta Stone for translating figures
    py <- plotly("ggplot2examples", "3gazttckd7")
    figure <- py$get_figure("MattSundquist", 1339)
    str(figure)
The same routine works from other languages and for any plot. You can share figures and data between a GUI, Python, R, MATLAB, Julia, Excel, Dropbox, Google Drive, and SAS files.
Three final thoughts
- Why did we build wrappers? Well, we originally set out to build our own syntax. You can use our syntax, which gives you access to the entirety of Plotly’s graphing library. However, we quickly heard from folks that it would be more convenient to be able to translate their figures to the web from libraries they were already using.
- Thus, Plotly has APIs for R, Julia, Python, MATLAB, and Node.js; supports LaTeX; and has figure converters for sharing plots from ggplot2, matplotlib, and Igor Pro. You can also translate figures from Seaborn, prettyplotlib, and ggplot for Python, as shown in this IPython Notebook. Then, if you’d like, you can use our native syntax or the GUI to edit or make 3D graphs and streaming graphs.
- We’ve tried to keep the graphing library flexible. So while Plotly doesn’t natively support network visualizations (see what we support below), you can make them with MATLAB and Julia, as Benjamin Lind recently demonstrated on this blog. The same is true with maps. If you hit a wall, have feedback, or have questions, let us know. We’re at feedback at plot dot ly and @plotlygraphs.
The past two years we’ve had our own Bad Hessian shindig, to much win and excitement. This year we’re going to leech off other events and call them our own.
The first will be the after party to the ASA Datathon. We don’t actually have a place for this yet, but judging will take place on Saturday, August 16, 6:30-8:30 PM in the Hilton Union Square, Fourth Floor, Rooms 3-4. So block out 8:30-onwards for Bad Hessian party times.
The second place you can catch us is with the rest of the sociology blog crowd at Trocadero Club, Sunday, August 17, at 5:30 PM.
If you haven’t had enough, you can probably catch many of us at ASA Karaoke 2014: Computational Karaoke in the Age of Big Data. Bonus points for singing the most “big data” of songs.
This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He’s blogged at Bad Hessian before here.
If you have a WordPress blog with the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about it, other than that you don’t usually see bar charts with the bars superimposed.
I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found (download the data).
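As a rough idea of what superimposed bars involve, here is a minimal sketch in base R graphics. It is not the code from this post, and the daily counts are made-up numbers for illustration rather than the downloadable data. The trick is simply to draw a second set of bars on top of the first with add = TRUE.

```r
# Made-up daily counts, for illustration only (not the post's data)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
views <- c(120, 150, 90, 200, 170)
visitors <- c(80, 95, 60, 130, 110)

# Draw the taller series first, leaving a little headroom on the y-axis
mids_views <- barplot(views, names.arg = days, col = "lightblue",
                      border = NA, ylim = c(0, max(views) * 1.1),
                      ylab = "Count")

# Superimpose the second series on the same bar positions
mids_visitors <- barplot(visitors, col = "steelblue4", border = NA,
                         add = TRUE, axes = FALSE)

legend("topleft", legend = c("Views", "Visitors"),
       fill = c("lightblue", "steelblue4"), bty = "n")
```

Because both calls use the default bar width and spacing, the two sets of bars line up exactly. The shorter series has to be drawn last or it will be hidden.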
This is a guest post by Monica Lee and Dan Silver. Monica is a Doctoral Candidate in Sociology and Harper Dissertation Fellow at the University of Chicago. Dan is an Assistant Professor of Sociology at the University of Toronto. He received his PhD from the Committee on Social Thought at the University of Chicago.
For the past few months, we’ve been doing some research on musical genres and musical unconventionality. We’re presenting it at a conference soon and hope to get some initial feedback on the work.
This project is inspired by the Boss, rock legend Bruce Springsteen. During his keynote speech at the 2012 South-by-Southwest Music Festival in Austin, TX, Springsteen reflected on the potentially changing role of genre classifications for musicians. In Springsteen’s youth, “there wasn’t much music to play. When I picked up the guitar, there was only ten years of Rock history to draw on.” Now, “no one really hardly agrees on anything in pop anymore.” That American popular music lacks a center is evident in a massive proliferation in genre classifications:
“There are so many sub–genres and fashions, two–tone, acid rock, alternative dance, alternative metal, alternative rock, art punk, art rock, avant garde metal, black metal, Christian metal, heavy metal, funk metal, bland metal, medieval metal, indie metal, melodic death metal, melodic black metal, metal core…psychedelic rock, punk rock, hip hop, rap rock, rap metal, Nintendo core [he goes on for quite a while]… Just add neo– and post– to everything I said, and mention them all again. Yeah, and rock & roll.”
As ASA gets closer, so does the first ASA Datathon!
We’re on from 1pm August 15 through 1pm the 16th at Berkeley’s D-Lab. Public presentations and judging will take place at one of the ASA conference hotels, the Hilton Union Square, Room 3-4, Fourth Floor from 6:30-8:15 on August 16th.
Signing up will give us a better idea of who will be at the event and how many folks we can expect to feed and caffeinate. We’re also going to give teams a week to get to know each other before the event, so signing up will allow us to make sure everyone gets the same amount of time to work.
If you’re interested, you are invited. We don’t discriminate against particular methodologies or backgrounds. We hope to have social scientists, data scientists, computer scientists, municipal staffers, start-up employees, grad students, and data hackers of all stripes – quantitative, qualitative, and the methodologically agnostic.
With Season 6 of RuPaul’s Drag Race in the books and the new queen crowned, it’s time to reflect on how our pre-season forecasts did. In February I posted a wiki survey asking who would win this season before the first episode had aired. I posted this to reddit’s r/rupaulsdragrace, Twitter, and Facebook, and it generated an impressive 15,632 votes from 435 unique user sessions, which means the average survey taker did a little under 36 pairwise comparisons.
The plot below shows the results. The x-axis is the score assigned by the All Our Ideas statistical model and can be interpreted as the chance that “idea” 1 (or, in this case, queen 1) will win if pitted at random against idea 2. The color shows how close the wiki survey got to the actual rank. The paler the dot, the closer. Bluer dots mean the wiki survey overestimated the queen, while redder dots mean it underestimated her.
So how did the wiki survey do? Not terrible. Courtney Act was a clear frontrunner and had a lot of star power to carry her to the end. Bianca was a close second in the wiki survey and finally outshone her when it came to the final. These two are relatively close to each other in score. This was actually the first season in which two queens never had to lipsync. Ben DeLaCreme is ranked third in the survey, although she came in fifth. Little surprise she was voted Miss Congeniality.
After that, it gets interesting. Milk was ranked fourth by the survey, but came in 9th on the show. I’m thinking her quirkiness may have given folks the impression that she could go much further than she actually did. Adore, one of the top three, comes in fifth on the survey, rather close to her friend Laganja.
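The over- and underestimation behind the dot colors comes down to comparing two ranks. Here is a small sketch using only the ranks mentioned in the text so far, with Bianca counted as first since she took the crown. The column names are made up for illustration, and this is not the code used to draw the plot.

```r
# Survey and actual ranks as stated in the text (three queens only)
ranks <- data.frame(
  queen  = c("Bianca", "Ben DeLaCreme", "Milk"),
  survey = c(2, 3, 4),
  actual = c(1, 5, 9)
)

# Positive error: the queen finished worse than the survey predicted
ranks$error <- ranks$actual - ranks$survey
ranks$direction <- ifelse(ranks$error > 0, "overestimated",
                   ifelse(ranks$error < 0, "underestimated", "exact"))
ranks
```

In the terms used for the plot, overestimated queens would be the bluer dots and underestimated queens the redder ones, with the size of the error setting how pale the dot is.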
April Carrion and Kelly Mantle were expected to go far, but got the chop relatively early on. Darienne was a dark horse in this competition, ending up in fourth place when pre-season fans thought she’d be middling.
Lastly, Joslyn and Trinity are the biggest success stories of season 6. They had a surprising amount of staying power when folks thought they wouldn’t make it out of the first month.
So what can we learn from this? Well, for one, for a more or less staged reality show, I’m somewhat impressed by how well these rankings came out. Unlike using wiki surveys for sports forecasting, we have no prior information on contestants from season to season. Prior seasons give us no information about contestants (unless you consider something like “drag lineages”, e.g. Laganja is Alyssa Edwards’s drag daughter). All information comes from the domain expertise of drag aficionados. Courtney and Bianca were already widely regarded drag stars in their own right before the competition. Although this didn’t seem to be the case with other seasons, it seems like there was a strong Matthew effect at work this time. Is this the new normal as more well-known queens start competing?