I’m a big fan of Drew Conway’s Data Science Venn Diagram, in which he outlines the three intersecting spheres of skill the data scientist needs: hacking skills, math and statistics knowledge, and substantive expertise. I’ve used this framework in thinking through how to bring more sociologists into computational methods, which has mostly been a matter of getting them to learn how to hack, or at least to see the virtues of hacking even if they don’t have a taste for it themselves.

But what I think the diagram is missing, or what at least gets buried beneath the surface, is knowledge of the processes of data production. This is a subtler point that tends to get lumped in with “substantive expertise,” but I want to draw the line out as explicitly as possible, because I think this is one of data science’s weaker flanks, and one of the places where it needs to be strengthened if it is to gain more acceptance within the social sciences.

I began thinking along these lines in part because of a discussion that has been unfolding over at Language Log on computational linguistics and literary scholarship.

A post by Hannah Alpert-Abrams and Dan Garrette comments on interdisciplinary collaboration and some of the issues that come up in trying to cross disciplinary boundaries. It’s a response to work by computational linguists at CMU (Bamman et al.), “Learning Latent Personas of Film Characters.” In that paper, the authors use textual analysis methods to infer latent character types, or “personas,” from the way film characters are described in Wikipedia plot summaries (I sketch a toy version of this kind of pipeline after the quote below). I haven’t read the paper and have very little knowledge of film theory, but what I found to be the most salient critique was the following:

When we look at Wikipedia entries about film, for example, we would not expect to find universal, latent character personas. This is because Wikipedia entries are not simply transcripts of films: they are written by a community that talks about film in a specific way. The authors are typically male, young, white, and educated; their descriptive language is informed by their cultural context. In fact, to generalize from the language of this community is to make the same mistake as Campbell and Jung by treating the worldview of an empowered elite as representative of the world at large.

To build a model based on Wikipedia entries, then, is to build a model that reveals not how films work, but how a specific subcategory of the population talks about film.
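To make the method concrete, here is a toy sketch of the kind of pipeline at issue. This is emphatically not Bamman et al.’s actual model (they develop a custom Dirichlet persona model); it just runs an off-the-shelf topic model over bag-of-words character descriptions, and every input snippet below is invented for illustration.

```python
# A toy, hypothetical sketch of a "latent personas" pipeline. This is NOT
# Bamman et al.'s actual model (they build a custom Dirichlet persona
# model); it just clusters character-associated words with vanilla LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical input: one bag of words per film character, e.g. the verbs
# and adjectives attached to that character in a Wikipedia plot summary.
character_docs = [
    "escapes fights rescues defeats loyal stoic",
    "schemes betrays manipulates lies charming ruthless",
    "waits weeps comforts supports devoted gentle",
    "fights rescues sacrifices brave loyal",
    "manipulates schemes ruthless cold betrays",
    "comforts supports gentle kind devoted",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(character_docs)

# Fit a small number of latent "personas" (topics over character words).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words defining each inferred persona.
vocab = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [vocab[j] for j in weights.argsort()[::-1][:5]]
    print(f"persona {i}: {', '.join(top)}")
```

Notice that nothing in this pipeline knows anything about film as such; it only knows the words Wikipedia editors chose, which is precisely Alpert-Abrams and Garrette’s point.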

One of the CMU authors, Brendan O’Connor, retorts:

We did not try to make a contribution to contemporary literary theory. Rather, we focus on developing a computational linguistic research method of analyzing characters in stories. We hope there is a place for both the development of new research methods, as well as actual new substantive findings.

Chris at The Lousy Linguist responds, suggesting that this clash may be a product of different publishing cultures in computational linguistics and NLP, on one hand, and in literary studies, on the other. This is a problem I’ve alluded to elsewhere.

But the main point I want to underline here is less about the challenges of interdisciplinary research (on which I’m sure many, many people have written) or the availability of venues that will publish this kind of work. I want to talk about how data scientists need to interrogate where their data come from.

I believe that social scientists tend to think about this, even if only implicitly. A great deal of work in the social sciences has gone into measurement and study design, into scale building and drawing good samples. Yet despite having no lack of data, data scientists don’t seem to spend nearly enough time on these questions. This is one of the points Jen Schradie makes in arguing that “big data isn’t big enough”: the digital divide means that capturing data traces from websites and APIs only gives you part of the picture. And I believe this is one of the big impediments to data science (and by extension, computational social science) gaining acceptance within the social sciences.
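As a crude illustration of what “only part of the picture” means in practice, consider comparing who shows up in a scraped or API-derived sample against an external benchmark such as a census. The sketch below is hypothetical; all the numbers are invented, and a real representativeness check would be far more careful.

```python
# Toy coverage check: compare an API sample's age distribution to a
# population benchmark. Every number here is invented for illustration.
observed = {"18-29": 0.52, "30-49": 0.33, "50+": 0.15}   # shares in the scraped sample
benchmark = {"18-29": 0.21, "30-49": 0.34, "50+": 0.45}  # shares in the population

for group, share in observed.items():
    ratio = share / benchmark[group]
    print(f"{group}: sample/population ratio = {ratio:.2f}")

# Ratios far from 1.0 mark groups the data traces over- or under-represent:
# here the scraped sample badly over-captures the young and misses the old.
```

Ratios like these don’t fix anything on their own, but they force the question of who is generating the data traces in the first place, which is exactly the data-production question I’m arguing belongs in the diagram.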

Thinking again of the Venn diagram, this concern might fall under the heading of “substantive expertise,” but the fit is awkward, because that label implies knowledge of theory, concepts, and cases, while this skill sits squarely among methodological concerns. And while Drew puts the “danger zone” at the intersection of “hacking” and “substantive expertise,” I think there’s another danger zone that forms when we don’t try to understand the processes that produce the data we’re working with.