This is a guest post by Laura K. Nelson. She is a doctoral candidate in sociology at the University of California, Berkeley. She is interested in applying automated text analysis techniques to understand how cultures and logics unify political and social movements. Her current research, funded in part by the NSF, examines these cultures and logics via the long-term development of women’s movements in the United States. She can be reached at lknelson3@berkeley.edu.
Computer-assisted, or automated, text analysis is finally making its way into sociology, as evidenced by the new issue of Poetics devoted to one technique, topic modeling (Poetics 41, 2013). While these methods have been widely used and explored in disciplines like computational linguistics, digital humanities, and, importantly, political science, only recently have sociologists paid attention to them. In my short time using automated text analysis methods I have noticed two recurring issues, both of which I will address in this post. First, when I’ve presented these methods at conferences, and when I’ve seen others present these methods, the same two questions are inevitably asked, and they have indeed come up again in response to this issue (more on this below). If you use these methods, you should have a response. Second, those who are attempting to use these methods are often not aware of the full range of techniques under the automated text analysis umbrella and choose a method based on convenience, not knowledge.
Measuring Language, Measuring Words
Andrew Perrin reviewed the Poetics issue on the blog ScatterPlot and perfectly articulated the two critiques I consistently hear about automated text analysis methods. First, studies like those in Poetics “lack a well-conceptualized theory of language, which leads to some conceptual slippage”, and second, in these studies “there is little attention to the conditions of production of text: whose words, and which words, are written down, archived, and digitized.”
First, on the second critique (the conditions of the production of text): Perrin (and anyone else who makes this critique) is basically cautioning researchers to think about their data. This comment could apply to all social science researchers, and indeed, sociologists using any type of data should consider the processes through which their data were created. In fact, each study in the Poetics issue carefully articulated why the authors chose the particular text they did, and how their data were appropriate for their chosen question. This is the basis of good social science and is not unique to text-based data.
The first critique (the theory of language critique), however, is more specific to text analysis and is thus more interesting. Topic modeling is just one technique in a collection of automated text analysis methods, including machine translation, part-of-speech tagging, spelling and grammar checks, text classification, author detection, and so on, and decades of computational linguistic research have determined which features of the text will most effectively accomplish each task. For example, if you want to automate the translation of English to Spanish, grammar and syntax are extremely important. If you want to know the subject, object, and action of a particular sentence, tagging each word’s part of speech is key. If you want to identify whether two texts were written by two different authors, all you need to look at are pronouns. And, it turns out, if you want to extract information from text, that is, if you want to discover what a text is about, the best approach is the syntax- and grammar-free bag-of-words approach. Every attempt to improve text categorization by adding information about grammar or syntax (for example, using bigrams or trigrams instead of unigrams) has resulted in inconclusive gains at best.
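To make the bag-of-words idea concrete, here is a minimal Python sketch of what "syntax- and grammar-free" means in practice. The two sentences, the whitespace tokenizer, and the helper name `bag_of_ngrams` are my own illustrative assumptions, not anything taken from the Poetics studies.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def bag_of_ngrams(text, n=1):
    """Count contiguous n-grams; n=1 is the ordinary bag of words."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Two toy sentences with the same words in a different order.
a = "the dog bites the man"
b = "the man bites the dog"

unigrams_a, unigrams_b = bag_of_ngrams(a), bag_of_ngrams(b)
bigrams_a, bigrams_b = bag_of_ngrams(a, n=2), bag_of_ngrams(b, n=2)

# Unigram bags ignore word order, so the two sentences look identical.
print(unigrams_a == unigrams_b)  # True
# Bigrams keep a little local order, so the sentences differ.
print(bigrams_a == bigrams_b)    # False
```

The unigram bag records which words appear and how often, which is the information used to work out what a text is about; it cannot say who bit whom.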
So, when Perrin (or any audience member or reviewer) says (for example):
Effectively [topic models assume] that corpora have structures that are discoverable separately from the linguistic structures that enable them (e.g., grammar, syntax, discourse). This theoretical move is necessary in order to license the bag-of-words approach that topic modeling uses, and it turns out to be a very productive move in terms of discerning patterns of word usage across texts. But it’s also probably wrong in terms of an actual theory of language, not just because of word order (a shortcoming several of the articles acknowledge) but because utterances are constrained and enabled by syntactic, grammatical, and discursive structures. Parole is not Langue…it’s important to recognize that, at best, LDA is a statistical model of speech acts (parole), not of language (langue).
I say, so what if it is a model of speech acts? You tell me why this matters (and provide the research to back it up). This “theoretical” move is not simple convenience but is a conscious and well-researched move that allows researchers to effectively and efficiently discover what a text is about, more effectively than reading the text and categorizing it oneself. Computational linguists are very aware that speech acts are not language, and have reams of research to back up their methodological claims.
So, you can indeed abstract words from grammatical context to better understand text; in fact, as of right now it’s the best way to do so. Researchers who use these methods know exactly what they are measuring and what they are not, and can (or should) temper their claims accordingly.
I thus hope sociology can move beyond this “model of language vs. speech acts” critique and move toward utilizing these tools to better understand text, and thus society, which brings me to some automated text analysis 101…
Match Your Methods With Your Question
There are risks with automated content analysis, not because its model of language is technically “wrong”, but because researchers do not fully understand what these methods measure and many believe that topic modeling (or any other method) is a “magic” approach to measuring culture through text.
Topic modeling is in fact only one of many approaches to automated content analysis. Each approach models language differently and measures a specific part of language; each method/model is thus appropriate for different types of questions. Other methods relevant to sociologists include language modeling, word counts, lexical feature selection, supervised machine learning, sentiment analysis, document clustering, and the list goes on.
As with all research, understand your options, understand the assumptions behind each method, and think carefully about which method and model best fit your question and your data; for text analysis, most of the time it will not be topic modeling. Because the Poetics issue was focused on topic modeling, I will focus on it here.
What topic modeling can do:
- It allows categories to arise inductively from text, removing the need to artificially impose a structure on the text prior to analyzing the text itself.
- Similarly, it can be used to discover patterns across text, or categories within text, that you may not be considering or may not be immediately available to a human reader. Topic modeling should be used to uncover latent categories within text to advance our understanding of what the text means and what it says about the social world that created it. Often surprising but analytically useful categories may be revealed.
- It can deal with large and diverse corpora that you as a researcher are struggling to make sense of.
- It can help identify key differences between the topics addressed in different texts, or can identify how topical focus changes over time.
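As a rough illustration of categories arising inductively, the sketch below fits a toy LDA topic model with a bare-bones collapsed Gibbs sampler in plain Python. Everything in it is an assumption made for illustration: the four-document corpus, the function name `fit_lda`, and the hyperparameters `alpha`, `beta`, and `iterations`. It is not the implementation used in any of the Poetics articles, and real work would use an established library.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]           # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]    # word counts per topic
    topic_total = [0] * num_topics                         # total tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Report up to three most frequent words per topic, skipping zero counts.
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return top_words, doc_topic

# Hypothetical mini-corpus, already tokenized; no categories are supplied.
docs = [
    "vote suffrage ballot vote election".split(),
    "suffrage ballot election vote".split(),
    "wage union strike labor wage".split(),
    "labor strike union wage".split(),
]

for k in (2, 3):
    top_words, doc_topic = fit_lda(docs, num_topics=k)
    print(k, top_words)
```

The only inputs are the word counts and the number of topics; the groupings come out of the model rather than from the researcher. Rerunning with a different `num_topics` or `seed` will generally give different word clusters, so the output is a prompt for interpretation, not a final coding scheme.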
What topic modeling cannot do:
- Topic modeling does not find the “one”, only, or “best” way to categorize text. There are hundreds of topic modeling algorithms each producing different word clusters, and within each algorithm varying the number of topics produces different word clusters, ultimately allowing close to infinite ways to categorize text. This method works if it reveals new or productive patterns in the text that will help you understand the phenomenon of interest, not by objectively or definitively categorizing text.
- It does not work if you already have categories in mind. To use an empirical example, in his recent article published in Mobilization, Alex Hanna analyzed whether posts on a political Facebook group wall dealt specifically with different types of mobilization efforts (“Computer-Aided Content Analysis of Digitally Enabled Movements” 18(4): 367-388, 2013). Because he had specific categories in mind, topic modeling would have been entirely unhelpful. Instead, he correctly chose a supervised machine learning approach in which he controlled the categories into which texts were classified, allowing him to directly answer his question.
- It does not determine who does what to whom. If you’re interested in this type of question you’ll need a language model that does incorporate grammar.
- It does not reveal statistical differences between texts or categories. Other methods are needed to do this.
- It definitely will not magically help us understand the black box of culture. It’s science, not magic, and any science takes work.
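For contrast with topic modeling, here is a minimal sketch of the supervised route, where the researcher fixes the categories in advance and hand-labels some examples. It uses a small multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The labels `offline_action` and `online_action`, the training sentences, and the function names are hypothetical; this is not a reconstruction of the classifier or coding scheme in Hanna's article.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for token in text.lower().split():
            if token in vocab:  # skip words never seen in training
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical hand-coded training posts with researcher-defined categories.
training = [
    ("join the march downtown on friday", "offline_action"),
    ("bring signs to the rally at the square", "offline_action"),
    ("share this post and sign the online petition", "online_action"),
    ("retweet and share the petition link", "online_action"),
]

model = train_naive_bayes(training)
print(classify("sign the petition and share", model))  # online_action
print(classify("join the rally downtown", model))      # offline_action
```

Here the categories are chosen by the researcher and the algorithm only learns to reproduce them, which is why this family of methods suits questions where the categories are already known.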
I, for one, am happy these methods are entering the discourse of sociology in general, and cultural sociology in particular. I think computer-assisted methods will soon eclipse “traditional” (hand-coded) content analysis methods, to the benefit of science. To move forward, the multiple ways of doing automated content analysis should be taught in all methods classes alongside methods like regression analysis, and should certainly be taught in any content analysis class. I believe they should be a regular part of the toolkit of any researcher, as they can be used on interview data, field notes, and open-ended survey questions, as well as other text-based data. If these methods are more widely known and taught, sociology in general will take a huge scientific leap forward.
Until then, those of us using these methods should refresh our memory on the research that demonstrates exactly what each automated text analysis technique is measuring, and choose a technique accordingly.