This is a guest post by Laura K. Nelson. She is a doctoral candidate in sociology at the University of California, Berkeley. She is interested in applying automated text analysis techniques to understand how cultures and logics unify political and social movements. Her current research, funded in part by the NSF, examines these cultures and logics via the long-term development of women’s movements in the United States. She can be reached at firstname.lastname@example.org.
Computer-assisted, or automated, text analysis is finally making its way into sociology, as evidenced by the new issue of Poetics devoted to one technique, topic modeling (Poetics 41, 2013). While these methods have been widely used and explored in disciplines like computational linguistics, digital humanities, and, importantly, political science, only recently have sociologists paid attention to them. In my short time using automated text analysis methods I have noticed two recurring issues, both of which I will address in this post. First, when I’ve presented these methods at conferences, and when I’ve seen others present these methods, the same two questions are inevitably asked, and they have indeed come up again in response to this issue (more on this below). If you use these methods, you should have a response. Second, those who are attempting to use these methods are often not aware of the full range of techniques under the automated text analysis umbrella and choose a method based on convenience, not knowledge.
Michael Corey, a former UChicago PhD soc student (and recent guest poster at OrgTheory), asked me to forward this job posting at Facebook.
Quantitative UX Researcher
Menlo Park, CA
Facebook is working to connect the world in a big way. To succeed we need to understand the unique character of each of the world’s communities, what Facebook means or could mean to them, and how best to make our technology work for them. We’re looking for people with strong quantitative research skills to help in this effort. The ideal candidate will be a social scientist with expertise in quantitative research methodologies OR a quantitative specialist with experience solving social problems. They’ll be comfortable improvising and have the ability to work cross-functionally and thrive in a fast-paced organization.
Help shape the research agenda and drive research projects from end-to-end
Collaborate with product teams to define relevant questions about user growth and engagement
Deploy appropriate quantitative methodologies to answer those questions
Develop novel approaches where traditional methods won’t do
Collaborate with qualitative researchers as needed and iterate quickly to generate usable insights for product and business decisions
Deliver insights and recommendations clearly to relevant audiences
Ability to ask, as well as answer, meaningful and impactful questions
Ability to communicate complex analyses and results to any audience
Experience with Unix, Python, and large datasets (> 1TB) a plus
Master’s or Ph.D. in the social sciences (e.g., Psychology, Communication, Sociology, Political Science, Economics), OR in a quantitative field (e.g., Statistics, Informatics, Econometrics) with experience answering social questions
Fluency in data manipulation and analysis (R/SAS/Stata, SQL/Hive)
Expertise in quantitative research methodologies (e.g., survey sampling and design, significance testing, regression modeling, experimental design, behavioral data analysis)
I’m really excited to officially announce the first annual pre-ASA datathon, taking place at Berkeley’s D-Lab on August 15-16, 2014.
The theme is “big cities, big data: big opportunity for computational social science,” the idea being to look at contemporary urban issues, especially housing challenges, using data gathered and made publicly available by cities including San Francisco, New York, Chicago, Austin, Boston, Somerville, and Seattle.
The hacking will start at noon on August 15 and go until the next day. Sleeping is optional. We’ll have a presentation and judging session on the evening of August 16 in San Francisco, exact location TBD.
We’re working with several academic and industry partners to bring together tools and datasets which social scientists can use at the event. So stay tuned as that develops.
You can apply here and see the full call [PDF].
ALSO — Check out the CITASA Symposium the morning of the 15th (citasasymposium.info) before joining us at noon for the Datathon! There’ll be a number of great talks which will complement the hacking over at the D-Lab.
I was pleased to see Fabio Rojas make an open invitation for more female scholars on OrgTheory. Writing for a technically-oriented blog, I’ve been painfully aware of the dearth of female voices expressed here. And as computational social scientists, we should be incredibly wary of the possibility of reproducing many of the same kinds of inequalities that have plagued computer science and tech at large. We see this when “big data isn’t big enough”, as Jen Schradie has put it, when non-dominant voices are shushed in myriad different ways online, and I fear it when all our current contributors are men. Sociology has come a long way in opening up space for more “scholars at the margins” (a term I’m taking from Eric Grollman and his blog Conditionally Accepted), but there’s still a long way to go.
This is, then, an open invitation for anyone to contribute to Bad Hessian, especially women, people of color, queer people, people with disabilities, working-class or poor people, fat people, immigrants, and single parents. Our doors are always open for guest contributors and new regular contributors. Computational social science ought to be as committed as possible to not only bringing computational methods into the social sciences, but also making sure that everyone, especially those at the margins, has a place to speak to and engage with those methods.
2013 was the first full year of Bad Hessian’s existence, so we’re taking stock of what we’ve accomplished in the past year.
We’ve had 37 posts written by the regular crew plus 5 great guest authors.
We’ve had 51,520 unique visits, 39,412 unique visitors, and 70,772 pageviews. Most people are coming from search engines and we’re getting most social media traffic through Twitter.
The five most popular posts of 2013 (written in 2013) were:
- Lipsyncing for your life: a survival analysis of RuPaul’s Drag Race by Alex
- A Final Twitter-based Prediction of RuPaul’s Drag Race Season 5 by Alex
- Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook by Randy Zwitch
- RuPaul’s Drag Race Season 5 Finale — Predicting America’s Next Drag Superstar from Twitter by Alex
- Has R-help gotten meaner over time? And what does Mancur Olson have to say about it? by Trey
It was a great year for us. What does 2014 bring? I can think of a few things that’ll probably come up.
- More stats pedagogy
- More IPython
- More social science hackathons and data events
- More discussions of protest event data
- More drag queens (duh)
And I hope more content in general! Is there anything you’d like to see here in 2014? Let us know!
Last month, Mobilization published a special issue on new methods in social movements research, edited by Neal Caren. I was one of the contributors to the issue, submitting a piece born of my master’s work. The piece applies supervised machine learning to Facebook messages from Egypt’s April 6th Movement in its formative months of 2008, corroborated by interviews with April 6th activists.
With the emergence of the Arab Spring and the Occupy movements, interest in the study of movements that use the Internet and social networking sites has grown exponentially. However, our inability to easily and cheaply analyze the large amount of content these movements produce limits our study of them. This article attempts to address this methodological lacuna by detailing procedures for collecting data from Facebook and presenting a class of computer-aided content analysis methods. I apply one of these methods in the analysis of mobilization patterns of Egypt’s April 6 youth movement. I corroborate the method with in-depth interviews from movement participants. I conclude by discussing the difficulties and pitfalls of using this type of data in content analysis and in using automated methods for coding textual data in multiple languages.
You can find the PDF here.
The issue is full of a lot of other great stuff, including:
Studying Online Activism: The Effects of Sampling Design on Findings, Jennifer Earl
How Repertoires Evolve: The Diffusion of Suicide Protest in the Twentieth Century, Michael Biggs
Contextualizing Consequences: A Sociolegal Approach to Social Movement Consequences in Professional Fields, Elizabeth Chiarello
A Methodology for Frame Dynamics: Analyzing Keying Battles in Palestinian Nationalism, Hank Johnston and Eitan Y. Alimi
The Radicalization of Contention in Northern Ireland, 1968-1972: A Relational Perspective, Gianluca De Fazio
I just wanted to pass along some info on behalf of our pal Craig Tutterow, who has been working hard on socilab, a cool new project which magically transforms your LinkedIn data into network-based social science. In addition to being able to analyze their data online, users can also download their data as a .csv file that can then be read into their favorite network package. According to Craig, future incarnations will also include support for Pajek .net files and Unicode names in the .csv download. If you want more details, check out the announcement that recently went out to the SOCNET listserv. You can also look below the break for a working example.
This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com.
A few months ago I passed the 10-year point in my analytics/predictive modeling career. While ‘Big Data’ and ‘Data Science’ have only become buzzwords in recent years, hitting the limit on computing resources has been something that has plagued me throughout my career. I’ve seen this problem manifest itself in many ways, from having analysts get assigned multiple computers for daily work, to continuously scraping together budget for more processors on a remote SAS server and spending millions on large enterprise databases just to get processing of data below a 24-hour window.
Luckily, advances in open source software & cloud computing have driven down the cost of data processing & analysis immensely. Using IPython Notebook along with Amazon EC2, you can now procure a 32-core, 60GB RAM virtual machine for roughly $0.27/hr (using a spot instance). This tutorial will show you how to set up a cluster instance at Amazon, install Python, set up IPython as a public notebook server and access this remote cluster via your local web browser.
To get started with this tutorial, you need to have an Amazon Web Services account. I also assume that you already have basic experience interacting with computers via the command line and know about IPython. Basically, that you are the average Bad Hessian reader…
This is a guest post by Karissa McKelvey. She has a BA in Computer Science and Political Science from Indiana University. After graduating, she worked as a research assistant at the Center for Complex Networks and Systems Research at Indiana University on an NSF grant to analyze and visualize the relationship between social media expressions and political events. She is an active contributor to open source projects and continues to publish in computer supported cooperative work and computational social science venues. She currently works as a Software Engineer at Continuum Analytics.
Imagine you are a graduate student of some social or behavioral science (not hard, I assume). You want to collect some data: say I’m going to study the fluctuation of value of products over time on Craigslist, or ping the Sunlight Foundation’s open government data, or use GDELT to study violent political events. There are a variety of tools I may end up using for my workflow:
- Retrieving the data: Python, BeautifulSoup
- Storing the data: CSV, JSON, MySQL, MongoDB, bash
- Retrieving this stored data: SQL, Hive, Hadoop, Python, Java
- Manipulating the data: Python, CSV, R
- Running regressions, simulations: R, Python, Stata, Java
- Presenting the data: R, Excel, PowerPoint, Word, LaTeX
My workflow for doing research now requires a variety of tools, some of which I might have never used before. The number of tools I use seems to scale with the amount of work I try to accomplish. When I encounter a problem in my analysis, or can’t reproduce some regression or simulation I ran, what happened? Where did it break?
Should it really be this difficult? Should I really have to learn 10 different tools to do data analysis on large datasets? We can look at the Big Data problem in the same light as surveys and regression models. The largest and most fundamental part of the equation is just that this stuff is new – high-priority and well-thought-out workflows have yet to be fully developed and stabilized.
What if I told you that you could do all of this with the fantastically large number of open source packages in Python? In your web browser, on your iPad?
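To make that concrete, here is a minimal sketch of the retrieve, store, query, and regress steps from the list above, all in one Python script. It is only an illustration, not the workflow this post goes on to describe: the price data is made up, the table name `listings` and the helper functions are hypothetical, and it sticks to the standard library (`csv`, `sqlite3`) rather than any particular scraping or statistics package.

```python
import csv
import io
import sqlite3

# Toy, made-up listing prices (day, price); a stand-in for scraped data.
RAW_CSV = """day,price
1,100.0
2,98.0
3,97.0
4,94.0
5,91.0
"""


def load_rows(text):
    """Retrieve: parse CSV text into (day, price) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["day"]), float(row["price"])) for row in reader]


def store_rows(rows):
    """Store: put the rows in an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE listings (day INTEGER, price REAL)")
    conn.executemany("INSERT INTO listings VALUES (?, ?)", rows)
    conn.commit()
    return conn


def fit_line(points):
    """Analyze: ordinary least squares fit of price = intercept + slope * day."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


conn = store_rows(load_rows(RAW_CSV))
points = conn.execute("SELECT day, price FROM listings ORDER BY day").fetchall()
slope, intercept = fit_line(points)
print(f"price = {intercept:.2f} + {slope:.2f} * day")
```

Because every step lives in the same script, a result that fails to reproduce can be traced through one file instead of across half a dozen tools.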
This is a guest post by Jen Schradie. Jen is a doctoral candidate in the Department of Sociology at the University of California-Berkeley and the Berkeley Center for New Media. She has a master’s degree in sociology from UC Berkeley and a MPA from the Harvard Kennedy School. Using both statistical methods and qualitative fieldwork, her research is at the intersection of social media, social movements and social class. Her broad research agenda is to interrogate digital democracy claims in light of societal and structural differences. Before academia, she directed six documentary films on social movements confronting corporate power. You can find her at www.schradie.com or @schradie on Twitter.
Five years ago, Chris Anderson, editor-in-chief of Wired Magazine, wrote a provocative article entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (2008). He argued that hypothesis testing is no longer necessary with Google’s petabytes of data, which provide all of the answers to how society works. Correlation now “supersedes” causation:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
An easy strawman, Anderson’s piece generated a host of articles in academic journals decrying his claim. The overall consensus, to no surprise, was that the scientific method – i.e. hypothesis testing – is far from over. Most argued as Pigliucci (2009:534) articulated,
But, if we stop looking for models and hypotheses, are we still really doing science? Science, unlike advertising, is not about finding patterns—although that is certainly part of the process—it is about finding explanations for those patterns.
Other analysts focused on the debate around “correlation is not causation.” Some critiqued Anderson on the grounds that spurious correlations can lead you in the wrong direction. Others implicitly pointed to what Box (1976) articulated so well pre-Big Data – that science is an iterative process in which correlation is useful because it can trigger research that uses hypothesis testing.