There have been repeated calls for “space” in many fields of social science (all links are behind paywalls, sorry):
- Demography: (Voss 2007)
- Sociology: (Gieryn 2000)
- Epidemiology: for an early critical review (Jacquez 2000)
- Geography: obviously geographers were into space before it was cool. A couple of pieces I like are Doreen Massey’s book, For Space and O’Sullivan (2006) for a review of GIS.
- Anthropology: the proceedings of a conference including a piece by Clifford Geertz, Senses of Place (1996). Though what I’m writing here has less to do with the space/place debate.
These are nice papers about what the authors think should be new research agendas, but I think social sciences need to stop calling for space and start “playing” with space. Let me explain…
This idea started when fellow Bad Hessian, Alex Hanna, suggested that I read a paper about spatio-temporal models of crime in Chicago. We are in the same writing group. Alex has suffered through many presentations of a paper I’m writing about crime in Chicago. Chicago? Crime? I mean these have to be related papers, right? So I gave it a quick read:
Seth R. Flaxman, Daniel B. Neill, Alex J. Smola. 2013. Correlates of homicide: New space/time interaction tests for spatiotemporal point processes. Heinz College working paper, available at: http://www.heinz.cmu.edu/faculty-and-research/research/research-details/index.aspx?rid=483
…and it’s a really great paper! Flaxman reviews three standard measures for spatial and temporal independence and then proposes a new measure that can simultaneously test for spatio-temporal dependence. The measures are validated against real crime data from Chicago. On the other hand, it’s also completely useless for my project. I mean, I stuck it in a footnote, but I can’t engage with it in a substantively meaningful way because my paper is about the Modifiable Areal Unit Problem and the good ol’ MAUP is fundamentally about polygons — not points. The MAUP occurs because a given set of points can be aggregated into any number of different polygon units, and the subsequent results of models, bivariate relationships, or even hot spot analysis might change based on the aggregation method.
This means that Flaxman’s approach and my approach are not comparable because they each rest on different assumptions about how to measure distance, social interaction, and spatial dependence. They’re based on different spatial ontologies, if you will. But back to the main argument of this post: could we play around with the models in Flaxman, the models I’m making, plus some other models in order to test some of the implications of our ideas of space? Here are some hypothetical hypotheses….
Isotropy. Isotropy means that effects are the same in every direction. For example, weather models often take into account anisotropy because of prevailing wind direction. As Flaxman mentions at the end of the paper, alternative distance measures like Manhattan distance could be used. I would take it a step further and suggest that distance could be measured across a trend surface, which might control for higher crime rates on the south side of Chicago and in the near-west suburbs. Likewise, spatial regression models of polygon data can use polynomial terms to approximate trend surfaces. Do the additional controls for anisotropy improve model fit? Or change parameter estimates?
Spatial discontinuities. A neighborhood model posits — albeit implicitly and in a sort of wishy-washy way — that there could be two locations that are very close as the crow flies, but are subject to dramatically different forces because they are in different polygons. These sharp breaks might really exist, e.g. “the bad side of the tracks”, red-lining, TIF funding, empowerment zones, rivers, gated suburbs. Or they might not. Point process models usually assume that space is continuous, i.e. that there are no discontinuities. Playing around with alternative models might give us evidence one way or another.
Effect decay. In spatial regression models like I’m using, it’s pretty normal to operationalize spatial effects for contiguous polygons and then set the effect to zero for all higher order neighbors. As in the Flaxman paper, most point models use some sort of kernel function to create effect estimates between points within a given bandwidth. These are both pretty arbitrary choices that make spatial effects too “circular”. For example, think of the economic geographies of interstate exchanges in middle America. You’ll see fast food, big box retail, gas stations, car dealerships, hotels, etc. at almost every interchange. Certainly there is a spatial pattern here, but it’s not circular and it’s not (exponentially, geometrically, or linearly) decaying across distance. Comparisons between our standard models — where decay is constrained to follow parametric forms — and semi-parametric “hot spot” analyses might tell us if our models of spatial effects are too far away from reality.
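To make the contrast concrete, here is a minimal sketch of the two weighting schemes described above — a contiguity-style cutoff and a Gaussian kernel. This is illustrative only, not code from either paper, and the cutoff and bandwidth values are arbitrary:

```python
import math

def contiguity_weight(d, cutoff=1.0):
    """Polygon-style weights: full effect for contiguous neighbors,
    zero for all higher-order neighbors."""
    return 1.0 if d <= cutoff else 0.0

def gaussian_kernel_weight(d, bandwidth=1.0):
    """Point-process-style weights: smooth decay with distance,
    shaped entirely by the choice of bandwidth."""
    return math.exp(-(d ** 2) / (2 * bandwidth ** 2))

# Both schemes treat decay as a function of distance alone,
# i.e. both are "circular" in exactly the sense criticized above.
for d in (0.5, 1.0, 2.0, 5.0):
    print(d, contiguity_weight(d), round(gaussian_kernel_weight(d), 4))
```

Swapping either function for a semi-parametric surface estimated from the data would be one way to test how far these parametric forms are from reality.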
Ok. Those sound like valid research questions, so why not just do that research and publish some results? As I see it, spatial work in social sciences usually boils down to two main types of writing. First, there are the papers that aren’t terribly interested in the substantive research areas, and are more about developing statistical models or testing a bunch of different models with the same data. Here are some examples of that type:
- (Dormann et al 2007) undertake a herculean task by explicating and producing R code for no less than 13 different spatial models.
- (Hubbard et al 2010) compare GEE to mixed models of neighborhood health outcomes.
- (Tita and Greenbaum 2009) compare a spatial versus a spatio-social network as weighting matrices in spatial regression.
The problem with this approach is that the data are often old, simplified data from well-known example datasets. Or worse yet, they are simulated data with none of the usual problems of missing data, measurement error, and outliers. At best, these papers use oversimplified models. For example, there aren’t any control variables for crime even though there is a giant body of literature about the socio-cultural correlates of spatial crime patterns (Flaxman and I are both guilty of this).
The second type of research would be just the opposite: interested in the substantive conclusions and disinterested in the vagaries of spatial models. They might compare hierarchical or logistic regressions to the spatial regressions, but very rarely go in depth about all the possible ways of operationalizing the spatial processes they’re studying. And when you think about it, you can’t blame them, because journal editors like to see logical arguments for the model assumptions used in a paper – not an admission that we don’t know anything about the process under study and a bunch of different models all with slightly different operationalizations of the spatial process. But here’s the thing: we don’t actually know very much about the spatial processes at work! And we have absolutely no evidence that the spatial processes for, say, crime are also useful in other domains like educational outcomes, voting behavior, factory siting, human pathogens, or communication networks.
Thus, we don’t need more social science papers that do spatial models. We need (many) more social science papers that do multiple, incongruent spatial models on the same substantively rich datasets. I think it’s only by modeling, for example, crime as an isotropic point process, a social network with spatial distance between nodes, and a series of discrete neighborhood polygons that we can start to grasp whether one set of assumptions about space is more/less accurate and more/less useful. In case you couldn’t tell, I’m a big fan of George Box’s famous quote. This is the slightly longer version:
“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” (Box & Draper 1987, 74)
Good luck, and go play!
[Update, 2013-07-22: I changed the citation to the Flaxman paper, as it is now a working paper in his department at Carnegie Mellon University.]
This is cross-posted from OrgTheory.
Fabio’s earlier post on the academic brain drain prompted some good discussion in the comments about students who have computational skills who leave academia for positions in Silicon Valley. Some of the tension in the discussion surrounded whether those students would be better suited for those jobs and how we need those people within the social sciences to handle all the new “big data” that’s coming our way. As someone who’s worked in industry a few times, I don’t exactly think it’s my bag. I’m fairly confident that I’d like to stay within academia. To that end, I want to use this post to think through a few institutional ways that sociology could be changed to be made more amenable to computational social science. By “amenable” I mean trying to incorporate the types of methods and data into the mainstream of sociology research. The exact goals may be a little murky, but a few examples could suffice: publishing big data articles in ASR/AJS or having tenure-track job searches for these types of scholars that are initiated within sociology (and not as a cluster hire or as a search initiated in computer science). I encourage you to add your own below; I’m sure institutional scholars have many, many ideas about this. And I’m sure there are a lot of fiscal realities that make all of this sound slightly utopian or maybe even Pollyannaish. But, taking a cue from Erik Olin Wright, real utopias and so on.
This is also presuming that there’s a critical mass of sociologists that actually want to see the incorporation of computational methods. I know Fabio and Christopher Bail have voiced their support, and there’s that Lazer et al. piece in Science that’s been cited a few hundred times (it’s pretty telling that it was published in a journal like Science), but I don’t know how to gauge this kind of thing outside of my computationally homophilic networks.
I’ve jumped in on the development of the rewrite of TABARI, the automated coding system used to generate GDELT, and the Levant and KEDS projects before it. The new project, PETRARCH, is being spearheaded by the project leader Phil Schrodt and the development led by Friend of Bad Hessian John Beieler. PETRARCH is, hopefully, going to be more modular, written in Python, and have the ability to work in parallel. Oh, and it’s open-source.
One thing that I’ve been working on is the ability to extract features from newswire text that is not related to coding for event type. Right now, I’m working on numerical detection — extracting relevant numbers from the text and, hopefully, tagging it with the type of number that it is. For instance:
One Palestinian was killed on Sunday in the latest Israeli military operation in the Hamas-run Gaza Strip, medics said.
or, more relevant to my research and the current question at hand:
Hundreds of Palestinians in the Gaza Strip protested the upcoming visit of US President George W. Bush on Tuesday while demanding international pressure on Israel to end a months-old siege.
The question is, do any guidelines exist for converting words like “hundreds” (or “dozens”, “scores”, “several”) into numerical values? I’m not sure how similar coding projects in social movements have handled this. John has suggested the ranges used in the Atrocities Event Data (e.g. “several” = 5-24, “tens” = 50-99). What other strategies are there?
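Whatever ranges end up being the standard, the implementation side is simple. A minimal sketch in Python, using the Atrocities Event Data ranges mentioned above for “several” and “tens”; the other entries are my own illustrative guesses, not an established coding scheme:

```python
# Map vague quantity words to (low, high) numeric ranges.
# "several" and "tens" follow the Atrocities Event Data;
# the rest are placeholder guesses for illustration.
QUANTITY_RANGES = {
    "several": (5, 24),
    "tens": (50, 99),
    "dozens": (24, 99),
    "scores": (40, 99),
    "hundreds": (100, 999),
    "thousands": (1000, 9999),
}

def estimate_count(word):
    """Return a (low, high) range for a vague quantity word, or None."""
    return QUANTITY_RANGES.get(word.lower())

print(estimate_count("Hundreds"))  # (100, 999)
```

The hard part, of course, is agreeing on the ranges themselves, not writing the lookup.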
Prompted by a tweet yesterday from Ella Wind, an editor at the great Arab commentary site Jadaliyya, I undertook the task of writing a very quick and dirty converter that takes Arabic or Persian text and converts it to the International Journal of Middle East Studies (IJMES) transliteration system (details here [PDF]). I’ve posted the actual converter here. It’s in very initial stages and I will discuss some of the difficulties of making it more robust below.
It’s nice that the IJMES has an agreed upon transliteration system; it makes academic work much more legible and minimizes quarrels about translation (hypothetically). For example, حسني مبارك (Hosni Mubarak) is transliterated as ḥusnī mubārak.
Transliterating, however, is a big pain. The transliteration characters are not all in the ASCII character set [A-Za-z0-9] used by English and other Western languages; many of them are drawn from elsewhere in Unicode (e.g. ḥ). That means a lot of copy-pasta of individual Unicode characters from the character viewers in your OS or some text file that stores them.
When Ella posted the tweet, I thought that programming this would be a piece of cake. How hard would it be to write a character mapping and throw up a PHP interface? Well, it’s not that simple. There are a few problems with this.
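To see why, here is roughly what the naive version looks like, sketched in Python rather than PHP. The mapping is a toy subset of the IJMES consonant table, not a full converter:

```python
# Toy subset of the IJMES character mapping; a real converter
# would cover the full table.
NAIVE_MAP = {
    "ب": "b", "ت": "t", "ح": "ḥ", "س": "s", "ك": "k",
    "م": "m", "ن": "n", "ر": "r", "ا": "ā", "و": "w", "ي": "y",
}

def naive_transliterate(text):
    """Map each Arabic character independently, passing through
    anything not in the table."""
    return "".join(NAIVE_MAP.get(ch, ch) for ch in text)

# With no short vowels in the input, حسني comes out as "ḥsny"
# rather than the intended "ḥusnī": problems 1 and 2 below.
print(naive_transliterate("حسني"))  # ḥsny
```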
1. Most Arabic writing does not include short vowels.
Arabic is a very precise language (I focus the rest of this article on Arabic because I don’t know much about Persian). There are no silent letters, and vowels denote verb form and case. But in most modern Arabic writing, short vowels are not written in because readers are expected to know them. For example, compare the opening of al-Faatiha in the Qur’an with vowels:
بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
to without them:
بسم الله الرحمن الرحيم
In the Qur’an, all vowels are usually written. But this doesn’t occur in most modern books, signs, and especially newspaper and social media text.
So what does this mean for transliteration? Well, it means that you can’t transliterate words precisely unless the machine knows which word you’re going for. The average Arabic reader will know that بسم should be “bismi” and not “bsm.”
I can suggest two solutions to this problem: either use a robust dictionary that can map words without vowels to their voweled equivalent, or have some kind of rule set that determines which vowels must be inserted into the word. The former seems eminently more plausible than the latter, but even so, given the rules of Arabic grammar, it would be necessary to do some kind of part-of-speech tagging to determine the case endings of words (if you really want to know more about this twisted system, other people have explained this much better than I can). Luckily, most of the time we don’t really care about case endings.
In any case, short vowels are probably the biggest impediment to a fully automated system. The good news is that short vowels are ASCII characters (a, i, u) and can be inserted by the reader.
2. It is not simple to determine whether certain letters (و and ي) should be long vowels or consonants.
The letters و (wāw) and ي (yā’) play double duty in Arabic. Sometimes they are long vowels and sometimes they are consonants. For instance, in حسني (ḥusnī), yā’ is a long vowel. But in سوريا (Syria, or Sūriyā), it is a consonant. There is probably some logic behind when one of these letters is a long vowel and when it is a consonant. But the point is that the logic isn’t immediately obvious.
3. Handling diphthongs, doubled letters, and recurring constructions.
Here, I am thinking of the definite article ال (al-), diphthongs like وَ (aw), and the shaddah ّ which doubles letters. This means there probably has to be a look-ahead function to make sure that these are accounted for. Not the hardest thing to code in, but something to look out for nonetheless.
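For the shaddah at least, a one-character look-behind on the output is enough. A rough sketch (an assumed helper, not code from the actual converter):

```python
SHADDAH = "\u0651"  # the shaddah diacritic, which doubles the preceding consonant

def transliterate_with_shaddah(text, char_map):
    """Transliterate character by character, doubling the previous
    output whenever a shaddah follows it."""
    out = []
    for ch in text:
        if ch == SHADDAH and out:
            out.append(out[-1])  # repeat the last transliterated consonant
        else:
            out.append(char_map.get(ch, ch))
    return "".join(out)

# e.g. محمّد with a toy consonant map (short vowels still missing):
word = "\u0645\u062d\u0645\u0651\u062f"  # محمّد
print(transliterate_with_shaddah(word, {"م": "m", "ح": "ḥ", "د": "d"}))  # mḥmmd
```

The definite article and the diphthongs would need a genuine look-ahead, but the shape of the loop is the same.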
Those are the only things I can think of right now, although I imagine there are more lurking in the shadows that may jump out once one starts working on this. I may continue development on this, at least in an attempt to solve issues 2 and 3. Solving issue 1 is a task that will probably take some more thoughtful consideration.
It’s pretty apparent that race is a contentious topic in the sports media. I decided to explore popular perceptions of differential treatment of white and non-white quarterbacks in the NFL and algorithmically analyzed more than 36,000 articles from ESPN.com published over the past 17 months.
Tonight is the airing of the final episode of RuPaul’s Drag Race, Season 5. They are doing what they did last season, which is to delay the final crowning until the reunion show. Apparently I wasn’t wrong last week when I said that they tape three different endings to the show, mostly to ward off Twitter leaks by fans in the audience. And apparently the queens themselves don’t know who won until everyone else does, according to Jinkx.
As noted by Ru on the final three episode and the “RuCap,” they encouraged fans to vote by tweeting and by reposting on Facebook. Although I can’t get Facebook data directly, I’m going to look at the Twitter data that I’ve collected. Before delving into the final predictions, I remembered that I have Twitter data from last year’s airing from the Twitter gardenhose. From that, we should be able to get a sense of who had the sway of public opinion on Twitter.
The graph below plots the last week of season 4, between the announcement that the queen would be crowned at the reunion, and the final reunion show. I chose to focus only on mentions of a queen’s Twitter handle, instead of using #TeamWhatever, because there weren’t many counts of those in the gardenhose. The first peak is the final contest show, and the second is the actual crowning.
The case here is rather clear cut — Sharon Needles leads everyone for nearly the whole time period. The raw counts of mentions show no contest there. I’m actually rather surprised that Phi Phi led Chad. Maybe there was another way they showed support for her?
| Keyword | Count |
| --- | --- |
| sharon_needles | 2538 |
| phiphiohara | 877 |
| chadmichaels1 | 497 |
R users know it can be finicky in its requirements and opaque in its error messages. The beginning R user often then happily discovers R-help, a mailing list for dealing with R problems with a large and active user base, which has existed since 1997. Then, the beginning R user wades into the waters, asks a question, and is promptly torn to shreds for inadequate knowledge of statistics and/or the software, for wanting to do something silly, or for the gravest sin of violating the posting guidelines. The R user slinks away, tail between legs, and attempts to find another source of help. Or so the conventional wisdom goes. Late last year, someone on Twitter (I don’t remember who, let me know if it was you) asked if R-help was getting meaner. I decided to collect some evidence and find out.
Our findings are surprising, but I think I have some simple sociological explanations.
We’re down to the final episode. This one is for all the marbles. Wait, that’s not the best saying in this context. In any case, moving right along. In the top four episode, Detox was eliminated, but not before Roxxxy threw maybe ALL of the shade towards Jinkx (although, to Roxxxy’s credit, she says a lot of this was due to editing).
Jinkx, however, defended herself well by absolutely killing the lipsync. Probably one of the top three of the season, easy.
Getting down to the wire, it’s looking incredibly close. As it is, the model has ceased to tell us anything of value. Here are the rankings:
    1  Alaska          0.6050052  1.6752789
    2  Roxxxy Andrews  2.5749070  3.6076899
    3  Jinkx Monsoon   3.4666713  3.2207345
But looking at the confidence intervals, all three estimates are statistically indistinguishable from zero. The remaining girls don’t have sufficient variation on the variables of interest to differentiate them from each other in terms of winning this thing.
So what’s a drag race forecaster to do? Well, the first thought that came to my mind was — MOAR DATA. And hunty, there’s one place where I’ve got data by the troves — Twitter.