Last week’s post on the metal collaboration network brought attention largely to the “giant component”–the largest subgraph in a network where all actors have at least one path to all other actors. In large networks, even sparse ones, giant components typically emerge and include the majority of actors in the network. While focusing on the giant component follows conventional practice while analyzing small world networks, perhaps worthwhile information can be inferred from actors outside of the giant component.
A few months ago I started listening to Tomahawk, a band described on Wikipedia as “an experimental alternative metal/alternative rock supergroup.” Beyond the quality of their music, I found myself intrigued by the musical background of their members. In addition to Tomahawk, their other bands include acclaimed groups such as Faith No More, Helmet, the Melvins, Fantômas, and the Jesus Lizard. Mike Patton alone has been affiliated with at least fifteen bands. Continue reading
As mentioned in a previous post, Alex Hanna and I had the opportunity to teach last week at the Higher School of Economic’s International Social Network Analysis Summer School in St. Petersburg. While last year’s workshop emphasized smaller social networks, this year’s workshop focused on online networks. For my part, I provided an introductory lecture to social network analysis along with four labs on the subject of R and social network analysis.
The introduction to social network analysis began with an historical overview, followed by outlining which concepts constitute a social network. The remaining portions review subjects relating to subgraphs, walks, centrality, cohesive subgroups, along with major research subjects in the field. Setting aside the substantive interest in networks, the first lab covered basic R usage, objects, and syntax. Admittedly, this material was relatively dryer, though necessary to make the most of the network analysis software in R. We followed this introduction to R with an introduction to R’s social network analysis software. This second lab introduces the class to the different network packages within R, reading data, basic measurements brought up in the introductory lecture, and visualization. The third R SNA lab was on the subject of graph-level indices, random graphs, and Conditional Uniform Graph tests. Both the second and third labs were conducted primarily using the igraph package. The fourth and final lab of the course was on the subject of exponential random graph modeling. For this lab, we walked through tests for homophily and edgewise-shared partner effects using data on both our Twitter hashtag (#SNASPb2013) as well as US political blogs.
The slides include scripts that download and read the data used within all lab examples.
I’ve hosted PDFs of all the slides on Google Drive.
There have been repeated calls for “space” in many fields of social science (all links are behind paywalls, sorry):
- Demography: (Voss 2007)
- Sociology: (Gieryn 2000)
- Epidemiology: for an early critical review (Jacquez 2000)
- Geography: obviously geographers were into space before it was cool. A couple of pieces I like are Doreen Massey’s book, For Space and O’Sullivan (2006) for a review of GIS.
- Anthropology: the proceedings of a conference including a piece by Clifford Geertz, Senses of Place (1996). Though what I’m writing here has less to do with the space/place debate.
These are nice papers about what the authors think should be new research agendas, bu I think social sciences need to stop calling for space and start “playing” with space. Let me explain…
This idea started when fellow Bad Hessian, Alex Hanna, suggested that I read a paper about spatio-temporal models of crime in Chicago. We are in the same writing group. Alex has suffered through many presentations of a paper I’m writing about crime in Chicago. Chicago? Crime? I mean these have to be related papers, right? So I gave it a quick read:
Seth R. Flaxman, Daniel B. Neill, Alex J. Smola. 2013. Correlates of homicide: New space/time interaction tests for spatiotemporal point processes. Heinz College working paper, available at: http://www.heinz.cmu.edu/faculty-and-research/research/research-details/index.aspx?rid=483
…and it’s a really great paper! Flaxman reviews three standard measures for spatial and temporal independence and then proposes a new measure that can simultaneously test for spatio-temporal dependence. The measures are validated against real crime data from Chicago. On the other hand, it’s also completely useless for my project. I mean, I stuck it in a footnote, but I can’t engage with it in a substantively meaningful way because my paper is about the Modifiable Areal Unit Problem and the good ol’ MAUP is fundamentally about polygons — not points. The MAUP occurs because a given set of points can be aggregated into any number of different polygon units, and the subsequent results of models, bivariate relationships, or even hot spot analysis might change based on the aggregation method.
This means that Flaxman’s approach and my approach are not comparable because they each rest on different assumptions about how to measure distance, social interaction, and spatial dependence. They’re based on different spatial ontologies, if you will. But back to the main argument of this post: could we play around with the models in Flaxman, the models I’m making, plus some other models in order to test some of the implications of our ideas of space? Here are some hypothetical hypotheses….
Isotropy. Isotropy means that effects are the same in every direction. For example, weather models often take into account anisotropy because of prevailing wind direction. As Flaxman mentions at the end of the paper, alternative distance measures like Manahatten distance could be used. I would take it a step further and suggest that distance could be measured across a trend surface which might control for higher crime rates on the south side of Chicago and in the near-west suburbs. Likewise, spatial regression models of polygon data can use polynomial terms to approximate trend surfaces. Do the additional controls for anisotropy improve model fit? Or change parameter estimates?
Spatial discontinuities. A neighborhood model posits — albeit implicitly and sort of wishy-washy — that there could be two locations that are very close as the crow flies, but are subject to dramatically different forces because they are in different polygons. These sharp breaks might really exist, e.g. “the bad side of the tracks”, red-lining, TIFF funding, empowerment zones, rivers, gated suburbs. Or they might not. Point process models usually assume that space is continuous, i.e. that there are no discontinuities. Playing around with alternative models might give us evidence one way or another.
Effect decay. In spatial regression models like I’m using, it’s pretty normal to operationalize spatial effects for contiguous polygons and then set the effect to zero for all higher order neighbors. As in the Flaxman paper, most point models use some sort of kernal function to create effect estimates between points within a given bandwidth. These are both pretty arbitrary choices that make spatial effects too “circular”. For exmple, think of the economic geographies of interstate exchanges in middle America. You’ll see fast food, big box retail, gas stations, car dealerships, hotesls, etc. at alomst every interchange. Certainly there is a spatial pattern here but it’s not circular and it’s not (exponentially, geometrically, or linearly) decaying across distance. Comparisons between our standard models — where decay is constrained to follow parametric forms — and semi-parametric “hot spot” analyses might tell us if our models of spatial effects are too far away from reality.
Ok. Those sound like valid research questions, so why not just do that research and publish some results? As I see it, spatial work in social sciences usually boils down to two main types of writing. First, there are the papers that aren’t terribly interested in the substantive research areas, and are more about developing statistical models or testing a bunch of different models with the same data. Here are some examples of that type:
- (Dormann et al 2007) undertake a herculean task by explicating and producing R code for no less than 13 different spatial models.
- (Hubbard et al 2010) compare GEE to mixed models of neighborhood health outcomes.
- (Tita and Greenbaum 2009) compare a spatial versus a spatio-social network as weighting matrices in spatial regression.
The problem with this approach is that the data is often old, simplified data from well-known example datasets. Or worst yet, it is simulated data with none of the usual problems of missing data, measurement error, and outliers. At best, these papers use over simplified models. For example, there aren’t any control variables for crime even when there is a giant body of literature about the socio-cultural correlates of spatial crime patterns (Flaxman and I are both guilty of this).
The second type of research would be just the opposite: interested in the substantive conclusions and disinterested in the vagaries of spatial models. They might compare hierchical or logistic regressions to the spatial regressions, but very rarely go in depth about all the possible ways of operationalizing the spatial processes they’re studying. And when you think about it, you can’t blame them because journal editors like to see logical arguments for the model assumptions used in a paper – not an admission that we don’t know anything about the process under study and a bunch of different models all with slightly different operationalizations of the spatial process. But here’s the thing: we don’t actually know very much about the spatial processes at work! And we have absolutely no evidence that the spatial processes for, say, crime are also useful in other domains like educational outcomes, voting behavior, factory siting, human pathogens, or communication networks.
Thus, we don’t need more social science papers that do spatial models. We need (many) more social science papers that do multiple, incongruent spatial models on the same substantively rich datasets. I think it’s only by modeling, for example, crime as an isotropic point process, a social network with spatial distance between nodes, and a series of discrete neighborhood polygons can we start to grasp if one set of assumptions about space is more/less accurate and more/less useful. In case you couldn’t tell, I’m a big fan of George Box’s famous quote. This is the slightly longer version:
“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” (Box & Draper 1987, 74)
Good luck, and go play!
[Update, 2013-07-22: I changed the citation to the Flaxman paper, as it is now a working paper in his department at Carneige Mellon University.]
R users know it can be finicky in its requirements and opaque in its error messages. The beginning R user often then happily discovers that a mailing list for dealing with R problems with a large and active user base, R-help, has existed since 1997. Then, the beginning R user wades into the waters, asks a question, and is promptly torn to shreds for inadequate knowledge of statistics and/or the software, for wanting to do something silly, or for the gravest sin of violating the posting guidelines. The R user slinks away, tail between legs, and attempts to find another source of help. Or so the conventional wisdom goes. Late last year, someone on Twitter (I don’t remember who, let me know if it was you) asked if R-help was getting meaner. I decided to collect some evidence and find out.
Our findings are surprising, but I think I have some simple sociological explanations.
We’re in the Final Four now, the actual final four that matters (sorry sports forecasters).
Last week, Coco got the chop, which made sense statistically (she had a huge relative risk AND had been the first queen to have had to lipsync four times) and from a narrative standpoint — Alyssa got eliminated the week before so they didn’t really need to keep Coco around to continue all that drama.
So now the biggest question is who is getting kicked off this week and will leave us with our top three? Before I get to my predictions, I want to point readers to Dilettwat’s analysis, which, while uses no regressions, is still chock full of some interesting statistics about the top four and makes some predictions about who needs to do what in this episode to win.
For my own analysis, it’s looking very close here. Here are the numbers.
1 Jinkx Monsoon 1.2170057 1.5580338 2 Alaska 1.2509423 2.0045457 3 Roxxxy Andrews 3.2466063 3.3926072 4 Detox 5.5899580 1.5527694
Jinkx and Alaska are neck-and-neck in this model, and confidence intervals make pairwise comparisons rather hard to make here. But this order is the same as Homoviper’s Index.
Just to get a sense of how close this is, here is a plot that tracks the relative risks across the last few weeks.
Last week, Coco’s relative risk was incredibly high, the highest it has been. This week, Detox has the only relative risk that is indistinguishable from zero, which makes me think she’s about to go. Which makes sense — she’s the only one in the group who has lipsynced more than once. So all I’m willing to say more confidently is that Detox goes home tonight.
On a totally different note, Jujubee came to Madison on Thursday and I got a chance to tell her about my forecasting efforts when we were taking some pictures…
Then I saw Nate Silver at the Midwest Political Science Association conference in Chicago. Unfortunately, there was not an opportunity for a Kate Silver/Nate Silver photo op. Maybe next time.
I’m scribbling this furiously because I had a busy weekend, but stay tuned for next week’s extra special analysis.
Last week, Alyssa got the boot and Jinkx kept her place. And I totally called it with my first model that accounted for the proportional hazards assumption. I think the model is having a little more success as the season plods on.
Before I get to the predictions for episode 10, there’s two really interesting prospects that may either give this model some more predictive power or become very interesting projects in their own right.
Last week, Alaska took it home with her dangerous performance, while Ivy Winters was sent home after going up against Alyssa Edwards. This is sad on many fronts. First, I love me some Ivy Winters. Second, Jinkx had revealed that she had a crush on Ivy, and the relationship that may have flourished between the two would have been too cute. But lastly, and possibly the worst, both of my models from last week had Ivy on top. Ugh.
What went wrong? Well, this certainly wasn’t Ivy’s challenge. But it’s high time that I started interrogating the models a little further.
Wow, last week’s Drag Race post made the rounds in the stats and Drag Race circles. It was cross-posted to Jezebel and has been getting some pretty high-profile links. A little birdy told me that Ms. Ru herself has read it. I think I can die a happy man knowing that RuPaul has visited Bad Hessian.
Anyhow, last week I tried to count Coco out. I was reading her like the latest AJS. The library is open. But her response to me was simple — girl, please:
(Also this happened. Wig under a wig.)
(both of these gifs courtesy of f%^@yeahdragrace)
Can that win safeguard Coco from getting eliminated? Let’s look at the numbers after the jump.
If you follow me on Twitter, you know that I’m a big fan of RuPaul’s Drag Race. The transformation, the glamour, the sheer eleganza extravanga is something my life needs to interrupt the monotony of grad school. I was able to catch up on nearly four seasons in a little less than a month, and I’ve been watching the current (fifth) season religiously every Monday at Plan B, the gay bar across from my house.
I don’t know if this occurs with other reality shows (this is the first I’ve been taken with), but there is some element of prediction involved in knowing who will come out as the winner. A drag queen we spoke with at Plan B suggested that the length of time each queen appears in the season preview is an indicator, while Homoviper’s “index” is largely based on a more qualitative, hermeneutic analysis. I figured, hey, we could probably build a statistical model to know which factors are the most determinative in winning the competition.