Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.
I’m drawing on some of my previous “tworkshops,” which are meant to take people from zero knowledge to being comfortable with basic analysis of Twitter data, with potential for parallel processing in systems like Hadoop MapReduce.
Let’s start with the basics of what the data look like and how to access it.
Following up on the string of posts about software for network analysis, I recently taught a workshop for PhD students in the social sciences here at Stanford on using Python for network analysis. My session was part of a three-day series of workshops introducing computational social science to students who are looking to get their feet wet. I’m posting a link (here) to the page on my website where you can download the materials I developed to teach the workshop, including commented scripts, sample datasets, and a few slides.
Some brief impressions: I’ve taught stats/methods for grad students before, but this was a different beast. Computational social science and network analysis are attractive areas for many grad students here, but without a ‘canon’ of some type to fall back on, it’s hard to know what to emphasize for students with little background. I ended up focusing more on basic data and control structures in Python, which I thought would be more useful for understanding the way the networkx package handles inputs and outputs. I’m not sure that was the most effective approach, though–at least in terms of conveying why Python is a good choice for network analysis. Next time, I think I’ll try to integrate more substantive examples.
Also, inspired by Ben’s last post–maybe we should put a few network analysis packages to a speed test? I get this question all the time, and I usually just refer to my own anecdotal evidence, but it’s probably worth pitting iGraph, networkx, etc. across platforms against one another in calculating, say, shortest paths in a relatively large network. More on this later…
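As a rough idea of what such a speed test could look like, here is a minimal timing harness in Python. It is only a sketch: the helper names (`random_graph`, `bfs_lengths`, `time_call`) are made up for illustration, the graph is a random toy graph rather than real data, and a plain breadth-first search stands in for the shortest-path routines of packages like networkx or iGraph, which would be swapped in where `bfs_lengths` is called.

```python
import random
import time
from collections import deque


def random_graph(n_nodes, n_edges, seed=0):
    """Build an undirected random graph as a dict of adjacency sets (toy data)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n_nodes)}
    edges = 0
    while edges < n_edges:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v and v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj


def bfs_lengths(adj, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time (in seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


graph = random_graph(2000, 10000)
elapsed = time_call(bfs_lengths, graph, 0)
print(f"single-source shortest paths: {elapsed:.4f} s")
```

Taking the best of several runs is one common way to damp noise from other processes; a fair comparison across packages would also need the same graph loaded into each package's native format before the clock starts.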
P.S. It’s only taken me 3 months to write my first post!
Learning to use software always entails some startup cost. I recently had an exchange with one of my colleagues who is relatively new to social network analysis. He asked about my thoughts on a certain network analysis program and mentioned that “it’s easy to get lost with so many [network analysis] programs out there.” His impression is completely understandable. Social network analysis has become immensely popular in recent years, especially among gifted people capable of writing good software. Indeed, one Wikipedia list broadly describes about 70 social network analysis programs. Each of these programs has its own strengths and weaknesses with regard to what it contributes to the field. Given the wealth of options, which programs are worth the time investment to learn?
If you’re new to network analysis then I’d highly recommend learning the packages in R, perhaps supplemented by Pajek and/or Python packages. Here’s why:
Greetings, everyone. We are delighted to have been invited to author our first Bad Hessians guest post. We are a couple of graduate students in the sociology department at University of North Carolina – Brandon Gorman and Charles Seguin. Our post is about a project we began last year after we noticed that, during the Arab Spring, between January 25th and February 11th 2011, western media completely shifted from describing Hosni Mubarak as a “key US ally” to an “entrenched dictator.” This made us wonder – what structures US media attention to foreign leaders?
I recently discovered Gary Weissman’s excellent post on Grey’s Anatomy Network of Sexual Relations and I felt inspired. For those who haven’t heard of the television show before, Grey’s Anatomy is a widely popular, award-winning prime-time medical drama airing on ABC which has received no shortage of critical acclaim. Meeting conventional medical drama expectations, the show quite regularly features members of its attractive cast “hooking up.” Or so I am told. In an effort to teach medical students some basic social network lessons, Weissman produced a network data set on the show’s sexual contacts between characters. Though I’m not particularly fond of the show and both sexual and fictional networks lie outside my research interests, Weissman’s post served as a remarkable demonstration of network analysis for pedagogical purposes.
Working with right-to-left languages like Arabic in R can be a bit of a headache, especially when mixed with left-to-right languages (like English). Since my research involves a great deal of text analysis of Arabic news articles, I find myself with a lot of headaches. Most text analysis methods require some kind of normalization before diving into the actual analyses. Normalization includes things like removing punctuation, converting words to lowercase, stripping numbers out, and so on. This is essential for any kind of frequency-based analysis so that words such as don’t, Don’t, and dont are not counted as distinct words. After all, when dealing with human-generated text, typos and differences in presentation are bound to occur. Often, normalizing also includes stemming, so that words such as think, thinking, and thinks are all reduced to “think,” as they all represent (basically) the same concept.
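To make the normalization steps above concrete, here is a minimal sketch, written in Python rather than R and using only the standard library. The function names `normalize` and `word_counts` are illustrative, not part of any package. It covers lowercasing and stripping punctuation and digits; stemming is deliberately left out, since it depends on the language and on the stemmer chosen.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase text and drop punctuation, symbols, and digits."""
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        # Keep letters (L*), combining marks (M*), and whitespace; drop the rest.
        if category[0] in ("L", "M") or ch.isspace():
            kept.append(ch)
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join("".join(kept).split())


def word_counts(text):
    """Count word frequencies after normalization."""
    return Counter(normalize(text).split())


counts = word_counts("Don’t, don't and dont 123")
print(counts)
```

In this example the three spellings of “don’t” collapse into the single token `dont`. Because the filter relies on Unicode character categories rather than an ASCII punctuation list, Arabic letters pass through unchanged, as do combining marks such as Arabic diacritics; whether those marks should also be stripped is a separate decision that this sketch does not make.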
This got posted at R-bloggers last night, after the men’s 100 meter Olympic event was over. Markus Gesmann predicted Usain Bolt’s 9.63 second result within 0.05 seconds. Even better, he did it using a simple log-linear model that didn’t control for any other factors.
Check the original article at R-bloggers, which talks more about the progression of faster running times and includes the R code used.
Last month I had the wonderful opportunity to help instruct an intensive eight-day workshop on the subject of social network analysis. Affiliated with the Sociology of Education and Science Laboratory at the Higher School of Economics – Saint Petersburg, the workshop sought to recreate the atmosphere of ICPSR summer courses. This workshop was the first of its kind in Russia to offer social networks training as a summer methods course.
Going to be at ASA? Come hang out with the Bad Hessians!
Friday, August 17, 8 PM.
Euclid Hall Bar & Kitchen, 1317 14th Street
RSVP on the Facebooks
What do we do when we think that a particular set of effects is likely to vary significantly across groups? There seem to be two basic approaches: we can either (a) run separate models for each group or we can (b) pool data across groups and then allow effects to vary through the inclusion of interaction terms (i.e. run a fully-interacted model). In terms of coefficients, the two approaches will ultimately produce equivalent results.* The standard errors, however, are a different story. This inevitably has implications for things like statistical significance, a subject with which sociologists in particular are known to be preoccupied.
A common intuition is that these changes are due to changes in the degrees of freedom resulting from disaggregation. Having recently run into this suggestion in a couple of different places, I decided to make up some data to get a better sense of how standard errors are affected by groupwise disaggregation (i.e. running separate models for each group as opposed to running a single pooled model with a bunch of interaction effects). I was interested in particular in the way in which the expansion and contraction of group-specific standard errors varies depending on differences in group size and error variance. The results of this experiment are shown in the graph below which, in effect, depicts the expansion and contraction of standard errors as a function of the level of groupwise heteroscedasticity. To anticipate the discussion below the break, the main finding here seems to be that, on average, disaggregation has no effect on standard errors in the absence of heteroscedasticity.**
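For readers who want to poke at the comparison themselves, here is a small self-contained sketch in Python. It is not the code behind the graph: the data are simulated, the parameter values (group sizes of 200, error standard deviations of 1 and 3) are arbitrary assumptions, and the `ols` and `simulate` helpers are written from scratch for illustration. It fits (a) separate models per group and (b) a fully-interacted pooled model, so the coefficients and standard errors can be compared directly.

```python
import random


def ols(X, y):
    """Return (coefficients, standard errors) for OLS via the normal equations."""
    n, k = len(X), len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Augment [X'X | X'y | I] and reduce with Gauss-Jordan elimination.
    aug = [XtX[i] + [Xty[i]] + [1.0 if i == j else 0.0 for j in range(k)]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] for i in range(k)]
    inv_diag = [aug[i][k + 1 + i] for i in range(k)]  # diagonal of (X'X)^-1
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance estimate
    se = [(s2 * d) ** 0.5 for d in inv_diag]
    return beta, se


def simulate(n, intercept, slope, sigma, rng):
    """Simulate y = intercept + slope * x + normal error with sd sigma."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


rng = random.Random(42)
x0, y0 = simulate(200, 1.0, 2.0, 1.0, rng)  # group 0
x1, y1 = simulate(200, 0.5, 3.0, 3.0, rng)  # group 1, noisier errors

# (a) Separate models, one per group.
b_g0, se_g0 = ols([[1.0, x] for x in x0], y0)
b_g1, se_g1 = ols([[1.0, x] for x in x1], y1)

# (b) Fully-interacted pooled model: y = b0 + b1*x + b2*g + b3*g*x.
X_pool = [[1.0, x, 0.0, 0.0] for x in x0] + [[1.0, x, 1.0, x] for x in x1]
b_pool, se_pool = ols(X_pool, y0 + y1)

print("group 0 slope:", b_g0[1], "pooled:", b_pool[1])
print("group 1 slope:", b_g1[1], "pooled:", b_pool[1] + b_pool[3])
print("group 0 slope SE:", se_g0[1], "pooled:", se_pool[1])
```

The group-specific coefficients recovered from the pooled model match the separate fits up to floating-point error. The standard errors differ because the pooled model uses a single residual variance estimated across both groups, while each separate model uses its own; with the noisier group 1 in the mix, the pooled standard error for the group 0 slope comes out larger than the one from the group 0 model alone. Setting both error standard deviations to the same value is a quick way to see the homoscedastic case.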