This weekend, I made it out to Penn State to participate in the GDELT hackathon, sponsored by the Big Data Social Science IGERT and held in the punnily-named Databasement. The hackathon brought together a lot of different groups — political scientists, industry contractors, computer and information scientists, geographers, and — of course — sociologists (I was one of two).
GDELT, as you mayremember, a political events database with nearly 225 million events from 1979 to the present. Hackathon attendees had interests ranging from optimizing and normalizing the database, predicting violent conflict, and improving event data in general.
I’m a big fan of Drew Conway‘s Data Science Venn Diagram, in which he outlines the three intersecting spheres of skill that the data scientist needs — hacking skills, math and statistics knowledge, and substantive expertise. I’ve used this idiom in thinking through how to bring more sociologists into using computational methods. This has been a matter of getting them to learn how to hack or see the virtues of hacking even if they don’t have a taste for it themselves.
But what I think the diagram is missing — or it’s at least gets buried underneath the surface — is knowledge of the processes of data production. This is maybe a subtler point which I think gets looped in with “substantive expertise” but I want to draw this line out to be as explicit as possible because I think this is one of data science’s weaker flanks and one of the places where it needs to be strengthened to gain more acceptance within the social sciences.
Last week’s post on the metal collaboration network brought attention largely to the “giant component”–the largest subgraph in a network where all actors have at least one path to all other actors. In large networks, even sparse ones, giant components typically emerge and include the majority of actors in the network. While focusing on the giant component follows conventional practice while analyzing small world networks, perhaps worthwhile information can be inferred from actors outside of the giant component.
A couple of months ago a friend directed me to a piece by the New Yorker which included a nice interactive map depicting the landscape of craft brewing in the United States based on data provided by the Brewer’s Association. Using this data, what can we say about the geography of craft brewing in the United States?
To get started, let’s look at the distribution of craft breweries at the state level:
There are a couple of points to note here. First, I’ve omitted Alaska and Hawaii. Disconnected observations can become a bit of a headache when working with some of the methods described below. There are ways around this, but for the purposes of this post I’d prefer to set these issues aside. Second, instead of dealing with raw counts, I’ve taken the log of the number of craft breweries per 500,000 people. Finally, I’ve used Jenks classification to sort states into five categories, each of which is associated with a color determined by the ColorBrewer routine implemented via the brewer.pal command include as part of R’s RColorBrewer library.
Just looking at the map, we can see clusters of high-craft-brewing states in New England and the Pacific Northwest, as well as a large cluster of low-craft-brewing states in the South. We can begin to quantify these patterns by using a set of exploratory techniques designed to capture the extent to which observations which share similar values on a given outcome also tend to share a similar spatial location.
This is a guest post by Sean J. Taylor, a PhD student in Information Systems at NYU’s Stern School of Business.
Last Thursday and Friday I attended the 2nd annual DataGotham conference in New York City. Alex Hanna asked me to write about my experience there for the benefit of those who were unable to attend, so here’s my take on the event.
Thursday evening was a social event in a really sweet rooftop space in Tribeca with an open bar and great food (a dangerous combination for this still-grad-student). Though I spent a lot of the time catching up with old friends, I would describe the evening as “hanging out on Twitter, but in person.” I met no fewer than a dozen people I had only previously known online. I am continually delighted at how awesomeness on Twitter is a reliable indicator of awesomeness in-person. Events like DataGotham are often worth it for this reason alone.
A few months ago I started listening to Tomahawk, a band described on Wikipedia as “an experimental alternative metal/alternative rock supergroup.” Beyond the quality of their music, I found myself intrigued by the musical background of their members. In addition to Tomahawk, their other bands include acclaimed groups such as Faith No More, Helmet, the Melvins, Fantômas, and the Jesus Lizard. Mike Patton alone has been affiliated with at least fifteen bands. Continue reading →
UPDATE 2013-10-01: Nate Porter pointed out that the Hacker League page doesn’t let you sign up. For now, use this Google doc.
A lot of folks on Twitter during ASA this year were chatting about the possibility of a hackathon during ASA 2014 in San Francisco. The reasons for having a hackathon, I think, are myriad; here are some of the various “purposes” that myself and members of the computational sociology listserv have considered:
Incorporate computational methods into social science through teh h4x
Inspire participants to apply computational methods to common social science problems
Create an organizational nexus for computational sociology which makes it a vibrant and visible part of the discipline
Develop and foster social ties that strengthen the field and point to the value of non-traditional venues for collaboration
Create useful and interesting research products.
Solidify connections among sub-community of folks in/around sociology who have a set of skills/tools/interests in things computational
Increase visibility of that sub-community, partly by showcasing what can be done
To support claim that sociology has a role to play in computational social science and that computation has a role to play in sociology.
Connect folks already immersed in these skill areas with folks who are around the edges, curious, etc.
To actually impart some new skills/ideas to folks.
To actually produce something collectively useful.
To lay foundation for something that could grow in future years at ASA meetings or in ASA in general (e.g., a network of folks working with these tools).
I’m really excited about the prospect of this. Laura Norén, Christopher Weiss, and I have been plotting to make this thing a reality. Right now we’re trying to gauge how many people would come out to such an event.
If you have even a tiny inkling that you might come to the hackathon, sign up at the Hacker League page.
As mentioned in a previous post, Alex Hanna and I had the opportunity to teach last week at the Higher School of Economic’s International Social Network Analysis Summer School in St. Petersburg. While last year’s workshop emphasized smaller social networks, this year’s workshop focused on online networks. For my part, I provided an introductory lecture to social network analysis along with four labs on the subject of R and social network analysis.
The introduction to social network analysis began with an historical overview, followed by outlining which concepts constitute a social network. The remaining portions review subjects relating to subgraphs, walks, centrality, cohesive subgroups, along with major research subjects in the field. Setting aside the substantive interest in networks, the first lab covered basic R usage, objects, and syntax. Admittedly, this material was relatively dryer, though necessary to make the most of the network analysis software in R. We followed this introduction to R with an introduction to R’s social network analysis software. This second lab introduces the class to the different network packages within R, reading data, basic measurements brought up in the introductory lecture, and visualization. The third R SNA lab was on the subject of graph-level indices, random graphs, and Conditional Uniform Graph tests. Both the second and third labs were conducted primarily using the igraph package. The fourth and final lab of the course was on the subject of exponential random graph modeling. For this lab, we walked through tests for homophily and edgewise-shared partner effects using data on both our Twitter hashtag (#SNASPb2013) as well as US political blogs.
The slides include scripts that download and read the data used within all lab examples.
Benjamin Lind and I have spent the last week and a half teaching at the Social Network Analysis Summer School at HSE-St. Petersburg. We’ve had about 30 students coming from as far as South Africa and Sweden, with all levels of skill and many different research interests, and have had the pleasure of teaching with some great instructors from around the world as well. You can read the backchannel chatter the #SNASPb2013 hashtag
I ran two labs on collecting network data from various Internet sources with Python. The first is a mashup of some of my prior workshops on collecting Twitter data via the API, and drawing network data through user mentions. The second shows how to retrieve network data by crawling blogs.
Technology-wise, it was my first time using a cloud service (Amazon EC2) and iPython Notebooks for teaching purposes. A few observations into EC2 for teaching: the t1.micro server level is not quite powerful enough to handle ~30 students running parsing of JSON or scrapy. So you’ll have to up the juice, otherwise. I found iPython Notebooks to be great, though — code highlighting and execution, LaTeX typesetting, and Markdown makes it a winner in my book.
I also put the code for each lab on GitHub: hse-twitter and hse-scrapy. Would love any contributions to these small scripts, especially the scraping code.
The ASA annual meeting starts on Friday, and the program is about 200 pages long. But don’t worry, we’ve got you covered. Here’s a few computational sociology events that you should catch, suggested by folks on the computational sociology listserv.
If you know of any more that look interesting, feel free to post them in the comments and I’ll add them to this Google Calendar.