Alex « Bad Hessian

About Alex

Alex Hanna is a PhD candidate in sociology at the University of Wisconsin-Madison. Substantively, I'm interested in social movements, media, and the Middle East. Methodologically, I'm interested in computational social science, textual analysis, and social network analysis. You can find me on Twitter at @alexhanna and on the web at http://alex-hanna.com.

I’m a big fan of Drew Conway‘s Data Science Venn Diagram, in which he outlines the three intersecting spheres of skill that the data scientist needs — hacking skills, math and statistics knowledge, and substantive expertise. I’ve used this idiom in thinking through how to bring more sociologists into using computational methods. This has been a matter of getting them to learn how to hack or see the virtues of hacking even if they don’t have a taste for it themselves.

But what I think the diagram is missing (or what at least gets buried beneath the surface) is knowledge of the processes of data production. This is maybe a subtler point that gets looped in with “substantive expertise,” but I want to draw this line out as explicitly as possible, because I think this is one of data science’s weaker flanks and one of the places where it needs to be strengthened to gain more acceptance within the social sciences.

UPDATE 2013-10-01: Nate Porter pointed out that the Hacker League page doesn’t let you sign up. For now, use this Google doc.

A lot of folks on Twitter during ASA this year were chatting about the possibility of a hackathon during ASA 2014 in San Francisco. The reasons for having a hackathon, I think, are myriad; here are some of the various “purposes” that members of the computational sociology listserv and I have considered:

Incorporate computational methods into social science through teh h4x
Inspire participants to apply computational methods to common social science problems
Create an organizational nexus for computational sociology which makes it a vibrant and visible part of the discipline
Develop and foster social ties that strengthen the field and point to the value of non-traditional venues for collaboration
Create useful and interesting research products.
Solidify connections among sub-community of folks in/around sociology who have a set of skills/tools/interests in things computational
Increase visibility of that sub-community, partly by showcasing what can be done
To support the claim that sociology has a role to play in computational social science and that computation has a role to play in sociology.
Connect folks already immersed in these skill areas with folks who are around the edges, curious, etc.
To actually impart some new skills/ideas to folks.
To actually produce something collectively useful.
To lay foundation for something that could grow in future years at ASA meetings or in ASA in general (e.g., a network of folks working with these tools).

I’m really excited about the prospect of this. Laura Norén, Christopher Weiss, and I have been plotting to make this thing a reality. Right now we’re trying to gauge how many people would come out to such an event.

If you have even a tiny inkling that you might come to the hackathon, sign up at the Hacker League page.

Benjamin Lind and I have spent the last week and a half teaching at the Social Network Analysis Summer School at HSE-St. Petersburg. We’ve had about 30 students coming from as far as South Africa and Sweden, with all levels of skill and many different research interests, and have had the pleasure of teaching with some great instructors from around the world as well. If you are curious for more info, keep reading. You can read the backchannel chatter on the #SNASPb2013 hashtag.

I ran two labs on collecting network data from various Internet sources with Python. The first is a mashup of some of my prior workshops on collecting Twitter data via the API, and drawing network data through user mentions. The second shows how to retrieve network data by crawling blogs.
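For a rough idea of the mention-network step, here is a minimal Python sketch. It is not the code from the labs. It assumes each tweet is a JSON object with the `user.screen_name` and `entities.user_mentions` fields used by Twitter's v1.1 REST API, and the function name `mention_edges` and the sample tweets are made up for illustration.

```python
import json
from collections import Counter


def mention_edges(tweets):
    """Count directed (author -> mentioned user) edges across tweets."""
    edges = Counter()
    for tweet in tweets:
        source = tweet["user"]["screen_name"].lower()
        # Tweets without mentions simply contribute no edges.
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            target = mention["screen_name"].lower()
            if target != source:  # ignore self-mentions
                edges[(source, target)] += 1
    return edges


# Hypothetical sample data, one JSON tweet per line as in a streaming dump.
sample_lines = [
    '{"user": {"screen_name": "alice"}, "entities": {"user_mentions": [{"screen_name": "Bob"}]}}',
    '{"user": {"screen_name": "alice"}, "entities": {"user_mentions": [{"screen_name": "bob"}, {"screen_name": "carol"}]}}',
    '{"user": {"screen_name": "bob"}, "entities": {"user_mentions": []}}',
]
tweets = [json.loads(line) for line in sample_lines]
edges = mention_edges(tweets)

for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

The resulting weighted edge list can be written out as CSV and loaded into whatever network package you prefer.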

Technology-wise, it was my first time using a cloud service (Amazon EC2) and iPython Notebooks for teaching purposes. A few observations on EC2 for teaching: the t1.micro server level is not quite powerful enough to handle ~30 students parsing JSON or running scrapy, so you’ll have to move up to a more powerful instance. I found iPython Notebooks to be great, though: code highlighting and execution, LaTeX typesetting, and Markdown make it a winner in my book.

I also put the code for each lab on GitHub: hse-twitter and hse-scrapy. Would love any contributions to these small scripts, especially the scraping code.

The ASA annual meeting starts on Friday, and the program is about 200 pages long. But don’t worry, we’ve got you covered. Here are a few computational sociology events that you should catch, suggested by folks on the computational sociology listserv.

If you know of any more that look interesting, feel free to post them in the comments and I’ll add them to this Google Calendar.

Seeing Shamus Khan and Phil Kasinitz’s ASA eating guide, I asked a colleague and friend of mine, Grace Nguyen, (a former chef and sometimes New Yorker) to put together a list of good, cheap(er) food options for NYC in preparation for this year’s ASA.

Here’s her compilation. A few more suggestions from her may be forthcoming in the comments. The places are linked to their Yelp pages.

You should also hit up some of these places after the Bad Hessian party, since you’ll be in the Village anyhow.

This is cross-posted from OrgTheory.

Fabio’s earlier post on the academic brain drain prompted some good discussion in the comments about students with computational skills who leave academia for positions in Silicon Valley. Some of the tension in the discussion surrounded whether those students would be better suited for those jobs and how we need those people within the social sciences to handle all the new “big data” that’s coming our way. As someone who’s worked in industry a few times, I don’t exactly think it’s my bag. I’m fairly confident that I’d like to stay within academia. To that end, I want to use this post to think through a few institutional ways that sociology could be changed to be made more amenable to computational social science. By “amenable” I mean trying to incorporate these types of methods and data into the mainstream of sociology research. The exact goals may be a little murky, but a few examples could suffice: publishing big data articles in ASR/AJS or having tenure-track job searches for these types of scholars that are initiated within sociology (and not as a cluster hire or as a search initiated in computer science). I encourage you to add your own below; I’m sure institutional scholars have many, many ideas about this. And I’m sure there are a lot of fiscal realities that make all of this sound slightly utopian or maybe even Pollyannaish. But, taking a cue from Erik Olin Wright, real utopias and so on.

This is also presuming that there’s a critical mass of sociologists that actually want to see the incorporation of computational methods. I know Fabio and Christopher Bail have voiced their support, and there’s that Lazer et al. piece in Science that’s been cited a few hundred times (it’s pretty telling that it was published in a journal like Science), but I don’t know how to gauge this kind of thing outside of my computationally homophilic networks.

I’ve jumped in on the development of the rewrite of TABARI, the automated coding system used to generate GDELT, and the Levant and KEDS projects before it. The new project, PETRARCH, is being spearheaded by the project leader Phil Schrodt and the development led by Friend of Bad Hessian John Beieler. PETRARCH is, hopefully, going to be more modular, written in Python, and have the ability to work in parallel. Oh, and it’s open-source.

One thing that I’ve been working on is the ability to extract features from newswire text that are not related to coding for event type. Right now, I’m working on numerical detection: extracting relevant numbers from the text and, hopefully, tagging each one with the type of number that it is. For instance:

One Palestinian was killed on Sunday in the latest Israeli military operation in the Hamas-run Gaza Strip, medics said.

or, more relevant to my research and the current question at hand:

Hundreds of Palestinians in the Gaza Strip protested the upcoming visit of US President George W. Bush on Tuesday while demanding international pressure on Israel to end a months-old siege.

The question is, do any guidelines exist for converting words like “hundreds” (or “dozens”, “scores”, “several”) into numerical values? I’m not sure how similar coding projects in social movements have handled this. John has suggested the ranges used in the Atrocities Event Data (e.g. “several” = 5-24, “tens” = 50-99). What other strategies are there?
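To make the range idea concrete, here is a minimal Python sketch of a lookup-table approach. It is not PETRARCH code. The two entries in `SOURCE_RANGES` are the ranges quoted above; the entries in `PLACEHOLDER_RANGES` are values I made up purely so the example runs on the sentences above, and they do not come from any codebook. The function name `find_quantifiers` is also just for illustration.

```python
import re

# Ranges quoted in the post (attributed to the Atrocities Event Data).
SOURCE_RANGES = {
    "several": (5, 24),
    "tens": (50, 99),
}

# HYPOTHETICAL placeholder values, not taken from any codebook.
PLACEHOLDER_RANGES = {
    "dozens": (24, 99),
    "hundreds": (200, 999),
}


def find_quantifiers(text, ranges):
    """Return (word, (low, high)) for each vague quantifier found in text."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in ranges) + r")\b",
        re.IGNORECASE,
    )
    return [
        (match.group(1).lower(), ranges[match.group(1).lower()])
        for match in pattern.finditer(text)
    ]


all_ranges = {**SOURCE_RANGES, **PLACEHOLDER_RANGES}
sentence = "Hundreds of Palestinians in the Gaza Strip protested on Tuesday."
print(find_quantifiers(sentence, all_ranges))
```

Keeping the table separate from the matching code means the ranges can be swapped out once there is agreement on which values to use.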

Prompted by a tweet yesterday from Ella Wind, an editor at the great Arab commentary site Jadaliyya, I undertook the task of writing a very quick and dirty converter that takes Arabic or Persian text and converts it to the International Journal of Middle East Studies (IJMES) transliteration system (details here [PDF]). I’ve posted the actual converter here. It’s in very initial stages and I will discuss some of the difficulties of making it more robust below.

It’s nice that the IJMES has an agreed-upon transliteration system; it makes academic work much more legible and minimizes quarrels about translation (hypothetically). For example, حسني مبارك (Hosni Mubarak) is transliterated as ḥusnī mubārak.

Transliterating, however, is a big pain. Many of the transliterated characters fall outside the ASCII character set [A-Za-z0-9] that covers most English text, and have to be drawn from the wider Unicode range (e.g. ḥ). That means a lot of copy-pasta of individual Unicode characters from the character viewers in your OS or some text file that stores them.

When Ella posted the tweet, I thought that programming this would be a piece of cake. How hard would it be to write a character mapping and throw up a PHP interface? Well, it’s not that simple. There are a few problems with this.

1. Most Arabic writing does not include short vowels.

Arabic is a very precise language (I focus the rest of this article on Arabic because I don’t know much about Persian). There are no silent letters, and vowels denote verb form and case. But in most modern Arabic writing, short vowels are not written because readers are expected to know them. For example, compare the opening of al-Faatiha in the Qur’an with vowels:

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

to without them:

بسم الله الرحمن الرحيم

In the Qur’an, all vowels are usually written. But this doesn’t occur in most modern books, signs, and especially newspaper and social media text.
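The relationship between the two spellings is easy to show in code. In Unicode, the short-vowel marks are combining characters (general category `Mn`) that sit after the letter they modify, so dropping them turns voweled text into the everyday unvoweled form. This is a small Python sketch using only the standard library; the function name `strip_marks` is my own, and note that it removes every combining mark, including the shaddah, not just the short vowels.

```python
import unicodedata


def strip_marks(text):
    """Remove all combining marks (short vowels, sukun, shaddah, etc.)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "bismi" with vowels: beh + kasra, seen + sukun, meem + kasra
voweled = "\u0628\u0650\u0633\u0652\u0645\u0650"
unvoweled = strip_marks(voweled)

print(len(voweled), len(unvoweled))  # six code points go in, three come out
```

Going in this direction is trivial. The hard part, discussed next, is going the other way.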

So what does this mean for transliteration? Well, it means that you can’t transliterate words precisely unless the machine knows which word you’re going for. The average Arabic reader will know that بسم should be “bismi” and not “bsm.”

I can suggest two solutions to this problem: either use a robust dictionary that can map words without vowels to their voweled equivalent, or have some kind of rule set that determines which vowels must be inserted into the word. The former seems eminently more plausible than the latter, but even so, given the rules of Arabic grammar, it would be necessary to do some kind of part-of-speech tagging to determine the case endings of words (if you really want to know more about this twisted system, other people have explained this much better than I can). Luckily, most of the time we don’t really care about case endings.

In any case, short vowels are probably the biggest impediment to a fully automated system. The good news is that short vowels are ASCII characters (a, i, u) and can be inserted by the reader.
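Here is what the naive character-mapping approach looks like, as a Python sketch rather than the PHP converter itself. The table `CHAR_MAP` covers only the letters needed for the example and always treats yā’ and alif as long vowels, which is an assumption made for illustration; `naive_transliterate` is a made-up name.

```python
# Partial, illustrative mapping: Arabic letter -> IJMES-style Latin character.
CHAR_MAP = {
    "\u062d": "\u1e25",  # hah  -> h with dot below
    "\u0633": "s",       # seen
    "\u0646": "n",       # noon
    "\u064a": "\u012b",  # yeh  -> long i (assumed always a long vowel here)
    "\u0645": "m",       # meem
    "\u0628": "b",       # beh
    "\u0627": "\u0101",  # alef -> long a
    "\u0631": "r",       # reh
    "\u0643": "k",       # kaf
}


def naive_transliterate(text):
    """Map each character independently; unknown characters pass through."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


# The unvoweled spelling of "Hosni Mubarak" used earlier in the post.
name = "\u062d\u0633\u0646\u064a \u0645\u0628\u0627\u0631\u0643"
result = naive_transliterate(name)
print(result)
```

The output has every consonant and long vowel in place but none of the short vowels of ḥusnī mubārak, which is exactly the gap a reader (or a dictionary) has to fill in.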

2. It is not simple to determine whether certain letters (و and ي) should be long vowels or consonants.

The letters و (wāw) and ي (yā’) play double duty in Arabic. Sometimes they are long vowels and sometimes they are consonants. For instance, in حسني (ḥusnī), yā’ is a long vowel. But in سوريا (Syria, or Sūriyā), it is a consonant. There is probably some logic behind when one of these letters is a long vowel and when it is a consonant. But the point is that the logic isn’t immediately obvious.

3. Handling diphthongs, doubled letters, and recurring constructions.

Here, I am thinking of the definite article ال (al-), diphthongs like وَ (au), and the shaddah ّ which doubles letters. This means there probably has to be a look-ahead function to make sure that these are accounted for. Not the hardest thing to code in, but something to look out for nonetheless.
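As a sketch of what such a look-ahead might look like for the shaddah alone, here is a short Python example. It assumes the shaddah (U+0651) comes immediately after the consonant it doubles, which is not guaranteed in all text, and it does nothing about the definite article or diphthongs. The names `transliterate_with_shaddah` and `DEMO_MAP` are hypothetical.

```python
SHADDAH = "\u0651"

# Tiny illustrative mapping, just enough for the example word.
DEMO_MAP = {
    "\u0645": "m",       # meem
    "\u062d": "\u1e25",  # hah -> h with dot below
    "\u062f": "d",       # dal
}


def transliterate_with_shaddah(text, char_map):
    """Map characters one by one, doubling a letter followed by a shaddah."""
    out = []
    i = 0
    while i < len(text):
        latin = char_map.get(text[i], text[i])
        # Look ahead: a shaddah on the next position doubles this letter.
        if i + 1 < len(text) and text[i + 1] == SHADDAH:
            latin *= 2
            i += 1  # consume the shaddah as well
        out.append(latin)
        i += 1
    return "".join(out)


# meem, hah, meem + shaddah, dal
word = "\u0645\u062d\u0645\u0651\u062f"
doubled = transliterate_with_shaddah(word, DEMO_MAP)
print(doubled)
```

The same index-based loop could be extended to peek at two-character sequences such as the definite article, at the cost of a few more special cases.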

Those are the only things I can think of right now, although I imagine there are more lurking in the shadows that may jump out once one starts working on this. I may continue development on this, at least in an attempt to solve issues 2 and 3. Solving issue 1 is a task that will probably take some more thoughtful consideration.

Tonight is the airing of the final episode of RuPaul’s Drag Race, Season 5. They are doing what they did last season, which is to delay the final crowning until the reunion show. Apparently I wasn’t wrong last week when I said that they tape three different endings to the show, mostly to ward off Twitter leaks by fans in the audience. And apparently the queens themselves don’t know who won until everyone else does, according to Jinkx.

As noted by Ru on the final three episode and the “RuCap,” they encouraged fans to vote by tweeting and by reposting on Facebook. Although I can’t get Facebook data directly, I’m going to look at the Twitter data that I’ve collected. Before delving into the final predictions, I remembered that I have Twitter data from last year’s airing from the Twitter gardenhose. From that, we should be able to get a sense of who had the sway of public opinion on Twitter.

The graph below plots the last week of season 4, between the announcement that the queen would be crowned at the reunion, and the final reunion show. I chose to focus only on mentions of a queen’s Twitter handle, instead of using #TeamWhatever, because there weren’t many counts of those in the gardenhose. The first peak is the final contest show, and the second is the actual crowning.

The case here is rather clear cut — Sharon Needles leads everyone for nearly the whole time period. The raw counts of mentions show no contest there. I’m actually rather surprised that Phi Phi led Chad. Maybe there was another way they showed support for her?

       Keyword Count
sharon_needles  2538
   phiphiohara   877
 chadmichaels1   497
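Counts like these can be produced with a few lines of Python. The sketch below is not the script I used; it assumes you already have the tweet texts as strings, counts each tweet at most once per handle, and uses invented sample tweets. The handles are the three keywords from the table, and `count_handle_mentions` is a made-up name.

```python
import re
from collections import Counter

HANDLES = ["sharon_needles", "phiphiohara", "chadmichaels1"]


def count_handle_mentions(tweet_texts, handles):
    """Count how many tweets mention each handle (case-insensitive)."""
    wanted = {handle.lower() for handle in handles}
    counts = Counter()
    for text in tweet_texts:
        # Collect the distinct @handles in this tweet.
        mentioned = {m.lower() for m in re.findall(r"@(\w+)", text)}
        for handle in mentioned & wanted:
            counts[handle] += 1
    return counts


# Invented sample tweets, for illustration only.
sample_tweets = [
    "So excited for tonight @SHARON_NEEDLES",
    "@sharon_needles or @phiphiohara? hard choice",
    "no mentions here",
]
counts = count_handle_mentions(sample_tweets, HANDLES)

for handle, n in counts.most_common():
    print(handle, n)
```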

Author Archives: Alex

What else does a data scientist need?

ASA 2014 Hackathon – Are you in?

Network data collection workshops from SNA Summer School

Computational sociology panels at #asa13

Awesome food in NYC for #asa13

Bad Hessian #asa13 Party

Converting approximating numerical words to numbers?

Challenges of automated Arabic/Persian transliteration

A Final Twitter-based Prediction of RuPaul’s Drag Race Season 5