This is a guest post by Karissa McKelvey. She has a BA in Computer Science and Political Science from Indiana University. After graduating, she worked as a research assistant at the Center for Complex Networks and Systems Research at Indiana University on an NSF grant to analyze and visualize the relationship between social media expressions and political events. She is an active contributor to open source projects and continues to publish in computer supported cooperative work and computational social science venues. She currently works as a Software Engineer at Continuum Analytics.
Imagine you are a graduate student of some social or behavioral science (not hard, I assume). You want to collect some data: say I’m going to study the fluctuation of product prices over time on Craigslist, or ping the Sunlight Foundation’s open government data, or use GDELT to study violent political events. There are a variety of tools I may end up using in my workflow:
- Retrieving the data: Python, BeautifulSoup
- Storing the data: CSV, JSON, MySQL, MongoDB, bash
- Retrieving this stored data: SQL, Hive, Hadoop, Python, Java
- Manipulating the data: Python, CSV, R
- Running regressions, simulations: R, Python, Stata, Java
- Presenting the data: R, Excel, PowerPoint, Word, LaTeX
My workflow for doing research now requires a variety of tools, some of which I might have never used before. The number of tools I use seems to scale with the amount of work I try to accomplish. When I encounter a problem in my analysis, or can’t reproduce some regression or simulation I ran, what happened? Where did it break?
Should it really be this difficult? Should I really have to learn 10 different tools to do data analysis on large datasets? We can look at the Big Data problem in a similar light as surveys and regression models. The largest and most fundamental part of the equation is just that this stuff is new – high-quality and well-thought-out workflows have yet to be fully developed and stabilized.
What if I told you that you could do all of this with the fantastically large number of open source packages in Python? In your web browser, on your iPad?
This is a guest post by Sean J. Taylor, a PhD student in Information Systems at NYU’s Stern School of Business.
Last Thursday and Friday I attended the 2nd annual DataGotham conference in New York City. Alex Hanna asked me to write about my experience there for the benefit of those who were unable to attend, so here’s my take on the event.
Thursday evening was a social event in a really sweet rooftop space in Tribeca with an open bar and great food (a dangerous combination for this still-grad-student). Though I spent a lot of the time catching up with old friends, I would describe the evening as “hanging out on Twitter, but in person.” I met no fewer than a dozen people I had only previously known online. I am continually delighted at how awesomeness on Twitter is a reliable indicator of awesomeness in person. Events like DataGotham are often worth it for this reason alone.
UPDATE 2013-10-01: Nate Porter pointed out that the Hacker League page doesn’t let you sign up. For now, use this Google doc.
A lot of folks on Twitter during ASA this year were chatting about the possibility of a hackathon during ASA 2014 in San Francisco. The reasons for having a hackathon, I think, are myriad; here are some of the various “purposes” that I and other members of the computational sociology listserv have considered:
- Incorporate computational methods into social science through teh h4x
- Inspire participants to apply computational methods to common social science problems
- Create an organizational nexus for computational sociology which makes it a vibrant and visible part of the discipline
- Develop and foster social ties that strengthen the field and point to the value of non-traditional venues for collaboration
- Create useful and interesting research products
- Solidify connections among the sub-community of folks in/around sociology who have a set of skills/tools/interests in things computational
- Increase visibility of that sub-community, partly by showcasing what can be done
- Support the claim that sociology has a role to play in computational social science and that computation has a role to play in sociology
- Connect folks already immersed in these skill areas with folks who are around the edges, curious, etc.
- Actually impart some new skills/ideas to folks
- Actually produce something collectively useful
- Lay the foundation for something that could grow in future years at ASA meetings or in ASA in general (e.g., a network of folks working with these tools)
I’m really excited about the prospect of this. Laura Norén, Christopher Weiss, and I have been plotting to make this thing a reality. Right now we’re trying to gauge how many people would come out to such an event.
If you have even a tiny inkling that you might come to the hackathon, sign up at the Hacker League page.
As mentioned in a previous post, Alex Hanna and I had the opportunity to teach last week at the Higher School of Economics’ International Social Network Analysis Summer School in St. Petersburg. While last year’s workshop emphasized smaller social networks, this year’s workshop focused on online networks. For my part, I provided an introductory lecture to social network analysis along with four labs on the subject of R and social network analysis.
The introduction to social network analysis began with a historical overview, followed by an outline of the concepts that constitute a social network. The remaining portions reviewed subgraphs, walks, centrality, and cohesive subgroups, along with major research subjects in the field. Setting aside the substantive interest in networks, the first lab covered basic R usage, objects, and syntax. Admittedly, this material was relatively dry, though necessary to make the most of the network analysis software in R. We followed this introduction to R with an introduction to R’s social network analysis software. This second lab introduced the class to the different network packages within R, reading data, the basic measurements brought up in the introductory lecture, and visualization. The third R SNA lab covered graph-level indices, random graphs, and Conditional Uniform Graph tests. Both the second and third labs were conducted primarily using the igraph package. The fourth and final lab of the course covered exponential random graph modeling. For this lab, we walked through tests for homophily and edgewise-shared-partner effects using data on both our Twitter hashtag (#SNASPb2013) and US political blogs.
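The labs themselves use R and igraph, but the basic measurements from the introductory lecture translate to any language. As a standard-library-only illustration, here is normalized in-degree centrality computed on a tiny invented "retweet" network (who retweeted whom); the edge list and account names are made up for the example:

```python
# Degree centrality on a toy directed network, using only the stdlib.
# Edges and names are invented for illustration.
from collections import defaultdict

edges = [  # (source, target): source retweeted target
    ("ann", "bob"), ("carl", "bob"), ("dana", "bob"),
    ("bob", "ann"), ("carl", "dana"),
]

# Count how many distinct edges point at each node.
in_degree = defaultdict(int)
for _, target in edges:
    in_degree[target] += 1

# Normalized in-degree centrality: in-degree / (n - 1),
# so a node retweeted by everyone else scores 1.0.
nodes = {v for edge in edges for v in edge}
n = len(nodes)
centrality = {v: in_degree[v] / (n - 1) for v in nodes}
print(max(centrality, key=centrality.get))  # prints "bob"
```

In the lab, the igraph equivalent is a one-liner (`degree(g, mode = "in")`); spelling it out by hand just makes clear what the package is computing.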
The slides include scripts that download and read the data used within all lab examples.
I’ve hosted PDFs of all the slides on Google Drive.
The ASA annual meeting starts on Friday, and the program is about 200 pages long. But don’t worry, we’ve got you covered. Here are a few computational sociology events that you should catch, suggested by folks on the computational sociology listserv.
If you know of any more that look interesting, feel free to post them in the comments and I’ll add them to this Google Calendar.