This is a guest post by Karissa McKelvey. She has a BA in Computer Science and Political Science from Indiana University. After graduating, she worked as a research assistant at the Center for Complex Networks and Systems Research at Indiana University on an NSF grant to analyze and visualize the relationship between social media expressions and political events. She is an active contributor to open source projects and continues to publish in computer-supported cooperative work and computational social science venues. She currently works as a Software Engineer at Continuum Analytics.

Imagine you are a graduate student in some social or behavioral science (not hard, I assume). You want to collect some data: say you’re going to study the fluctuation of product values over time on Craigslist, or ping the Sunlight Foundation’s open government data, or use GDELT to study violent political events. There are a variety of tools you may end up using in your workflow:

  1. Retrieving the data: Python, BeautifulSoup
  2. Storing the data: CSV, JSON, MySQL, MongoDB, bash
  3. Retrieving this stored data: SQL, Hive, Hadoop, Python, Java
  4. Manipulating the data: Python, CSV, R
  5. Running regressions, simulations: R, Python, Stata, Java
  6. Presenting the data: R, Excel, PowerPoint, Word, LaTeX

A research workflow like this requires a variety of tools, some of which you might never have used before, and the number of tools seems to scale with the amount of work you try to accomplish. When you hit a problem in your analysis, or can’t reproduce some regression or simulation you ran, what happened? Where did it break?

Should it really be this difficult? Should we really have to learn ten different tools to do data analysis on large datasets? We can look at the Big Data problem in the same light as surveys and regression models: the largest and most fundamental part of the equation is simply that this stuff is new. High-quality, well-thought-out workflows have yet to be fully developed and stabilized.

What if I told you that you could do all of this with the fantastically large number of open source packages in Python? In your web browser, on your iPad?
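To make that concrete, below is a minimal sketch of all six steps in Python alone. The URL, the CSS classes, and the column names are hypothetical stand-ins for a Craigslist-style scrape, not a recipe for any particular site:

    import requests                    # step 1: retrieving the data
    from bs4 import BeautifulSoup      # step 1: parsing the HTML
    import pandas as pd                # steps 2-4: storing and manipulating
    import statsmodels.api as sm       # step 5: regressions
    import matplotlib.pyplot as plt    # step 6: presenting

    # 1. Retrieve: fetch a (hypothetical) listings page and pull out prices
    html = requests.get("http://example.com/bike-listings").text
    soup = BeautifulSoup(html, "html.parser")
    rows = [(item.find("span", class_="price").text.lstrip("$"),
             item.find("time")["datetime"])
            for item in soup.find_all("p", class_="row")]

    # 2. Store: a CSV file stands in for MySQL/MongoDB here
    df = pd.DataFrame(rows, columns=["price", "posted_at"])
    df["price"] = df["price"].astype(float)
    df.to_csv("listings.csv", index=False)

    # 3-4. Retrieve and manipulate the stored data
    df = pd.read_csv("listings.csv", parse_dates=["posted_at"])
    df["days"] = (df["posted_at"] - df["posted_at"].min()).dt.days

    # 5. Regress price on time to see how value fluctuates
    fit = sm.OLS(df["price"], sm.add_constant(df["days"])).fit()
    print(fit.summary())

    # 6. Present: scatter plot of price over time
    df.plot(x="days", y="price", kind="scatter")
    plt.savefig("price_over_time.png")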

Python and PyData

I just returned from PyData in NYC, a three-day conference with one day of tutorials and two days of full-length talks, all about big questions at the intersection of Python and data.

At the conference, you have your typical budding entrepreneurs, entrenched data scientists, and internal language engineers. Strikingly, there was also a healthy number of amateur Pythonistas, people who wanted to learn more about the language to see if they could make the switch.

I recently wrote up my thoughts about what I will call “legacy tools,” such as R, SPSS, MATLAB, and FORTRAN. These tools, particularly R, are becoming obsolete because they can’t handle your Big Data problem. When using these tools, your data often seems ‘Big’ because it doesn’t fit into memory, or the language was poorly designed, or there are no easy-to-use libraries for integrating with a database. And often, it crashes your computer.

You may be surprised that I am putting R on this list, for R is a well-known, broadly used tool in statistics and data analysis. However, in the context of the social sciences, and my examples above, we are still often forced to take a variety of approaches to collecting, storing, and manipulating data. The process is fragmented, and R is just one piece. Python, on the other hand, can do it all. And quickly. It’s going to be the foundation of the future of this trade.

At PyData, a community of about 500 listened to talks ranging from “Intro to Python Data Analysis in Wakari” to “PyParallel: How we Removed the GIL and Exploited all Cores”, which was about creating a parallelized language toolkit on Windows, in 136 slides times 5 bullet points. It’s clear that the community is disparate and there’s a lot going on, with things to say at many different levels, everything from tutorials to crafting the language itself. A substantial number of people are making Python fast for datasets and simulations of arbitrary size and scale.

The Coolest New Python Data Thing

PyData, PyCon, and SciPy are places where people often reveal the next ‘cool thing’ that’s going to revolutionize the way we do everything, like eat cereal. But realistically, something very important was demoed this weekend.

IPython Notebook is a fantastic tool for doing data analysis in Python. As Philip Guo puts it:

Everything related to my analysis is located in one unified place. Instead of saving dozens or hundreds of code, output, and notes files for an experiment, everything is bundled together in one single file.

These notebooks can be exported to a variety of formats, including HTML and PDF (in fact, a couple of Python books have been written this way!). The IPython Notebook makes it easy to reproduce research, since all of your documentation and code sits inline with the plots, with the ability to do live manipulation if you use Wakari.
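With IPython 1.0 or later installed, the conversion is a one-liner at the command line; the notebook name here is a placeholder, and PDF output additionally requires a LaTeX toolchain:

    ipython nbconvert --to html analysis.ipynb
    ipython nbconvert --to latex analysis.ipynb   # compile the resulting .tex for PDF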

Brian Granger showed up and gave the crowd a tour of the new IPython Notebook feature called “interact.” Interact lets you take a function that produces output and manipulate that function’s parameters in near real-time. It is similar to Mathematica’s Manipulate. Imagine a high school student watching a sine wave change in real time as they move a slider, or a physics researcher analyzing a plot from multiple angles and easily walking an audience through parameter changes. It’s going to change everything in the programming and data analysis space, particularly when it comes to teaching and reproducing results. The interact feature is due for release in January.
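Since the feature hadn’t shipped at the time of the demo, take the following as a rough sketch of the kind of code shown, written against the widgets API roughly as it later appeared in IPython 2.x (the function and parameter names are my own):

    # run inside an IPython Notebook cell
    from IPython.html.widgets import interact   # later moved to the ipywidgets package
    import numpy as np
    import matplotlib.pyplot as plt

    def sine_plot(freq=1.0, amp=1.0):
        x = np.linspace(0, 2 * np.pi, 200)
        plt.plot(x, amp * np.sin(freq * x))
        plt.ylim(-3, 3)
        plt.show()

    # each (min, max) tuple becomes a slider; dragging it re-renders the plot
    interact(sine_plot, freq=(0.5, 5.0), amp=(0.5, 3.0))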

Conclusion

If we want to incorporate large datasets without filtering and corrupting our analysis with assumptions, we need to develop a toolkit that is easy to teach, easy to use, easy to reproduce, and easy to extend. Python is all of these things. We need to understand and estimate the impact of Python, and consider teaching it alongside R, or even in place of it.

The IPython Notebook is a good entry point for making the switch to Python for your data analysis, though learning a new tool is always hard. I recommend starting with the tutorials on Wakari, and if you want to install packages locally, use Anaconda, not pip.
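With Anaconda installed, pulling in the core PyData stack is a single command (the package list below is just the usual suspects, not a prescribed set):

    conda install numpy pandas matplotlib ipython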

The time has come to upgrade our stack, and Python is my recommendation. Maybe I’ll see you at the next PyData.

Editor’s note: For full disclosure, Anaconda and Wakari are both products of the author’s employer, Continuum Analytics. – Alex

  • zmjones

    Python is great but I think this is a mischaracterization of R (you can do most of the things you list pretty well imo) and of what most social scientists do (not “big data”). The proportion of social scientists using Python is vanishingly small (though I’d like that to change as well). Social science has (much) bigger issues than this!

    • Abhijit Dasgupta

      As a long-time R and Python user and the organizer of data-related meetups in DC, I see use cases, advantages and disadvantages for both R and Python (and both together). My current view is that it will take Python about 5 years to match the breadth of R’s ecosystem, but it may take forever for R to match Python’s speed and syntactic ease and integrability into web frameworks. Python (and the IPython-pandas combination) is spectacular for data ingestion and munging, but modeling capabilities leave a bit to be desired still. For example, there is no good survival analysis package in Python (lifelines is good, but doesn’t do Cox regression). As Wes McKinney (pandas’ creator) recently complained, there is no Python-ODBC connector. The pydata ecosystem was, and to some extent still is, driven by the engineering world; numpy, scipy and matplotlib were proposed as replacements for Matlab. The two places pandas helped Python enormously were missing-data handling and heterogeneous data containers (pandas.DataFrame).

      Even though I’ve used R for 15+ years, I have no problem with Python or any other platform becoming better than R in the data-science realm. There is always room for improvement, and use cases and contexts and the market will determine what most people use. Today R wins primarily on the breadth of its ecosystem and the ability to do most things pretty darn well. In 5 years, I think both Python (for integration with the web and production) and Julia (speed, computational efficiency) will truly be viable alternatives. SAS, IBM, Stata and the rest, beware.

  • Trey

    I think this is a somewhat unfair portrayal of R. As a user of both (and as one who has shifted more of my work to Python in recent years), it’s hardly fair to call R a ‘legacy tool’ or to act like Python makes interfacing with databases magically easy. You still have to have drivers, configure the database connection, etc. Python can also suffer from in-memory processing constraints. R has a much more robust package ecosystem for specific data analysis tasks, even though Python is rapidly catching up. Also, as Zach points out, most social scientists aren’t using ‘big data’ for the most part. Finally, this reads a little sales-pitch-y to me.

    • randyzwitch

      To extend your thought…both Python and R rely heavily on calling C, Fortran, LAPACK, BLAS, etc. for performance. To say that either tool is ‘better’ is really arguing for preferring the high-level language wrapper around the lower-level code.

    • karissamck

      From what I see, the future of social science study will leave R behind, just as the field has left Stata behind for R. The Python ecosystem is growing, and connecting to databases such as CouchDB or MongoDB is easier than ever. Literally, it’s downloading the thing, running it, and then three lines of Python to create your first entry.
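      For instance, with MongoDB running locally and pymongo installed, that first entry really is about three lines (the database and field names here are made up):

        import pymongo
        db = pymongo.MongoClient().research   # connects to localhost:27017 by default
        db.listings.insert_one({"title": "road bike", "price": 120})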

      Anaconda and Wakari are both free. Continuum supports many open-source projects and is filled with former academics. I work here because I actually _believe_ that this toolset is easier to use, teach, and work with for data analysis. R is fine, but its syntax has a higher learning curve. I didn’t mean to make this a hit piece on R, or to spark a debate about which martial art is better for chopping wood. I genuinely just want to get the word out about Python and express how exciting it is to see the language maturing!

      I hope the blog post at least gives you a workflow and toolbelt to consider if you use or are considering Python.

      • zmjones

        The social sciences haven’t left Stata (or SPSS, for that matter) behind. R users are still a minority in my experience.

        I do agree that Python has more consistent naming conventions (and the namespace isn’t always a complete disaster), but these things are not terribly concerning to someone still using Stata or SPSS, since they wouldn’t even be aware of them.

  • Trey

    (And by that I mean the bio indicates that you work as a software engineer for Continuum, but doesn’t acknowledge that each of the suggested tools is actually produced by Continuum.)

    • David

      IPython is an open-source tool produced largely by Fernando Perez (UCB) and Brian Granger (Cal Poly San Luis Obispo).

      • Trey

        Yes, but Wakari and Anaconda are not.

        • David

          And there’s an editor’s note explaining that.

          • Trey

            Yes, it wasn’t there when I made my reply. Thanks.

  • The IPython Notebook and the interact feature sound remarkably similar to RStudio with knitr/markdown and Shiny. The former is very widely used throughout the R community, and the latter is about a year old.

    IMHO, I doubt R is doomed quite yet (indeed, I suspect its user base may still be expanding). It’s closer to the truth to say that R and Python are converging in their capabilities. Python is becoming more interactive and has even gained its own ggplot2 clone recently, trying to steal R’s clothes. Meanwhile, R developers look not to Python but to C++ or Hadoop/SQL for speed and power, which is actually a very good idea: let someone else worry about that hard stuff.

    So an R/Python user could probably concentrate his energies on one or the other language these days.

  • Joe Cheng

    I like Python and admire your passion, but also feel compelled to point out that the “interact” feature has been present in RStudio for at least a couple of years now. http://www.rstudio.com/ide/docs/advanced/manipulate (And as you say, it’s been in Mathematica forever.) Indeed, educators have found it quite useful: http://web.warwick.ac.uk/statsdept/user-2011/TalkSlides/Lightening/Pruim.pdf

    For more advanced interactivity, we have Shiny. http://www.rstudio.com/shiny/

    Others have noted the similarity between the IPython Notebook and Sweave/knitr/etc.

    Yes, you can argue that Python has an easier-to-learn syntax, and it certainly has fewer language pitfalls. But I for one feel straitjacketed by its (artificially) limited lambdas; it lacks R’s pervasive vectorization, and it gives you much less flexibility when it comes to nonstandard evaluation (see R’s formula expressions). Not a slam dunk in either direction, IMHO.

    (Disclosure: I work for RStudio.)
