This is a guest post by Karissa McKelvey. She has a BA in Computer Science and Political Science from Indiana University. After graduating, she worked as a research assistant at the Center for Complex Networks and Systems Research at Indiana University on an NSF grant to analyze and visualize the relationship between social media expressions and political events. She is an active contributor to open source projects and continues to publish in computer supported cooperative work and computational social science venues. She currently works as a Software Engineer at Continuum Analytics.
Imagine you are a graduate student in some social or behavioral science (not hard, I assume). You want to collect some data: say I'm going to study the fluctuation of the value of products over time on Craigslist, or ping the Sunlight Foundation's open government data, or use GDELT to study violent political events. There are a variety of tools I may end up using for my workflow (a short sketch of how far Python alone can take this pipeline follows the list):
- Retrieving the data: Python, BeautifulSoup
- Storing the data: CSV, JSON, MySQL, MongoDB, bash
- Retrieving this stored data: SQL, Hive, Hadoop, Python, Java
- Manipulating the data: Python, CSV, R
- Running regressions, simulations: R, Python, Stata, Java
- Presenting the data: R, Excel, PowerPoint, Word, LaTeX
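To give a flavor of how much of that pipeline Python can cover on its own, here is a minimal sketch, assuming a hypothetical listings page and made-up field names, that retrieves a page with requests and BeautifulSoup, stores the results as CSV, and summarizes them with pandas:

```python
# A minimal sketch, not a real scraper: the URL, the "listing" markup,
# and the field names are all hypothetical placeholders.
import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get("http://example.com/listings")      # retrieve
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.find_all("p", class_="listing"):            # parse
    rows.append({
        "title": item.find("a").get_text(strip=True),
        "price": item.find("span", class_="price").get_text(strip=True),
    })

df = pd.DataFrame(rows)
df.to_csv("listings.csv", index=False)                       # store
print(df.describe(include="all"))                            # summarize
```

From there, the same DataFrame can be handed to a plotting or statistics library without ever leaving Python.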
My workflow for doing research now requires a variety of tools, some of which I might have never used before. The number of tools I use seems to scale with the amount of work I try to accomplish. When I encounter a problem in my analysis, or can’t reproduce some regression or simulation I ran, what happened? Where did it break?
Should it really be this difficult? Should I really have to learn 10 different tools to do data analysis on large datasets? We can look at the Big Data problem in a similar light to surveys and regression models. According to IT companies such as Mustard IT, the largest and most fundamental part of the equation is simply that this stuff is new – high-priority, well-thought-out workflows have yet to be fully developed and stabilized.
What if I told you that you could do all of this with the fantastically large number of open source packages in Python? In your web browser, on your iPad?
Python and PyData
I just returned from a trip to NYC for PyData, a three-day conference with one day of tutorials and two days of full-length talks, all about big questions at the intersection of Python and data.
At the conference, you have your typical budding entrepreneurs, entrenched data scientists, and internal language engineers. Strikingly, there was also a healthy number of amateur Pythonistas, people who wanted to learn more about the language to see if they could make the switch.
I recently wrote up my thoughts about what I will call "legacy tools," such as R, SPSS, MATLAB, and FORTRAN. These tools, particularly R, are becoming obsolete because they can't handle your Big Data problem. When you use them, your data often seems 'Big' because it doesn't fit into memory, or the language was poorly designed, or there are no easy-to-use libraries for integrating with a database. And often, it crashes your computer.
You may be surprised that I am putting R on this list, for R is a well-known, broadly used tool in statistics and data analysis. However, in the context of the social sciences, and of my examples above, we are still often forced to take a variety of approaches to collecting, storing, and manipulating data. The process is fragmented, and R is just one piece. Python, on the other hand, can do it all, and quickly. It's going to be the foundation of the future of this trade.
At PyData, a community of about 500 people listened to talks ranging from "Intro to Python Data Analysis in Wakari" to "PyParallel: How we Removed the GIL and Exploited all Cores", which was about creating a parallelized language toolkit on Windows in 136 slides times 5 bullet points. It's clear that the community is diverse and that there's a lot going on – a lot to say at many different levels, on everything from tutorials to crafting the language itself. A substantial number of people are making Python fast for datasets and simulations of arbitrary size and scale.
The Coolest New Python Data Thing
PyData, PyCon, and SciPy are places where people often reveal the next ‘cool thing’ that’s going to revolutionize the way we do everything, like eat cereal. But realistically, something very important was demoed this weekend.
IPython Notebook is a fantastic tool for doing data analysis in Python. As Philip Guo puts it,
Everything related to my analysis is located in one unified place. Instead of saving dozens or hundreds of code, output, and notes files for an experiment, everything is bundled together in one single file.
These notebooks can be exported to a variety of formats, including HTML and PDF (in fact, a couple of Python books have been written this way!). The IPython Notebook makes it easy to reproduce research, since all of your documentation and code sits inline with the plots, with the ability to do live manipulation if you use Wakari.
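As a small illustration of that export step, here is a minimal sketch of converting a notebook to HTML from Python, assuming a hypothetical analysis.ipynb; the HTMLExporter now lives in the standalone nbconvert package (older IPython releases bundled it), and most people simply run the nbconvert command-line tool instead:

```python
# A minimal sketch: export a notebook to a standalone HTML page.
# "analysis.ipynb" is a hypothetical filename.
from nbconvert import HTMLExporter

exporter = HTMLExporter()
body, resources = exporter.from_filename("analysis.ipynb")

with open("analysis.html", "w") as f:
    f.write(body)
```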
Brian Granger showed up and gave the crowd a tour of the new IPython Notebook feature called "interact." Interact lets you write a function that produces output and then manipulate that function's parameters in near real time. It is similar to Mathematica's Manipulate. Imagine a high school student watching a sine wave change in real time as they move a slider, or a physics researcher analyzing a plot from multiple angles and easily explaining parameter changes to their audience. It's going to change everything in the programming and data analysis space, particularly in regard to teaching and reproducing results. The interact feature is due for release in January.
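As a rough sketch of that sine-wave example, here is how a plot can be hooked up to sliders with the interact API; the import path shown is from the later ipywidgets package (older IPython releases exposed it under IPython.html.widgets), so treat the exact module as an assumption:

```python
# A minimal sketch of interact: sliders are generated automatically
# from the (min, max, step) tuples and re-run the function on change.
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact  # older IPython: IPython.html.widgets

def plot_sine(frequency=1.0, amplitude=1.0):
    x = np.linspace(0, 2 * np.pi, 500)
    plt.plot(x, amplitude * np.sin(frequency * x))
    plt.ylim(-3, 3)
    plt.show()

interact(plot_sine, frequency=(0.5, 5.0, 0.1), amplitude=(0.5, 3.0, 0.1))
```

Inside a notebook, dragging either slider re-runs the function and redraws the plot immediately, which is exactly the kind of live manipulation described above.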
Conclusion
If we want to incorporate large datasets without filtering and corrupting our analysis with assumptions, we need to develop a toolkit that is easy to teach, easy to use, easy to reproduce, and easy to extend. Python is all of these things. We need to be able to understand and estimate the impact of Python and consider teaching it alongside R, or even in place of it.
The IPython Notebook is a good entry point for making the switch to Python for your data analysis, but it's hard to learn a new tool. I recommend starting with the tutorials on Wakari, and if you want to install packages locally, use Anaconda, not pip.
The time has come to upgrade our stack, and Python is my recommendation. Maybe I’ll see you at the next PyData.
Editor’s note: For full disclosure, Anaconda and Wakari are both products of the author’s employer, Continuum Analytics. – Alex