This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com.
A few months ago I passed the 10-year point in my analytics/predictive modeling career. While ‘Big Data’ and ‘Data Science’ have only become buzzwords in recent years, hitting the limit on computing resources has plagued me throughout my career. I’ve seen this problem manifest itself in many ways: analysts being assigned multiple computers for daily work, teams continuously scraping together budget for more processors on a remote SAS server, and companies spending millions on large enterprise databases just to get data processing below a 24-hour window.
Luckily, advances in open source software & cloud computing have driven down the cost of data processing & analysis immensely. Using IPython Notebook along with Amazon EC2, you can now procure a 32-core, 60GB RAM virtual machine for roughly $0.27/hr (using a spot instance). This tutorial will show you how to set up a cluster instance at Amazon, install Python, set up IPython as a public notebook server and access this remote cluster via your local web browser.
To get started with this tutorial, you need to have an Amazon Web Services account. I also assume that you already have basic experience interacting with computers via the command line and know about IPython. Basically, that you are the average Bad Hessian reader…
Setting Up a Cluster EC2 Instance
Setting up a cluster instance on Amazon EC2 follows the same process as any other EC2 instance, with one minor difference: in order to use the cc2.8xlarge instance type, you need to choose an operating system that supports “HVM” (Hardware Virtual Machine). I use Ubuntu 12.04 LTS 64-bit with HVM support, for no other reason than that I use Ubuntu 12.04 LTS on my local machine and am used to it.
Additionally, I set up my EC2 instances as spot instances, rather than on-demand; this is an aggressive cost-saving move that works because we are using IPython Notebook. Given that IPython Notebook runs in your local browser, even if your instance gets outbid (i.e. shut off by Amazon), you still retain the code locally. For my workflow, where I generally pull data from S3 or a relational database, the time spent developing my code far outweighs the cost of occasionally needing to re-run a script. However, if your work is mission-critical, all of the steps are the same with on-demand instances, which will run until you shut them down (it just costs you 10x more per hour!).
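To put that trade-off in numbers, here is a back-of-the-envelope sketch that uses only the two figures quoted in this post: roughly $0.27/hr for a spot instance and about 10x that for on-demand. The 40-hour session length is a made-up example, and real spot prices fluctuate, so treat the result as an illustration rather than a quote.

```python
SPOT_RATE = 0.27           # USD per hour, the spot figure quoted in this post
ON_DEMAND_MULTIPLIER = 10  # "10x more per hour", also from this post


def session_cost(hours, hourly_rate):
    """Cost in USD of running one instance for `hours` at `hourly_rate`."""
    return round(hours * hourly_rate, 2)


hours = 40  # hypothetical: roughly one work week of analysis
spot = session_cost(hours, SPOT_RATE)
on_demand = session_cost(hours, SPOT_RATE * ON_DEMAND_MULTIPLIER)

print("spot: $%.2f, on-demand: $%.2f" % (spot, on_demand))
```

Under these assumptions the gap for a week of work is about a hundred dollars, which is why an occasional re-run after being outbid can be an acceptable price.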
The SlideShare presentation below outlines the steps needed to set up a remote IPython Notebook environment (also available as a PDF download).
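If you would rather script the request than click through the console, a spot request can be sketched with the boto3 library. This is an alternative to the console walkthrough, not the method used in this post. The AMI ID, key pair name, security group ID, bid price and region below are all placeholders you would need to replace, and the helper names `build_spot_request` and `submit_spot_request` are invented for this example.

```python
def build_spot_request(ami_id, key_name, security_group_id, max_price="0.30"):
    """Assemble keyword arguments for boto3's EC2 request_spot_instances call."""
    return {
        "SpotPrice": max_price,   # maximum bid in USD/hour, passed as a string
        "InstanceCount": 1,
        "Type": "one-time",       # do not re-request after the instance is reclaimed
        "LaunchSpecification": {
            "ImageId": ami_id,    # must be an HVM image for this instance type
            "InstanceType": "cc2.8xlarge",
            "KeyName": key_name,
            "SecurityGroupIds": [security_group_id],
        },
    }


def submit_spot_request(params, region="us-east-1"):
    """Send the request to AWS and return the spot request ID."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.request_spot_instances(**params)
    return response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]


# Placeholder identifiers: substitute your own before submitting.
params = build_spot_request("ami-xxxxxxxx", "my-keypair", "sg-xxxxxxxx")
```

Calling `submit_spot_request(params)` requires AWS credentials to be configured. The security group needs to allow inbound SSH as well as whatever port the notebook server will listen on.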
Setting up IPython as a Remote Notebook Server
Once you have your EC2 cluster instance up and running, SSH into your instance and install Python/IPython. I use the Anaconda distribution because it’s easy and avoids having to worry about getting NumPy and related scientific packages installed correctly. Regardless of how you install IPython, the setup of the remote Notebook server is the same:
- Generate password for IPython Notebook server
- Create ‘nbserver’ IPython profile
- Create self-signed SSL certificate
- Modify ipython_notebook_config.py script
- Start IPython Notebook using nbserver profile
The following GitHub Gist outlines the required commands:
Once you submit the final line of code to turn on IPython Notebook using the nbserver profile, that’s it: You’re running an IPython Notebook public server!
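As a rough guide to what those five steps involve, here is a minimal sketch. It is not the Gist itself. On the server you would normally generate the hash with IPython's own helper (`from IPython.lib import passwd` in IPython versions of this era); the `hash_password` function below is a standard-library stand-in that, as far as I can tell, produces the same `algorithm:salt:digest` layout. The config option names follow the IPython documentation for running a public notebook server, while port 9999 and the certificate path are example values only.

```python
import hashlib
import hmac
import secrets


def hash_password(passphrase, algorithm="sha1", salt_len=12):
    """Return 'algorithm:salt:digest' for a salted passphrase (stand-in for passwd())."""
    salt = secrets.token_hex(salt_len // 2)  # salt_len hex characters
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))


def check_password(hashed, passphrase):
    """Verify a passphrase against a hash produced by hash_password()."""
    algorithm, salt, digest = hashed.split(":")
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, digest)


def render_config(hashed, certfile, port=9999):
    """Build the text of ipython_notebook_config.py for the nbserver profile."""
    lines = [
        "c = get_config()",
        "c.IPKernelApp.pylab = 'inline'",       # plots render inside the notebook
        "c.NotebookApp.certfile = u'%s'" % certfile,
        "c.NotebookApp.ip = '*'",               # listen on all interfaces
        "c.NotebookApp.open_browser = False",   # the server has no local browser
        "c.NotebookApp.password = u'%s'" % hashed,
        "c.NotebookApp.port = %d" % port,
    ]
    return "\n".join(lines) + "\n"


# Shell commands for the remaining steps, run on the EC2 instance.
COMMANDS = [
    "ipython profile create nbserver",
    "openssl req -x509 -nodes -days 365 -newkey rsa:2048 "
    "-keyout mycert.pem -out mycert.pem",
    "ipython notebook --profile=nbserver",
]

hashed = hash_password("example-passphrase")  # example only; choose your own
config_text = render_config(hashed, "/home/ubuntu/mycert.pem")
```

The generated text would go into the profile's `ipython_notebook_config.py` (created by the first command, under `~/.ipython/profile_nbserver/` in a default install) before the last command starts the server.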
Accessing Your IPython Notebook Server
With IPython Notebook running on your EC2 instance, you can access the remote server by using the public DNS of your instance, similar to:
Because we used a self-signed SSL certificate, your browser may warn you about a security issue. This is expected and you can proceed to type in your password. At that point, IPython Notebook will look and behave just as if you were using it on your local machine, except you’ll have a LOT more processing power! Just keep in mind that if you use a spot instance, Amazon can turn off your instance at any time without warning…
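The same certificate warning shows up if you poke at the server from a script instead of a browser. The sketch below builds the HTTPS address from a public DNS name and port, then checks that the server answers while skipping certificate verification. The hostname and port are invented placeholders (use your instance's public DNS and the port from your `ipython_notebook_config.py`), and disabling verification is only reasonable here because you created the certificate yourself.

```python
import ssl
import urllib.request


def notebook_url(public_dns, port):
    """Build the HTTPS address of the notebook server."""
    return "https://%s:%d" % (public_dns, port)


def unverified_context():
    """TLS context that accepts a self-signed certificate by skipping verification."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before relaxing verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def server_is_up(url, timeout=5):
    """Return True if an HTTPS request to `url` comes back with status 200."""
    try:
        with urllib.request.urlopen(
            url, context=unverified_context(), timeout=timeout
        ) as resp:
            return resp.status == 200
    except OSError:  # covers connection failures, timeouts and HTTP errors
        return False


# Hypothetical hostname and port, for illustration only.
url = notebook_url("ec2-203-0-113-10.compute-1.amazonaws.com", 9999)
```

Calling `server_is_up(url)` against your own instance gives a quick way to tell a stopped spot instance apart from a browser problem.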
What If I Need MORE Processing Power?
By using a cc2.8xlarge EC2 instance, you get roughly 4x the number of cores and 8x the amount of RAM of an off-the-shelf Core i7 laptop, all without needing to modify your programming style (as you would if you went to MapReduce-style processing). I’ve also used this setup to avoid having to use HDF5 to work around out-of-memory issues (though HDF5 is a great way to handle medium-sized-but-larger-than-RAM data as well).
But if you find that you need even more power than a single cc2.8xlarge instance, there is the StarCluster project from MIT, which lets you create your own EC2 clusters of arbitrary size, provides elastic load balancing for increasing/decreasing your instances based on workload, and offers lots of other useful features for scientific computing. I’ve never needed to use StarCluster in my daily work, but then again, I don’t really have truly ‘Big Data’…maybe just data that could use a salad every once in a while.