This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com.

A few months ago I passed the 10-year point in my analytics/predictive modeling career. While ‘Big Data’ and ‘Data Science’ have only become buzzwords in recent years, hitting the limit on computing resources has been something that has plagued me throughout my career. I’ve seen this problem manifest itself in many ways, from having analysts get assigned multiple computers for daily work, to continuously scraping together budget for more processors on a remote SAS server and spending millions on large enterprise databases just to get processing of data below a 24-hour window.

Luckily, advances in open source software & cloud computing have driven down the cost of data processing & analysis immensely. Using IPython Notebook along with Amazon EC2, you can now procure a 32-core, 60GB RAM virtual machine for roughly $0.27/hr (using a spot instance). This tutorial will show you how to set up a cluster instance at Amazon, install Python, set up IPython as a public notebook server and access this remote cluster via your local web browser.

To get started with this tutorial, you need to have an Amazon Web Services account. I also assume that you already have basic experience interacting with computers via the command line and know about IPython. Basically, that you are the average Bad Hessian reader…

Setting Up a Cluster EC2 Instance

Setting up a cluster instance on Amazon EC2 follows the same process as any other EC2 instance, with one minor difference: in order to use the cc2.8xlarge instance type, you need to choose an operating system that supports “HVM” (Hardware Virtual Machine). I use Ubuntu 12.04 LTS 64-bit with HVM support, for no other reason than that I use Ubuntu 12.04 LTS on my local machine and am used to it.

Additionally, I set up my EC2 instances as spot instances, rather than on-demand; this is an aggressive cost-saving move that works because we are using IPython Notebook. Given that IPython Notebook runs in your local browser, even if your instance gets outbid (i.e. shut off by Amazon), you still retain the code locally. For my workflow, where I generally pull data from S3 or a relational database, the time developing my code far outweighs occasionally needing to re-run a script. However, if your work is mission critical, all of the steps are the same with on-demand instances, which will run until you shut them down (it just costs you 10x more per hour!).
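To make the trade-off concrete, here is a back-of-the-envelope sketch that uses only the figures quoted in this post: a spot price of roughly $0.27/hr and an on-demand price assumed to be about 10x that. Actual prices vary by region and over time, so treat both rates (and the 8-hour session length, which is hypothetical) as placeholders.

```python
SPOT_RATE = 0.27           # USD per hour, the approximate spot price quoted in the post
ON_DEMAND_MULTIPLIER = 10  # "10x more per hour", also taken from the post


def session_cost(hours, hourly_rate):
    """Cost in USD of running one instance for the given number of hours."""
    return hours * hourly_rate


if __name__ == "__main__":
    hours = 8  # hypothetical length of one working session
    spot = session_cost(hours, SPOT_RATE)
    on_demand = session_cost(hours, SPOT_RATE * ON_DEMAND_MULTIPLIER)
    print("spot: ${:.2f}, on-demand: ${:.2f}".format(spot, on_demand))
```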

The SlideShare presentation below outlines the steps needed to set up a remote IPython Notebook environment (or, PDF download).

Setting up IPython as a Remote Notebook Server

Once you have your EC2 cluster instance up and running, SSH into your instance and install Python/IPython. I use the Anaconda distribution because it’s easy and avoids having to worry about getting NumPy and related scientific packages installed correctly. Regardless of how you install IPython, the setup of the remote Notebook server is the same:

  • Generate password for IPython Notebook server
  • Create ‘nbserver’ IPython profile
  • Create self-signed SSL certificate
  • Modify ipython_notebook_config.py script
  • Start IPython Notebook using nbserver profile

The following GitHub Gist outlines the required commands:

Once you submit the final line of code to turn on IPython Notebook using the nbserver profile, that’s it: You’re running an IPython Notebook public server!
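As a rough sketch of the configuration step in the list above, the snippet below builds the kind of text one might add to ipython_notebook_config.py inside the nbserver profile. The option names are the ones documented for the IPython 1.x notebook server; the helper function, the certificate path and the placeholder hash are illustrative assumptions, not the contents of the original Gist.

```python
# Sketch only: builds the text one might add to
# ~/.ipython/profile_nbserver/ipython_notebook_config.py.
# Option names follow the IPython 1.x notebook server docs; the hash and
# certificate path passed in below are hypothetical placeholders.


def render_notebook_config(password_hash, certfile, port=8888):
    """Return config lines for a password-protected HTTPS notebook server."""
    lines = [
        "c = get_config()",
        "# Serve over HTTPS with the self-signed certificate",
        "c.NotebookApp.certfile = {!r}".format(certfile),
        "# Listen on all interfaces so the server is reachable remotely",
        "c.NotebookApp.ip = '*'",
        "# There is no browser to open on a headless EC2 instance",
        "c.NotebookApp.open_browser = False",
        "# Hashed password, e.g. as produced by IPython.lib.passwd()",
        "c.NotebookApp.password = {!r}".format(password_hash),
        "c.NotebookApp.port = {:d}".format(port),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_notebook_config(
        password_hash="sha1:REPLACE_WITH_SALT:REPLACE_WITH_DIGEST",  # placeholder
        certfile="/home/ubuntu/certificates/mycert.pem",  # hypothetical path
    ))
```

The remaining steps are one-line shell commands: the profile is created with `ipython profile create nbserver` and the server is started with `ipython notebook --profile=nbserver`.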

Accessing Your IPython Notebook Server

With IPython Notebook running on your EC2 image, you can access the remote server by using the public DNS of your image, similar to:

https://ec2-54-221-167-250.compute-1.amazonaws.com:8888

Because we used a self-signed SSL certificate, your browser may warn you about a security issue. This is expected and you can proceed to type in your password. At that point, IPython Notebook will look and behave just as if you were using it on your local machine, except you’ll have a LOT more processing power! Just keep in mind that if you use a spot instance, Amazon can turn off your instance at any time without warning…

What If I Need MORE Processing Power?

By using a cc2.8xlarge EC2 instance, you get roughly 4x the number of cores and 8x the amount of RAM as an off-the-shelf Core i7 laptop, all the while not needing to modify your programming style (as you would if you went to MapReduce-style processing). I’ve also used this setup to avoid having to use HDF5 to work around out-of-memory issues (though HDF5 is a great way to handle medium-sized-but-larger-than-RAM data as well).

But if you find that you need even more power than a single cc2.8xlarge instance, there is the StarCluster project from MIT which allows for creating your own EC2 clusters of arbitrary size, provides elastic load balancing for increasing/decreasing your instances based on workload and lots of other useful features for scientific computing. I’ve never needed to use StarCluster in my daily work, but then again, I don’t really have truly ‘Big Data’…maybe just data that could use a salad every once in a while.

  • Hezi

    How do you utilize 32 cores through your Python code? Which parallelism mechanism are you using?

    • randyzwitch

      There are a few ways to utilize multiple cores. One way is to use the multiprocessing module for Python. This will allow you to set up a Queue and workers, then you can have your code running in parallel. The downside of this is that you have to know how to convert serial code into a parallel algorithm. Here’s a great blog post that explains how to get started with the multiprocessing module:

      http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/
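      A minimal sketch of that idea, using a multiprocessing.Pool rather than a hand-built Queue of workers. The function names, the toy workload and the worker count are illustrative assumptions, not code from the linked post.

```python
import multiprocessing


def sum_of_squares(chunk):
    """CPU-bound work done independently on one chunk of the input."""
    return sum(x * x for x in chunk)


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(values, workers=2):
    """Fan the chunks out to worker processes, then combine the partial sums."""
    chunks = split_into_chunks(values, workers)
    with multiprocessing.Pool(processes=workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1000))
    # The parallel result should match the plain serial computation.
    print(parallel_sum_of_squares(data, workers=4) == sum(x * x for x in data))
```

      The serial-to-parallel conversion here is the split/combine pair: the work has to be cut into independent chunks, and the partial results have to be cheap to merge.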

      The way I use a multicore system like this is through the scikit-learn module for machine learning, which has support for automatic parallelism. For example, the RandomForest classifier (and many others) takes an argument “n_jobs”, which sets the number of parallel processes you’d like to run:

      http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

      • Christian Jauvin

        Thanks for the excellent article and idea! I’d like to point out that sklearn’s parallelism module is actually a library in its own right:

        http://pythonhosted.org/joblib/

        • Hezi

          Thank you both, indeed an excellent article and the reference to joblib is very helpful

  • Thomas

    You might also be interested in NotebookCloud, a service which starts Amazon instances running the IPython notebook. It doesn’t have the flexibility of setting things up yourself, but it is simple to get started with:

    https://notebookcloud.appspot.com/login

  • Justin Riley

    Nice guide and thanks for mentioning StarCluster! Just wanted to link you to StarCluster’s IPython plugin which does all of this setup automatically (including notebook with self-signed cert, etc):

    http://star.mit.edu/cluster/docs/latest/plugins/ipython.html

    Also StarCluster has an HVM AMI based on Ubuntu 12.04 that has NumPy/SciPy installed and linked to OpenBLAS, OpenMPI, NVIDIA driver and CUDA toolkit (for GPU cluster instances), and more (see ‘starcluster listpublic’). It’s also easy to extend the AMI for your needs (e.g. install Anaconda):

    http://star.mit.edu/cluster/docs/latest/manual/create_new_ami.html

  • Dan G

    We don’t put our notebook servers directly on the Internet. It’s only a password and it’s easy to password grind. Try using ssh port forwarding:

    ssh -L LOCAL_PORT:REMOTE_HOST:REMOTE_PORT USER@SSH_HOST

    ssh -L 8889:ec2.some.address.com:8888 ubuntu@ec2.some.address.com

  • sobach82

    Thank you for such a nice guide! One question: suppose I need this instance for only a few hours at a time, but on more than one occasion. The first time, I use this guide to set up the instance and compute something. What is my next step? Should I terminate the instance and repeat all the steps the next time I need it? Or is there a kind of “hibernate” mode (no work, no payments, and no need to set up the instance again, just wake it up)?

    • randyzwitch

      It’s a trade-off between cost and permanence. If you start your instance as a spot instance, then you’ll need to do a backup to EBS to “save” your instance as an image. If you don’t save your instance, then it’s wiped out when you terminate. I’ve always found doing the backup a hassle, so I keep my commands in a script and just rebuild my cluster instance each time.

      Alternatively, you could just start an on-demand instance, which does get saved upon shutdown. The downside is that it costs about 10x more per hour.

      • sobach82

        On-demand instances have unsavory prices 😉 especially in comparison with spot. But as I understand it, it’s possible to create your own AMIs from EBS snapshots (http://stackoverflow.com/a/22265180/3314290). Looks like that’s the optimal solution for me. Thank you!
