Getting started with Hadoop

Hadoop is an open source Java implementation of Google's MapReduce, a distributed programming technique. I've been investigating Hadoop lately for some data processing tasks at work. It is a bit of a minefield at first, so I'm writing this post partly to keep track of everything for myself, and partly in the hope of saving somebody else some time, or maybe sparking an interest in Hadoop.

Writing Hadoop applications

Hadoop takes care of distributing your application over several nodes, but the rest is down to you. You can write your application in Java, or as a "streaming" application in any language. Streaming apps simply read data from stdin and write their output to stdout. Writing a Java application seems to give you much more control over various options, and presumably improved performance too, but I opted for streaming for simplicity. The following post gives a good example of a streaming program for Hadoop: Writing an Hadoop MapReduce program in Python.
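To make the stdin/stdout contract concrete, here is a minimal word count sketch in Python. It is not taken from the post linked above; the names `map_lines`, `reduce_lines` and `STAGES`, and the convention of passing `map` or `reduce` as the first argument, are my own assumptions for illustration. The mapper emits tab-separated key/value records, and the reducer relies on its input arriving sorted by key so that all records for a word sit next to each other.

```python
#!/usr/bin/env python3
"""Word count written in the streaming style (illustrative sketch only)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word.

    Assumes the records are sorted by key, so equal words are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: the first argument selects which stage to run.
STAGES = {"map": map_lines, "reduce": reduce_lines}

if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in STAGES:
    # Read records from stdin and write results to stdout, as streaming expects.
    for record in STAGES[sys.argv[1]](sys.stdin):
        print(record)
```

Saved under a hypothetical name such as wordcount.py, the two stages would be invoked as `./wordcount.py map` and `./wordcount.py reduce`. Keeping the logic in plain generator functions also means it can be checked without touching stdin at all.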

One of the best things I've found with streaming jobs is that you can test them on a small amount of data on the command line, with no need for Hadoop:

cat input-data | ./mapper | sort | ./reducer
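The same pipeline can be imitated inside a single Python process, which is handy for unit tests. The sketch below is only an approximation under my own assumptions: `run_local`, `status_mapper` and `count_reducer` are hypothetical names, the sample log lines are made up, and `sorted` stands in for the `sort` step (its ordering may differ from `sort` under some locales, but equal keys still end up adjacent).

```python
from itertools import groupby


def status_mapper(lines):
    """Emit "status<TAB>1" per log line; the status is assumed to be the last field."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"


def count_reducer(sorted_records):
    """Sum the counts of adjacent records that share a key."""
    keyed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(mapper, reducer, lines):
    """Imitate `cat input | mapper | sort | reducer` in one process."""
    mapped = mapper(lines)      # the ./mapper stage
    shuffled = sorted(mapped)   # stands in for `sort`
    return list(reducer(shuffled))  # the ./reducer stage


# Made-up sample input, purely for illustration.
sample = ["GET /index.html 200", "GET /missing 404", "GET /about.html 200"]
print(run_local(status_mapper, count_reducer, sample))
```

Dropping the `sorted` call is a quick way to see why the sort step matters: the reducer would then emit a separate count each time a key reappears after a different one.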

Running Hadoop

Rather than installing a local copy of Hadoop, I used a virtual machine from Yahoo! which comes with Hadoop pre-installed. The virtual machine is available from their Hadoop tutorial page, which includes full details of how to run the VM and start running your Hadoop application on it. The VM even includes an Eclipse plugin for Hadoop. The plugin is mainly focused on writing Java applications, but it is still useful if you're writing streaming apps because it gives you job progress details and easy access to the Hadoop file system on the VM.

Multiple nodes

The Yahoo VM is a great way to test your Hadoop apps locally, but doing distributed computing on a single node kind of misses the point. If you want to do any serious processing with Hadoop you'll need lots of machines. Amazon's Elastic Compute Cloud (EC2) is perfect for this.

Amazon actually provide an Elastic MapReduce service which sits on top of EC2 and automatically manages a cluster of EC2 instances for you while your Hadoop application runs. The article Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming contains details of how to use the service with a real-world example.

An alternative way to run Hadoop on EC2 is to use the Cloudera distribution. Their guide gives full details on how to set this up, and also links to client side scripts which make it extremely simple to bring up a cluster and start running your Hadoop job. Using the Cloudera distribution has the benefit of being cheaper than using the Elastic MapReduce service, because you only pay for your EC2 usage. Another advantage is that you have more control over your EC2 instances, so you can install any packages that you wish. For example, you could boot up your instances with subversion or git installed, and automatically run a script to check out the latest version of your code and data and run your Hadoop program!

Alternatives to Java and streaming apps

There are several projects built on top of Hadoop that allow you to access or process your data in different ways. Hive allows you to write SQL-like queries on your data, which are automatically converted to MapReduce tasks. Dumbo allows you to write Hadoop applications in Python at a lower level than you can with streaming apps.

For more information check out the main Hadoop page, which is packed full of documentation.

Posted on 29 Aug 2009

If you enjoyed reading this post you might want to follow @coderholic on Twitter or browse through the full blog archive.