Tutorial - Deploying Hadoop at Amazon EC2

There is an old tutorial on Hadoop's wiki page: http://wiki.apache.org/hadoop/AmazonEC2, but when I recently followed it I noticed that it doesn't cover some newer Amazon functionality.

To follow this tutorial it is recommended that you are already familiar with the basics of Hadoop; a very useful getting-started tutorial can be found at Hadoop's homepage: http://hadoop.apache.org/. You should also be familiar with at least Amazon EC2 internals and instance types.

When you register an account at Amazon AWS you receive 750 hours to run t1.micro instances, but unfortunately you can't successfully run Hadoop on such machines.

In the following steps, a command starting with $ should be executed on your local machine, while a command starting with # should be executed on the EC2 instance.

Create an X.509 Certificate

Since we are going to use the ec2-tools, our account at AWS needs a valid X.509 certificate:

  • Create the .ec2 folder:
    $ mkdir ~/.ec2
  • Log in at AWS
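The original doesn't spell out the remaining steps, but the usual flow is to generate and download an X.509 certificate from the AWS "Security Credentials" page and keep it in ~/.ec2. As a sketch — the pk-XXXXXXXX.pem and cert-XXXXXXXX.pem file names are placeholders for whatever names the AWS console gives you:

```shell
# Move the downloaded private key and certificate into ~/.ec2.
# (File names below are placeholders; use the actual downloaded names.)
$ mv ~/Downloads/pk-XXXXXXXX.pem ~/Downloads/cert-XXXXXXXX.pem ~/.ec2/
# The private key should be readable only by you.
$ chmod 600 ~/.ec2/pk-XXXXXXXX.pem
```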

Setting up Amazon EC2-Tools

  • Download and unpack the ec2-tools;
  • Edit your ~/.profile to export all the variables needed by the ec2-tools, so you don't have to set them every time you open a prompt:
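As a sketch, the exports in ~/.profile would look something like the following — the install paths and the Java location are assumptions for a typical Linux setup, and the certificate file names are the placeholders from the previous section:

```shell
# Assumed install locations -- adjust to where you unpacked the tools.
export JAVA_HOME=/usr/lib/jvm/java-6-sun       # the ec2-tools are Java-based
export EC2_HOME=~/ec2-api-tools                # unpacked ec2-tools folder
export PATH=$PATH:$EC2_HOME/bin
# Certificate and private key created in the previous section.
export EC2_PRIVATE_KEY=~/.ec2/pk-XXXXXXXX.pem
export EC2_CERT=~/.ec2/cert-XXXXXXXX.pem
```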

Setting up Hadoop

After downloading and unpacking Hadoop, you have to edit the EC2 configuration script located at src/contrib/ec2/bin/hadoop-ec2-env.sh.

  • AWS variables

  • Security variables

  • Select an AMI
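The script itself documents each variable. As a rough sketch of the three groups of edits — the values shown are placeholders, and the variable names are those found in hadoop-ec2-env.sh of Hadoop releases from that era, so check your own copy of the script:

```shell
# AWS variables -- your account credentials (placeholders).
AWS_ACCOUNT_ID=XXXXXXXXXXXX            # 12-digit account id, without dashes
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key

# Security variables -- the keypair used to log in to the instances.
KEY_NAME=gsg-keypair
PRIVATE_KEY_PATH=~/.ec2/id_rsa-gsg-keypair

# Select an AMI -- the Hadoop version and instance type determine the image.
HADOOP_VERSION=0.19.0
INSTANCE_TYPE=m1.small
```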

Running!

  • You can add the content of src/contrib/ec2/bin to your PATH variable so you will be able to run the commands independently of where the prompt is opened;
  • To launch an EC2 cluster and start Hadoop, you use the following command. The arguments are the cluster name (hadoop-test) and the number of slaves (2). When the cluster boots, the public DNS name will be printed to the console.

    $ hadoop-ec2 launch-cluster hadoop-test 2
  • To log in to the master node of your "cluster", type:
    $ hadoop-ec2 login hadoop-test
  • Once you are logged into the master node you will be able to start the job:

  • For example, to test your cluster, you can run the pi calculation that is already provided by hadoop-examples.jar. You can check your job's progress at http://MASTERHOST:50030/, where MASTERHOST is the host name returned after the cluster started.
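As a sketch — the jar's exact file name depends on the Hadoop version you deployed — the pi job is started on the master node like this, where the last two arguments are the number of maps and the number of samples per map:

```shell
# Run on the master node; the examples jar name varies with the Hadoop version.
# hadoop jar $HADOOP_HOME/hadoop-*-examples.jar pi 10 10000000
```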

  • After your job has finished, the cluster remains alive. To shut it down, you use the following command:
    $ hadoop-ec2 terminate-cluster hadoop-test
  • Remember that in Amazon EC2 the instances are charged by the hour, so if you only wanted to run tests, you can play with the cluster for a few more minutes before terminating it.