Charming Hadoop

One of the biggest things currently happening in the Ubuntu Server world is the advent of easy service deployment using Juju, the Ubuntu service orchestration and deployment tool.

The community synthesizing all of the great DevOps knowledge about how to deploy services has been very active and there are now more that 90 Charms (the distillation of how to deploy and scale a service) in the Juju Charm Store.

I’ve been fairly active within the Charm community and have written Charms for deploying Hadoop, HBase and Zookeeper.

This post is the first of a few I’m planning which will dive into these charms in detail – starting with Hadoop.

So Whats Hadoop?

Hadoop is a software platform that makes it easy to write and run applications that process vast amounts of data.  It really likes ‘commodity hardware’ and can be scaled out to 1000′s of servers (which we have tested using Juju – see Mark Mims’s excellent post on scaling Juju).

Despite the fact that Hadoop makes its easy to write these applications, there is still a lot of DevOps knowledge involved in how to deploy Hadoop effectively.

The Juju Charm for Hadoop makes a start on distilling that knowledge into something that anyone can pickup and use to deploy Hadoop well.

Getting Started

As always, the first step to deploying services with Juju is to configure a Juju environment and bootstrap it.  In this example we will be using Amazon ec2 to deploy services:

default: hadoop-eu-west-1
environments:
  hadoop-eu-west-1:
    type: ec2
    control-bucket: <name>
    admin-secret: <random secret>
    access-key: <ec2 access key>
    secret-key: <ec2 secret key>
    region: eu-west-1
    default-series: precise
    ssl-hostname-verification: true
    juju-origin: ppa

Simple drop that into ~/.juju/environments.yaml (providing your own ec2 credentials) and then bootstrap the environment:

prompt> juju bootstrap
2012-08-16 15:46:50,541 INFO Bootstrapping environment 'hadoop-eu-west-1' (origin: ppa type: ec2)...
2012-08-16 15:46:58,311 INFO 'bootstrap' command finished successfully

After a few minutes you should be able to query the environment using:

prompt> juju status
2012-08-16 15:50:24,742 INFO Connecting to environment...
2012-08-16 15:50:27,498 INFO Connected to environment.
machines:
  0:
    agent-state: running
    dns-name: ec2-46-137-18-68.eu-west-1.compute.amazonaws.com
    instance-id: i-b189b0f9
    instance-state: running
services: {}
2012-08-16 15:50:28,130 INFO 'status' command finished successfully

You are now ready to start deploying Hadoop.

Deploy Hadoop!

Hadoop provides a number of different components that are combined together to form a complete Hadoop deployment:

  • Name Node: Master index of all data stored across the Hadoop cluster datanodes.
  • Data Node: Responsible for storage of data (lots of these).
  • Job Tracker: Co-ordinates Map Reduce jobs across the Hadoop cluster tasktrackers.
  • Task Tracker: Runs individual parts of the Map Reduce job (again lots of these).

The charm supports deploying these components in a few different ways; in this post I’ll be walking through what I call a combined deployment.

This is the type of Hadoop deployment that most people will be familiar with; multiple daemons are run on each service unit which gives some additional benefits with regards to data locality when processing Map Reduce jobs.

To start with we will deploy the Hadoop charm twice – once for the ‘Master’ (with just a single service unit) and once for the ‘Slave’ (with 5 service units).

juju deploy --constraints="instance-type=m1.large" hadoop hadoop-master
juju deploy --constraints="instance-type=m1.medium" -n 5 hadoop hadoop-slave

Note the use of the “–constraints” flag – this allows us to specify the size of instance that will be used by default for each service.

We now need to define the relationships between the master and slave:

juju add-relation hadoop-master:namenode hadoop-slave:datanode
juju add-relation hadoop-master:jobtracker hadoop-slave:tasktracker

Note the semantics – the relations are defined at the service level rather than the individual service unit level.

Once all of these relationships have been established, the service unit supporting the hadoop-master service will be running Name Node and Job Tracker daemons, and the service units supporting the hadoop-slave service will be running Data Node and Task Tracker daemons.

The charm uses a late binding technique; the role of a service is not decided until its related to another service.

You can now expose the hadoop-master and take a look at the web interface for the Name Node (port 50070) and Job Tracker (port 50030):

prompt> juju expose hadoop-master
prompt> juju status hadoop-master
2012-08-16 15:58:38,518 INFO Connecting to environment...
2012-08-16 15:58:41,359 INFO Connected to environment.
machines:
  1:
    agent-state: running
    dns-name: ec2-54-247-132-175.eu-west-1.compute.amazonaws.com
    instance-id: i-458db40d
    instance-state: running
services:
  hadoop-master:
    charm: cs:precise/hadoop-5
    exposed: true
    relations:
      jobtracker:
      - hadoop-slave
      namenode:
      - hadoop-slave
    units:
      hadoop-master/0:
        agent-state: started
        machine: 1
        open-ports:
        - 8020/tcp
        - 50070/tcp
        - 8021/tcp
        - 50030/tcp
        public-address: ec2-54-247-132-175.eu-west-1.compute.amazonaws.com
2012-08-16 15:58:43,555 INFO 'status' command finished successfully

The expose sub command opens up access to the service to the wider world – in the case of ec2 the internet at large.  The ports to expose are defined in the charm using the ‘open-port’ command provided by Juju.

Use Hadoop!

Now that you have a running Hadoop cluster, let’s run something on it.  Fortunately there are some standard benchmarks provided by Hadoop which you can use to try things out; we’re going to run the TeraSort benchmark which generates, sorts and validates a defined number of random 100 byte rows of data – in this case 10 million of them.

Juju provides a nice way to access service units:

juju ssh hadoop-master/0

Note that Juju will have automatically inserted you SSH public key into the service unit so you should drop straight in.

As I used this benchmark to test the charm during development, you will find a handy script in /usr/lib/hadoop/terasort.sh to execute the benchmark:

ubuntu@ip-10-48-249-101:~$ /usr/lib/hadoop/terasort.sh
Generating 10000000 using 100 maps with step of 100000
12/08/16 15:01:11 INFO mapred.JobClient: Running job: job_201208161457_0001
12/08/16 15:01:12 INFO mapred.JobClient:  map 0% reduce 0%
12/08/16 15:01:35 INFO mapred.JobClient:  map 1% reduce 0%
12/08/16 15:01:36 INFO mapred.JobClient:  map 3% reduce 0%
12/08/16 15:01:38 INFO mapred.JobClient:  map 5% reduce 0%
12/08/16 15:01:39 INFO mapred.JobClient:  map 8% reduce 0%
12/08/16 15:01:41 INFO mapred.JobClient:  map 9% reduce 0%
12/08/16 15:01:44 INFO mapred.JobClient:  map 10% reduce 0%
..

Congratulations! You are now running a map reduce job on your deployed cluster.  You should be able to see it running through the Job Tracker Web UI as well.

Scale Out!

One of the powerful features in Juju is the ability to scale up a service to provide additional capacity; this of course requires that the software you are deploying supports scale out!

juju add-unit -n 10 hadoop-slave

We have now instructed Juju to provision another 10 service units to support the hadoop-slave service; these will come online and be configured into the cluster and should start taking up the slack.

Monitor Hadoop!

Thats all well and good – but how do we know that everything is happy? Fortunately we have some other charms to help with this:

juju deploy ganglia
juju add-relation ganglia:ganglia-node hadoop-master
juju add-relation ganglia:ganglia-node hadoop-slave
juju expose ganglia

… a few minutes lates ganglia will be collecting metrics from Hadoop and the service units supporting it. You can check this out on http://ganglia-public-address/ganglia.

Configuration Options

The Hadoop charm also has a number of configuration options which can be used to tune the deployed service.  These include specifying the java memory configuration for the daemons, HDFS block sizes and numerous other performance tweaks.  I’d recommend not touching the majority of these unless you know what you are doing – the defaults should be OK for deployments of a few 100 nodes.  Each option is fully documented in the configuration file for the charm.

A few are worth mentioning:

  • pig: Apache Pig provides a high level language for expressing data analysis programs; setting this to ‘true’ will install pig alongside hadoop and configure it to use the deployed cluster.
  • webhdfs: Hadoop provides a RESTful API to HDFS – by default its not turned on.
  • hadoop.dir.base: Allows the base directory for data to be specified allowing you to make use of additional storage provided by the server; for example, ec2 instance-store images provide ephemeral storage in /mnt.  Note that this should only be used during initial deployment and should not be changed during operation of services.

Example configuration file:

hadoop-master:
  webhdfs: true
  pig: true
hadoop-slave:
  webhdfs: true

You can reconfigure you deployed services using:

juju set --config config.yaml hadoop-master
juju set --config config.yaml hadoop-slave

Configuration can also be provided on the command line:

juju set hadoop-slave pig=true

Or can be specified at deploy time:

juju deploy --config config.yaml hadoop hadoop-master

All-in-one

For the impatient here is all of the above in a single script:

echo "
hadoop-master:
  hadoop.dir.base: /mnt/hadoop
  webhdfs: true
  pig: true
hadoop-slave:
  hadoop.dir.base: /mnt/hadoop
  webhdfs: true
  pig: true" > config.yaml
juju bootstrap
juju deploy --config config.yaml --constraints="instance-type=m1.large" hadoop hadoop-master
juju deploy --config config.yaml --constraints="instance-type=m1.medium" -n 5 hadoop hadoop-slave
juju add-relation hadoop-master:namenode hadoop-slave:datanode
juju add-relation hadoop-master:jobtracker hadoop-slave:tasktracker
juju expose hadoop-master
juju add-unit -n 10 hadoop-slave
juju deploy ganglia
juju add-relation ganglia:ganglia-node hadoop-master
juju add-relation ganglia:ganglia-node hadoop-slave
juju expose ganglia
juju ssh hadoop-master/0 /usr/lib/hadoop/terasort.sh

Summary

Hopefully this has given you a feel on how easy it is to deploy Hadoop on Ubuntu Server 12.04 using Juju.

This example uses Amazon ec2 – however the Hadoop Charm can just as easily be deployed on an Openstack Cloud or directly to Bare Metal using Juju. Support for other Clouds is in the pipeline. You can even use lightweight LXC containers on an Ubuntu Desktop if you don’t want to fork out for ec2 instances!

Today the Hadoop Charm for Juju deploys Hadoop 1.0.x; over the rest of the Quantal release cycle I will be working on packaging Hadoop 2.0 and updating the charm to support this new version of Hadoop which brings additional benefits such as removing the single points of failure in the Name Node and Job Tracker that Hadoop 1.0.x suffers from.

My next post will be about combining the Hadoop charm with Zookeeper and HBase to turn Hadoop into a random access database.

Oh – and don’t forget to tear down your environment with ‘juju destroy-environment’ before you run up the ec2 bill from hell….

About these ads

4 thoughts on “Charming Hadoop

  1. [...] Hadoop « JavaCruft Posted: August 17, 2012 in Uncategorized 0 Charming Hadoop « JavaCruft. Share this:LinkedInTwitterDiggBloggerLike this:LikeBe the first to like [...]

  2. David Wu says:

    Hi there, thanks for the great post. When I run the command “juju add-relation ganglia hadoop-slave” I get the following error:

    2013-02-15 15:33:21,281 ERROR Ambiguous relation ‘ganglia hadoop-master’; could refer to:
    ‘ganglia:ganglia-node hadoop-master:ganglia’ (monitor client / monitor server)
    ‘ganglia:master hadoop-master:ganglia’ (monitor client / monitor server)
    ‘ganglia:node hadoop-master:ganglia’ (monitor client / monitor server)

    Could you please explain what can be done to resolve this? Thanks.

    • JavaCruft says:

      Since I wrote this post the ganglia charm has been refactored to support hierarchical monitoring.

      Scoping the relation to use from the ganglia charm will resolve this problem:

      juju add-relation hadoop-master ganglia:ganglia-node
      juju add-relation hadoop-slave ganglia:ganglia-node

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 157 other followers

%d bloggers like this: