Becoming a Rock Star Hadoop Administrator, Part 1

Hadoop is just storage and computing, so administering a Hadoop cluster should be a breeze, right? Well… not necessarily. When we’re talking about Hadoop, we’re talking about a fast-moving open source project that covers many disciplines and requires a deep understanding of Linux, Java, and other ecosystem projects with funny names like ZooKeeper, Flume, and Sqoop. Fear not: in this series of posts, we hope to help you on your journey to becoming a rock star Hadoop administrator.

The Hadoop Administrator Challenge

From an end user perspective, Hadoop is deceptively simple – your developers can go from nothing to a functioning program in just a couple dozen keystrokes. With interfaces like Hive and Pig (and now Impala), end users can do useful work even without knowledge of the inner workings of HDFS or MapReduce.
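For instance, here’s the kind of one-liner an analyst might run through Hive to count page views without ever touching MapReduce directly (the weblogs table is hypothetical, purely for illustration):

    hive -e "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page;"

Hive compiles that single statement into MapReduce jobs and runs them across the cluster on the user’s behalf.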

Administrators, however, are not so lucky.

In order to properly set up and maintain a cluster at optimum performance, Hadoop admins need to have a decent understanding of (deep breath): Linux; HDFS and MapReduce daemons (with two flavors of the latter and two ways to set up the former); Java (especially JVM configuration and tuning); and a handful of Hadoop ecosystem projects.
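A quick sanity check for which of those daemons a given box is actually running is the JDK’s jps tool, which lists Java processes by class name. Output like the following (the PIDs are illustrative) is what you’d hope to see on a healthy MRv1 slave node:

    $ sudo jps
    4816 DataNode
    4923 TaskTracker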

On top of all that, admins need tools to configure and manage dozens, hundreds, or thousands of independent servers from both a hardware and software perspective.

Perhaps the biggest challenge of such a system is the fast-moving, open source nature of the core software itself. New versions of the software quickly make the “old way” of doing things obsolete and inefficient.

The Basics of Hadoop Hardware

As an administrator, your primary responsibility is configuring and maintaining your cluster for optimal performance. From a hardware perspective, this means choosing the best machines you can afford (dual bonded NICs, RAIDed drives, redundant power supplies, carrier-class everything) for your “master” machines (which host your NameNode and JobTracker daemons), and affordable machines (JBOD disks, single NIC, single power supply, commodity-class everything) for your slave nodes (which run your DataNode and TaskTracker daemons, in addition to Map and Reduce JVMs). For both, you probably want machines with some growth capacity left in them, allowing you to upgrade the hardware later and extend their lifetimes. A good rule of thumb is that master machines should cost you no more than $10,000–12,000 per machine, and slave nodes should be roughly half that.

After you’ve got your hardware running, you’ll need to configure various parameters for your HDFS and MapReduce daemons, including things like JVM heap sizes, the number of map and reduce slots, and the disk space allocated to HDFS and to intermediate data.
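As a minimal sketch, on an MRv1 cluster most of those knobs live in mapred-site.xml and hdfs-site.xml (inside the <configuration> element). The values below are placeholders to show the property names, not tuning recommendations; the directory paths assume one mount point per JBOD disk:

    <!-- mapred-site.xml -->
    <property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>    <!-- map slots per slave -->
    <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>4</value></property> <!-- reduce slots per slave -->
    <property><name>mapred.child.java.opts</name><value>-Xmx1g</value></property>             <!-- heap per task JVM -->
    <property><name>mapred.local.dir</name><value>/data/1/mapred,/data/2/mapred</value></property> <!-- intermediate (shuffle) data -->

    <!-- hdfs-site.xml -->
    <property><name>dfs.data.dir</name><value>/data/1/dfs,/data/2/dfs</value></property>      <!-- DataNode block storage -->
    <property><name>dfs.datanode.du.reserved</name><value>10737418240</value></property>      <!-- bytes per volume reserved for non-HDFS use -->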

Once you’re up and running, chances are you’ll quickly need both to tweak your existing hardware and software and to add new slave nodes (and possibly master nodes). Following best practices for each will save you tons of time and save your company tens or hundreds of thousands of dollars.
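To make “add a slave node” concrete, here’s a rough sketch of the manual workflow on a CDH4/MRv1 cluster (package and service names may differ slightly by OS and CDH release): install the daemons, push your cluster’s config files to the new box, start the services, confirm the node registered, and rebalance:

    sudo yum install hadoop-hdfs-datanode hadoop-0.20-mapreduce-tasktracker
    # copy your cluster's *-site.xml files into /etc/hadoop/conf, then:
    sudo service hadoop-hdfs-datanode start
    sudo service hadoop-0.20-mapreduce-tasktracker start
    sudo -u hdfs hdfs dfsadmin -report   # the new DataNode should show up as a live node
    sudo -u hdfs hdfs balancer           # spread existing blocks onto the new disks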

Best Practices: CDH4 and Cloudera Manager

First and foremost, when deploying your cluster, it really makes no sense to use anything but what nearly everyone in the industry is using: Cloudera’s Distribution Including Apache Hadoop, version 4 (CDH4).

With CDH4, Cloudera has packaged and bundled not only base Hadoop, but also a couple hundred patches to it, as well as over a dozen commonly used ecosystem projects such as Hive, Pig, Oozie, Flume, and Sqoop (all pre-configured to work in concert straight out of the gate).

With CDH4, you get Hadoop up and running in just a few minutes and with a high confidence factor that it’s all set up correctly for most use cases. Just as you can download Linux and build your own distribution, you can download base Hadoop, and build it from scratch (assuming you have a couple weeks and Hadoop experts in house who can choose and install the proper patches and ecosystem projects). For 99.9% of the organizations out there, CDH4 just makes the most sense.
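To give a sense of “a few minutes,” here’s roughly what the package side of a CDH4 MRv1 install looks like once you’ve added Cloudera’s CDH4 yum repository per their install guide (package names vary slightly by Linux distribution, so treat these as illustrative; the matching slave packages appear in the add-a-node sketch above):

    sudo yum install hadoop-hdfs-namenode hadoop-0.20-mapreduce-jobtracker   # master daemons
    sudo yum install hadoop-client                                           # libraries and configs for job submission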

It’s worth mentioning two things here: first, CDH4 is 100% open source and in no way locks you into Cloudera as a provider; second, Cloudera is not the only packager of Hadoop. Hortonworks, a Yahoo-backed Cloudera competitor, packages its own Hadoop distribution, which is likewise 100% open source.

When you closely compare the Cloudera and Hortonworks offerings, you’ll find that Cloudera is more widely deployed, has made more influential contributions to the Hadoop ecosystem (Impala, Flume, Sqoop, etc.), and employs more core members of the Hadoop community, including Doug Cutting, Tom White, Eric Sammer, Lars George, and Jeff Hammerbacher. I think competition is great and would love to be able to say that Cloudera and Hortonworks are equal, but for now, Cloudera has the history, the deployment base, and, most importantly, the core members of the community.

In addition to giving you a very easy way to get Hadoop up and running, Cloudera also provides an excellent GUI-based management tool called Cloudera Manager. Cloudera Manager takes the drudgery out of configuring and managing hundreds or thousands of independent servers in your cluster, and since Cloudera lifted the previous “50 nodes for free” restriction, it is now 100% free to use on clusters of any size.
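Getting Cloudera Manager itself running is a short exercise: download the installer, run it, and point a browser at the admin console. The download URL below is illustrative and may have moved, so check Cloudera’s documentation for the current path:

    wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
    chmod u+x cloudera-manager-installer.bin
    sudo ./cloudera-manager-installer.bin
    # then browse to http://<cm-server-host>:7180 and let the wizard deploy agents to your nodes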

Cloudera Manager can also easily be disabled later should you choose to manage your cluster with another service or a homegrown solution (although it’s hard to see why you would).

Related Posts:
Become a Rock Star Hadoop Developer
Using Hadoop Like a Boss
Beyond the Buzz: Big Data and Apache Hadoop

Related Courses:
Cloudera Essentials for Apache Hadoop
Cloudera Administrator Training for Apache Hadoop
Cloudera Training for Apache Hive and Pig
