Become a Rock Star Hadoop Developer

So you want to learn how to become a Hadoop Developer? A little bit of Java, a little bit of Linux: you definitely need skills that span multiple disciplines. It's certainly not an easy task, but here are some tips and tricks I've learned as a Cloudera instructor to help you on your way to becoming a Hadoop rock star!

Hadoop is written in Java, and if you’re a Java Developer already, you’ll be very comfortable getting around in the codebase. If you’re like me, however, and Java is not the primary language you write code in, you may find the following quick refresher valuable.

In addition to understanding some basic Java concepts like JVMs, Generics, and annotations, you'll also want to get familiar with some actual products like Eclipse, JUnit, and Log4J. Let's first look at some of the concepts.

Basic Java

A JVM, or "Java Virtual Machine," is where your actual Java code (or rather, bytecode) runs. Think of it as a daemon that runs on your machine with its own "sandboxed" area for memory, processor time, and so on: if you lock up one JVM, the other daemons on the machine can continue running just fine. You'll find that Hadoop uses multiple daemons, both for core "master" functionality and for running individual jobs. This makes Hadoop very fault tolerant, as each of these JVMs is isolated from the others.

To get your raw Java ready to run in Hadoop, you'll first run the Java compiler, javac, which generates bytecode that can run in any JVM. You'll then collect all of your compiled classes into a single Java Archive, or .jar file (much like a tarball), and submit that .jar file to Hadoop to run. The whole process looks something like this:

    javac -classpath `hadoop classpath` *.java            # compile
    jar cvf wc.jar *.class                                 # build the archive file
    hadoop jar wc.jar WordCount shakespeare wordcounts     # run in Hadoop

In that last command, the argument after the jar file (WordCount) tells Hadoop which Java class contains the static "main" method, the special reserved method where Java starts executing.
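For reference, a minimal WordCount driver containing that main method might look something like the sketch below. This is an illustrative outline using the standard Hadoop MapReduce API, not the exact code from the course, and the SumReducer class it references is assumed to exist alongside the WordMapper we'll look at next.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        // Hadoop invokes this method when you run "hadoop jar wc.jar WordCount ..."
        Job job = Job.getInstance();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        // args[0] is the input directory (shakespeare), args[1] the output directory (wordcounts)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wire up the mapper and reducer and declare the job's output key/value types
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);     // SumReducer is assumed for this sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }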

In Hadoop, you'll be writing at least a "Mapper," which processes the key-value pairs of your input data one by one. Your Mapper class will have a "map" method that accepts those key-value pairs. Let's look at a sample of such a method from the actual Developer course:

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {

In this example, you can see that we define a class called "WordMapper" which extends the base Hadoop "Mapper" class. You can probably also tell that we're supplying it with some list of types (LongWritable, Text, ...), but what's up with those angle brackets "< >"?

Angle brackets in Java designate what are known as Generics. Generics are fairly recent additions to Java, which are commonly used to allow collections to specify the types of objects that they hold. Essentially, this allows your code to fail at compile time (rather than run time) if you pass in a variable that doesn’t match the Generic type definition.
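As a quick illustration (a plain Java example, not something from the Hadoop codebase), compare what the compiler lets you do with a typed collection:

    import java.util.ArrayList;
    import java.util.List;

    public class GenericsDemo {
      public static void main(String[] args) {
        // The angle brackets tell the compiler this List may only hold Strings
        List<String> words = new ArrayList<String>();
        words.add("Othello");
        // words.add(42);   // would fail at compile time: an int is not a String
        System.out.println(words.get(0).toUpperCase());   // no cast required
      }
    }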

In our example above, the first two Generic types, "LongWritable" and "Text," specify the input key and value types, and the last two, "Text" and "IntWritable," specify the output key and value types.

The next line, "@Override," is what is known as an annotation. Annotations start with "@" and attach extra information to the code that follows them, for the compiler or for tools and frameworks at run time. In this case, we're telling the compiler that we intend to override the "map" method of the base Mapper class, which we then do on the following line.
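Putting those pieces together, a complete WordMapper might look something like the sketch below. The method body is our own illustration of the classic word-count pattern rather than the exact code from the course:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is one line of input text
        String line = value.toString();

        // Split the line into words and emit each one with a count of 1;
        // the reducer will later sum those 1s into a total per word
        for (String word : line.split("\\W+")) {
          if (word.length() > 0) {
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }
    }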

In class, we also use annotations with JUnit (a popular unit-testing framework) and Log4J (a popular logging framework).
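For example, in JUnit 4 a test is just an ordinary method marked with the @Test annotation, and the framework finds and runs it for you. This is a generic illustration rather than one of the actual lab exercises:

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class WordSplittingTest {

      @Test   // tells JUnit to run this method as a test case
      public void splittingALineYieldsTheExpectedWordCount() {
        String[] words = "to be or not to be".split("\\W+");
        assertEquals(6, words.length);
      }
    }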

Linux: Even More Important

Having good Java skills helps a ton in class, but what may actually help even more is a solid understanding of Linux.

Hadoop was written to run on Linux, and that's where it's happiest. In most of the Cloudera Certified Hadoop classes, Linux is used exclusively for the labs.

In addition to basic operations (like cd, pwd, ls, cat, grep, cp, etc.), you'll also want to get familiar with some more advanced commands like find, sort, awk, and sed, and be comfortable with a Linux text editor like "vim" or "emacs."

For example, in class we run a MapReduce job called "WordCount," which simply scans a large dataset and returns a list of words along with the number of times each appears. For an example of how useful some simple Linux commands can be, let's look at the following output from the complete works of Shakespeare:

  • a — 1162
  • the — 1836
  • Othello — 128
  • Romeo — 42

Hadoop, by default, returns all keys in sorted order (a, the, Othello, Romeo, ...), but in this case we may want to sort on that second column (the number of times each word appears). If our output were in a file called "TotalWordCount," we could get those results simply by using the "sort" command as follows:

    sort -k2 -r -n TotalWordCount

The "-k2" switch tells sort to use the second column (the word frequency) as the sort key, the "-r" switch tells it to do a reverse sort (so the highest numbers appear at the top of the results), and the "-n" switch tells it to do a numeric sort (so that "1836" sorts above "42" instead of being compared character by character).

Although we don't have time to go into find, awk, and sed here, a few minutes with the "man" pages for each will give you a good head start on using them in class.

Additional Reading

In addition to spending some time on the Linux command line and inside the Eclipse editor, spend some time reading the following books to really ensure that you begin any training (or actual Hadoop work) as a complete rock star.

Effective Java by Joshua Bloch
Unix Power Tools by Peek, Powers, O’Reilly, Loukides
Hadoop: The Definitive Guide by Tom White

Related Courses
Cloudera Essentials for Apache Hadoop
Cloudera Administrator Training for Apache Hadoop
Cloudera Training for Apache Hive and Pig

Related Posts
Using Hadoop Like a Boss
Beyond the Buzz: Big Data and Apache Hadoop

About the Author: Rich Morrow

Rich brings two decades of experience in IT as a developer, system administrator, trainer, mentor, and team builder. As a consultant, Rich focuses on the development and maintenance of large-scale, mission-critical custom web applications, particularly those leveraging LAMP. Coming from a startup-centric, development-operations background, Rich has a special appreciation and love for enabling technologies like cloud and big data.
