Refactor Hadoop job: old to new API

In this post I want to improve the hadoop job I created earlier: wiki-pagerank-with-hadoop.

The goals:

Update to the new stable hadoop version.
Use the new API.
Set up a single-node hadoop-cluster.
Easily build a job-jar with maven.

I used version 0.20.204.0 for the initial hadoop jobs. The hadoop committers have been busy in the meantime; the latest stable version at the time of writing is 2.2.0.

1. Update Hadoop Version

Updating to the latest version is just a matter of updating the version in the pom.xml and maven will do the rest for you, as long as the version is available in the repository. In my pom I have configured the mirror ibiblio which contains the latest version of hadoop. The client classes to build a job are now in a separate dependency.


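<dependencies>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
  </dependency>
  ...
</dependencies>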

Let's download the new version:

mvn clean compile

2. Use new API

In the IDE I noticed that the classes from org.apache.hadoop.mapred.* are no longer deprecated. In the old version some of these classes were marked as deprecated because a new API had become available. The old API has since been un-deprecated because it is here to stay. Note: both the old and the new API will work.

This is an overview of the most important changes:

What	Old	New
classes location	org.apache.hadoop.mapred	org.apache.hadoop.mapreduce
conf class	JobConf	Configuration
map method	map(k1, v1, OutputCollector, Reporter)	map(k1, v1, Context)
reduce method	reduce(k2, Iterator<v2>, OutputCollector, Reporter)	reduce(k2, Iterable<v2>, Context)
reduce loop	while (values.hasNext())	for (v2 value : values)
map/reduce	throws IOException	throws IOException, InterruptedException
before map/reduce	configure(JobConf job)	setup(Context ctx)
after map/reduce	close()	cleanup(Context ctx)
client class	JobConf / JobClient	Job
run job	JobClient.runJob(job)	job.waitForCompletion(true)
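
The reduce-loop change in the table is mostly mechanical. The following is a minimal sketch in plain Java, with no Hadoop dependency, that puts the two loop styles side by side. The class ReduceLoopSketch and the methods sumOldStyle and sumNewStyle are hypothetical names used only for illustration; they are not part of the job.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceLoopSketch {

    // Old-API style: reduce() is handed an Iterator and walks it by hand.
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: reduce() is handed an Iterable, so a for-each loop works.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumOldStyle(values.iterator()));
        System.out.println(sumNewStyle(values));
    }
}
```

Both methods compute the same result; only the way the values are traversed differs, which is why this part of the migration rarely needs more than a loop rewrite.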

Migrating to the new API was not difficult. Delete all imports of org.apache.hadoop.mapred.* and re-import from the correct package, then fix the errors one by one. You can check out the new code on GitHub to see the changes I made to support the new API. The functionality has not changed.

Old api:

public class RankCalculateMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    ...
    output.collect(new Text(page), new Text("|"+links));
  }
}

New api:

public class RankCalculateMapper extends Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    ...
    context.write(new Text(page), new Text("|" + links));
  }
}
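
The setup and cleanup hooks from the table run once per task, around the per-record map calls. The sketch below models that call order in plain Java. MiniMapper is a hypothetical stand-in, not a Hadoop class, and it records calls in a simple list instead of using a real Context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LifecycleSketch {

    // Hypothetical stand-in for the new-API Mapper base class (not a Hadoop type).
    static abstract class MiniMapper {
        protected void setup(List<String> log) { }
        protected abstract void map(String record, List<String> log);
        protected void cleanup(List<String> log) { }

        // Models the call order: setup once, map per record, cleanup once.
        final void run(List<String> records, List<String> log) {
            setup(log);
            for (String record : records) {
                map(record, log);
            }
            cleanup(log);
        }
    }

    // Runs a mapper that logs every hook it receives and returns the log.
    static List<String> trace(List<String> records) {
        List<String> log = new ArrayList<>();
        new MiniMapper() {
            @Override protected void setup(List<String> l) { l.add("setup"); }
            @Override protected void map(String record, List<String> l) { l.add("map:" + record); }
            @Override protected void cleanup(List<String> l) { l.add("cleanup"); }
        }.run(records, log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace(Arrays.asList("a", "b")));
    }
}
```

In a real job, setup is a natural place to read values from the Configuration, and cleanup takes over what close() did in the old API.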

3. Set up a single-node hadoop-cluster.

In the previous post I used an Eclipse plugin to run the main class directly in the hadoop cluster. Under the hood the plugin creates the job-jar and sends it to the cluster. Since I have switched IDEs I can’t use the plugin anymore, and I want an independent job builder so the job can be created and run without Eclipse. Installing your own single-node cluster is very easy these days.

Install virtualization software (VirtualBox or VMware Player) and download a quickstart distribution from your favorite vendor:

Cloudera: http://go.cloudera.com/vm-download
Hortonworks: http://hortonworks.com/products/hortonworks-sandbox/

After starting the virtual machine, a ‘cluster’ is available (pseudo-distributed mode). From your own machine, outside the virtual machine, you should be able to access the HUE page at http://localhost:8888.

To move files between host and virtual-machine there are multiple options.

Configure a shared folder.
Use the HUE web interface to upload files.
Use secure copy over ssh.

I went for option 3: scp. For that I need to reach the virtual machine over ssh on port 22.
My VirtualBox was configured with NAT networking and port forwarding, so I added a port-forwarding rule from host port 2222 to guest port 22.

Now I can copy files from my machine to the virtual machine using:

scp -P 2222 data_subset.xml cloudera@localhost:~

On the virtual machine, put the data into HDFS at the correct location. Relative paths resolve against the HDFS home directory, here /user/cloudera.

hadoop fs -mkdir wiki
hadoop fs -mkdir wiki/in
hadoop fs -put data_subset.xml /user/cloudera/wiki/in/

4. Easily build a job-jar with maven.

In the original setup the code and Eclipse were tightly coupled. To build a standalone jar I used the maven-assembly-plugin. Make sure you do not bundle hadoop-common and hadoop-client in your jar: mark these dependencies with scope provided, since the cluster already supplies them. Dependencies that are not scoped test or provided, such as commons-io, will be included in your jar.


  
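<build>
  <plugins>
    ...
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </plugin>
  </plugins>
</build>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>

  ...
</dependencies>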

Now you can create your job:

mvn assembly:assembly

[INFO] --- maven-assembly-plugin:2.4:assembly (default-cli) @ hadoop-wiki-pageranking ---
[INFO] Building jar: /Users/abij/projects/github/hadoop-wiki-pageranking/target/hadoop-wiki-pageranking-0.2-SNAPSHOT-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.590s
[INFO] Finished at: Fri Jan 31 13:25:18 CET 2014
[INFO] Final Memory: 15M/313M
[INFO] ------------------------------------------------------------------------

Now copy the freshly baked job to the cluster and run it.
Local machine (from the target directory):

scp -P 2222 hadoop-wiki-pageranking-*-dependencies.jar cloudera@localhost:~

Virtual machine:

hadoop jar hadoop-wiki-*.jar com.xebia.sandbox.hadoop.WikiPageRanking

Watch the process in HUE:
http://localhost:8888/jobbrowser/

Recap

In this post we updated the vanilla hadoop map-reduce job from the old API to the new API. Updating was not hard, but it touches all the classes. We used maven to build a job-jar from the code and ran it in a cluster. The new source code is available in my GitHub repository: https://github.com/abij/hadoop-wiki-pageranking