Automated install of CDH5 Hadoop on your laptop with Ansible

Installing CDH5 from the tarball distribution is not really difficult, but getting the pseudo-distributed configuration right is anything but straightforward. And since there are a few bugs that need fixing and some configuration that needs to be done, I automated it.

Automating the steps

All the steps that need to be automated are described in my previous blog post: Local and Pseudo-distributed CDH5 Hadoop on your laptop

All I needed to do was write some Ansible configuration scripts to perform these steps. For now I automated the steps to download and install CDH5, Spark, Hive, Pig and Mahout. Any extra packages are left as an exercise to the reader. I welcome your pull requests.

Configuration

Ansible needs some information from the user about the directory to install the software into. I first tried to use Ansible’s vars_prompt section. This kind of works, but the scope of the variable is within the same yml file only, and I need it to be a global variable. After testing several of Ansible’s ways to provide variables, I decided to use a bash script to get the user’s input and pass that information to Ansible through the --extra-vars command line option.
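To make the idea concrete, here is a minimal sketch of such a wrapper. It is not the script from the repository: the variable name install_dir and the playbook file site.yml are assumptions chosen for illustration. The point is only that a value passed through --extra-vars is visible to every play in the run.

```shell
#!/usr/bin/env bash
# Sketch of a prompt-then-run wrapper. The variable name "install_dir" and the
# playbook file "site.yml" are assumptions, not names from the repository.

# Format the key=value pair handed to --extra-vars.
# Note: the key=value form does not cope with spaces in the path.
extra_vars_for() {
    printf 'install_dir=%s' "$1"
}

# Ask for the install directory, then run the playbook with it set globally.
run_playbook() {
    printf 'Install directory: '
    read -r dir
    ansible-playbook site.yml --extra-vars "$(extra_vars_for "$dir")"
}

# run_playbook is defined but not called here; invoke it to start an install.
```

Keeping the formatting in its own small function makes it easy to check what Ansible will receive before anything is installed.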

Besides that, we use Ansible to run a playbook, which means the ansible-playbook command needs to be available. We assume ansible-playbook is on the PATH and works.
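If you are not sure whether that assumption holds on your machine, a quick check along these lines can help. This is just a sketch; the helper name have_command is made up for this example.

```shell
# Return success when the named command can be found on the PATH.
# "have_command" is a hypothetical helper name used only in this sketch.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command ansible-playbook; then
    echo "ansible-playbook found at $(command -v ansible-playbook)"
else
    echo "ansible-playbook not found on PATH; install Ansible first" >&2
fi
```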

Getting the install scripts

Getting the install scripts is done by issuing a git clone command:

$ git clone git@github.com:krisgeus/ansible_local_cdh_hadoop.git

Install

Installing the software has become a single line command:

$ start-playbook.sh

The script will ask the user for a directory to install the software into. It then downloads the packages into the $HOME/.ansible-downloads directory and unpacks them into the install directory the user provided.

In the install directory the script will create a bash_profile add-on to set the correct aliases.

$ source ${INSTALL_DIR}/.bash_profile_hadoop

Testing Hadoop in local mode

$ switch_local_cdh5

Now all the familiar hadoop commands should work. There is no notion of HDFS other than your local filesystem, so the hadoop fs -ls / command will show you the same entries as ls /

$ hadoop fs -ls /

    drwxrwxr-x   - root admin       2686 2014-04-18 09:47 /Applications
    drwxr-xr-x   - root wheel       2210 2014-02-26 02:46 /Library
    drwxr-xr-x   - root wheel         68 2013-08-25 05:45 /Network
    drwxr-xr-x   - root wheel        136 2013-10-23 03:05 /System
    drwxr-xr-x   - root admin        204 2013-10-23 03:09 /Users
    drwxrwxrwt   - root admin        136 2014-04-18 12:34 /Volumes
    [...]

$ ls -l /

    drwxrwxr-x+ 79 root  admin   2.6K Apr 18 09:47 Applications
    drwxr-xr-x+ 65 root  wheel   2.2K Feb 26 02:46 Library
    drwxr-xr-x@  2 root  wheel    68B Aug 25  2013 Network
    drwxr-xr-x+  4 root  wheel   136B Oct 23 03:05 System
    drwxr-xr-x   6 root  admin   204B Oct 23 03:09 Users
    drwxrwxrwt@  4 root  admin   136B Apr 18 12:34 Volumes
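If you would rather compare the two listings mechanically than by eye, something like the following could be used. It is a rough sketch: the function name hadoop_ls_paths is made up, it assumes paths without spaces, and hidden files may show up in one listing but not the other.

```shell
# Extract the path column from `hadoop fs -ls` output read on stdin.
# Assumes paths contain no spaces; header lines such as "Found N items"
# are skipped because they do not start with a permission string.
hadoop_ls_paths() {
    awk '$1 ~ /^[d-]/ { print $NF }' | sort
}

# Possible usage (not run here):
#   hadoop fs -ls / 2>/dev/null | hadoop_ls_paths > /tmp/hadoop-root.txt
#   ls -d /* | sort > /tmp/local-root.txt
#   diff /tmp/hadoop-root.txt /tmp/local-root.txt
```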

Running a MapReduce job should also work out of the box.

$ cd $HADOOP_PREFIX

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar pi 10 100

    Number of Maps  = 10
    Samples per Map = 100
    2014-04-19 18:05:01.596 java[74281:1703] Unable to load realm info from SCDynamicStore
    14/04/19 18:05:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Wrote input for Map #0
    Wrote input for Map #1
    Wrote input for Map #2
    Wrote input for Map #3
    Wrote input for Map #4
    Wrote input for Map #5
    Wrote input for Map #6
    Wrote input for Map #7
    Wrote input for Map #8
    Wrote input for Map #9
    Starting Job
    ....
    Job Finished in 1.587 seconds
    Estimated value of Pi is 3.14800000000000000000

Testing Hadoop in pseudo-distributed mode

$ switch_pseudo_cdh5
$ hadoop namenode -format
$ start-dfs.sh
$ hadoop fs -ls /
$ hadoop fs -mkdir /bogus
$ hadoop fs -ls /
    2014-04-19 19:46:32.233 java[78176:1703] Unable to load realm info from SCDynamicStore
    14/04/19 19:46:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    drwxr-xr-x   - user supergroup          0 2014-04-19 19:46 /bogus

OK, HDFS is working. Now on to a MapReduce job.

$ start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /cdh5.0.0/hadoop-2.3.0-cdh5.0.2/logs/yarn-user-resourcemanager-localdomain.local.out
    Password:
    localhost: starting nodemanager, logging to /cdh5.0.0/hadoop-2.3.0-cdh5.0.2/logs/yarn-user-nodemanager-localdomain.local.out
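Before submitting a job it can be worth confirming that the daemons actually stayed up. The JDK's jps tool lists running Java processes as "<pid> <class name>", so a small helper can look for the expected names. This is only a sketch: the function names are made up, and the list of daemons is what I would expect for this pseudo-distributed setup.

```shell
# Succeed when a daemon name appears as the second column of jps-style
# output ("<pid> <class name>") read from stdin.
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Report which of the expected daemons are present in the given jps output.
report_daemons() {
    for daemon in NameNode DataNode ResourceManager NodeManager; do
        if printf '%s\n' "$1" | has_daemon "$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: missing"
        fi
    done
}

# Possible usage (not run here): report_daemons "$(jps)"
```

Matching on the whole second column rather than a substring avoids counting SecondaryNameNode as a NameNode.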

$ cd $HADOOP_PREFIX
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar pi 10 100
    Number of Maps  = 10
    Samples per Map = 100
    2014-04-20 10:21:56.696 java[80777:1703] Unable to load realm info from SCDynamicStore
    14/04/20 10:22:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Wrote input for Map #0
    Wrote input for Map #1
    Wrote input for Map #2
    Wrote input for Map #3
    Wrote input for Map #4
    Wrote input for Map #5
    Wrote input for Map #6
    Wrote input for Map #7
    Wrote input for Map #8
    Wrote input for Map #9
    Starting Job
    14/04/20 10:22:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    14/04/20 10:22:12 INFO input.FileInputFormat: Total input paths to process : 10
    14/04/20 10:22:12 INFO mapreduce.JobSubmitter: number of splits:10
    14/04/20 10:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1397969462544_0001
    14/04/20 10:22:13 INFO impl.YarnClientImpl: Submitted application application_1397969462544_0001
    14/04/20 10:22:13 INFO mapreduce.Job: The url to track the job: http://localdomain.local:8088/proxy/application_1397969462544_0001/
    14/04/20 10:22:13 INFO mapreduce.Job: Running job: job_1397969462544_0001
    14/04/20 10:22:34 INFO mapreduce.Job: Job job_1397969462544_0001 running in uber mode : false
    14/04/20 10:22:34 INFO mapreduce.Job:  map 0% reduce 0%
    14/04/20 10:22:53 INFO mapreduce.Job:  map 10% reduce 0%
    14/04/20 10:22:54 INFO mapreduce.Job:  map 20% reduce 0%
    14/04/20 10:22:55 INFO mapreduce.Job:  map 30% reduce 0%
    14/04/20 10:22:56 INFO mapreduce.Job:  map 40% reduce 0%
    14/04/20 10:22:57 INFO mapreduce.Job:  map 50% reduce 0%
    14/04/20 10:22:58 INFO mapreduce.Job:  map 60% reduce 0%
    14/04/20 10:23:12 INFO mapreduce.Job:  map 70% reduce 0%
    14/04/20 10:23:13 INFO mapreduce.Job:  map 80% reduce 0%
    14/04/20 10:23:15 INFO mapreduce.Job:  map 90% reduce 0%
    14/04/20 10:23:16 INFO mapreduce.Job:  map 100% reduce 100%
    14/04/20 10:23:16 INFO mapreduce.Job: Job job_1397969462544_0001 completed successfully
    ...
    Job Finished in 64.352 seconds
    Estimated value of Pi is 3.14800000000000000000

Testing Spark in local mode

$ switch_local_cdh5
$ spark-shell
    SLF4J: Class path contains multiple SLF4J bindings.
    ...
    2014-04-20 09:48:25,238 INFO  [main] spark.HttpServer (Logging.scala:logInfo(49)) - Starting HTTP Server
    2014-04-20 09:48:25,302 INFO  [main] server.Server (Server.java:doStart(266)) - jetty-7.6.8.v20121106
    2014-04-20 09:48:25,333 INFO  [main] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SocketConnector@0.0.0.0:62951
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 0.9.0
          /_/

    Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_15)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...
    Created spark context..
    Spark context available as sc.

    scala>

And we’re in!!

Testing Spark in pseudo-distributed mode

Now, as a final test, we check whether Spark works on our pseudo-distributed Hadoop config.

$ switch_pseudo_cdh5
$ start-dfs.sh
$ start-yarn.sh
$ hadoop fs -mkdir /sourcedata
$ hadoop fs -put somelocal-textfile.txt /sourcedata/sometext.txt
$ spark-shell
    scala> val file = sc.textFile("/sourcedata/sometext.txt")
    scala> file.take(5)

           res1: Array[String] = Array("First", "five lines", "of", "the", "textfile" )

The current version of the Ansible scripts is set to install the CDH version 5.0.2 packages. When a new version becomes available, this is easily changed by updating the vars/common.yml YAML file.
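To see where the version is set before editing, a plain text search over the vars file is usually enough. The sketch below assumes nothing about the variable names in vars/common.yml; it only looks for the literal version string, and the helper name version_lines is made up.

```shell
# Print the lines (with line numbers) of a vars file that mention a version
# string. -F treats the version as literal text, so the dots are not wildcards.
version_lines() {
    grep -n -F -- "$2" "$1"
}

# Possible usage from the repository root (not run here):
#   version_lines vars/common.yml 5.0.2
```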

If you have created ansible files to add other packages I welcome you to send me a pull request.