Single Cluster Hadoop – from Zero to Hadoop

Before I get into the installation… what is Hadoop anyway? Normally I don't like the "cut-and-paste" approach to blogging, but in this case I will make an exception for the Apache docs: "The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets." There is quite a lot of information on the Apache site that goes into great detail on Hadoop; at some point I will boil that down into something simpler to digest, but for now, let's get a node up and running. (Image credit: Apache documentation)

http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html#NameNode+and+DataNodes

There are quite a few of these "how to install" posts out there, so I am sure there is nothing ground-breaking here; however, I did have to cobble together a number of different posts (with my configuration) to get things working. This post is as much for me to document (and be able to re-create) the install as it is to help others make it work. It should have your Hadoop services up and running in (hopefully) less than an hour on a single-node cluster.

The first step… build a Linux box, which really is not a big deal, even if you have never done it before. I started with VirtualBox, which runs on Windows; the install to get the VirtualBox software running is straightforward.

Install Linux

From there, you need to download an .iso file for some version of Linux… my choice, for compatibility with other development environments at work, was CentOS 6. I went with the minimal install; if you want the desktop environment etc., you can pull down the full install. There is no real magic to the base installation. When configuring the host, you will need more than 512MB of RAM (you can scale back later if necessary), and I created a 30GB drive for the install as well. The drive is larger than the install needs, but the hope is to get Hadoop running, which will need some space to put things. Depending on the hardware you are hosting this on, you may also need to enable PAE in the settings… if the installation fails because of PAE support, go back to the host settings and select the PAE-enabled checkbox.

Since I did the minimal install ISO, there are a few things to add/enable after the install; any of the yum install commands will just tell you there is "nothing to do" if you already have that component installed.

  1. bridge the network adapter (in the VM manager software)
  2. edit /etc/sysconfig/network-scripts/ifcfg-eth0 and set ONBOOT="yes"  (enables the network interface on boot; see the sketch after this list)
  3. restart VM
  4. ifconfig (to verify a valid IP has been obtained)
  5. yum install make
  6. yum install vim
  7. yum install ant
  8. yum install java
  9. yum install svn
  10. yum install wget
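For step 2, the sketch below is roughly what the relevant bits of the file look like on a DHCP setup (the device name is an assumption; a minimal CentOS 6 install usually calls the first interface eth0), followed by the restart/verify steps and a one-shot version of the package installs:

# /etc/sysconfig/network-scripts/ifcfg-eth0 (relevant lines)
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes

# restart networking (or reboot the VM), then confirm an IP was obtained
service network restart
ifconfig

# steps 5-10 can also be done in one shot
yum install make vim ant java svn wget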

At this point… shut down the Linux VM and make a clone… you now have the Linux install built; no need to re-do that part if you mess things up.

Hadoop

Get the version of Hadoop you want… I used 1.1.1:

In the /bin directory (or /usr/local/bin if you prefer… but then change the later references from /bin accordingly; since this is not a production machine, I am just dropping it in /bin for simplicity). The full sequence is also sketched after the bullets below:

wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz

  • gunzip the file
  • unpack (tar -xvf)
  • mv hadoop-1.1.1 hadoop
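Put together, the download and unpack steps look roughly like this (run from /bin, per the note above; use whichever mirror is close to you):

cd /bin
wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz
gunzip hadoop-1.1.1.tar.gz
tar -xvf hadoop-1.1.1.tar
mv hadoop-1.1.1 hadoop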

Once you have Hadoop unpacked, you need to set a few environment variables before you can use it… add these to your shell profile (e.g. ~/.bashrc or ~/.bash_profile); if Java doesn't resolve the way you expect, there is a quick sanity check after the list.

  • export JAVA_HOME=/usr/bin
  • PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
  • export HADOOP_HOME=/bin/hadoop
  • #optional
  • export PS1="[\u@\h \w]"
  • #allow incoming requests for the services
  • service iptables stop
  • setenforce 0
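A couple of notes on the list above: the iptables/setenforce lines turn the firewall and SELinux enforcement off, which is fine for a throw-away VM but not something to keep for production (and neither change survives a reboot on its own). Also, JAVA_HOME generally wants to be a directory that contains bin/java; if the exports don't behave as expected, these stock commands will show where Java actually lives on your box:

readlink -f $(which java)    # the real path of the java binary
echo $JAVA_HOME              # should be a directory containing bin/java
java -version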

SSH public key authentication

Hadoop uses SSH public-key authentication to manage its nodes (remote machines); in this case we are setting up localhost with the public key.

in your ~/.ssh directory (make it if you need to)

ssh-keygen -t rsa

Press <enter> to accept the default id_rsa file name, then enter a passphrase (no passphrase makes startup easier).

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

ssh localhost

Type 'yes' to accept the host fingerprint and press <enter>; you should get a shell on localhost without a password prompt.
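If ssh localhost still asks for a password, the usual culprit is file permissions; sshd is picky about the key files (this is general SSH behavior, not Hadoop-specific):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh localhost    # should log you in without a password prompt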

Disable IPv6

Edit /etc/modprobe.d/blacklist.conf and add:

#disable IPv6
blacklist ipv6
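On some CentOS 6 kernels the IPv6 module is built in and the blacklist entry alone doesn't take effect; if that's the case on your VM, the sysctl route is an alternative (a sketch; verify against your own kernel):

# /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

# apply without a reboot
sysctl -p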

Increase file limits

Edit /etc/security/limits.conf

root - nofile 100000
root - locks 100000

In the shell, run: ulimit -n 1000000

Hadoop config

In /bin/hadoop/conf/hadoop-env.sh, un-comment the JAVA_HOME line and set it to your Java install location.
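For a yum-installed OpenJDK on CentOS 6, the uncommented line typically ends up looking something like this (the exact path is an assumption; point it at whatever directory holds bin/java on your VM):

export JAVA_HOME=/usr/lib/jvm/jre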

In /bin/hadoop/conf/hdfs-site.xml, point hadoop.tmp.dir at a directory where Hadoop can keep its data (I used /usr/hadoop/datastore):

<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/datastore/hadoop-${user.name}</value>
<description>Temp directories</description>
</property>
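Whatever hadoop.tmp.dir points at has to exist and be writable by the user running Hadoop; to match the value above:

mkdir -p /usr/hadoop/datastore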

In /bin/hadoop/conf/core-site.xml:

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>

In /bin/hadoop/conf/mapred-site.xml:

<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>
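One thing the snippets above gloss over: the properties in each *-site.xml file need to live inside a <configuration> element. As a sketch, core-site.xml ends up shaped like this (same property as above):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>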

Almost there… format the namenode

The first thing you need is someplace to put data… which is the HDFS file system, so you need to initialize that space. In this case our cluster is ONLY on the local machine… please do not do this on a running Hadoop host… it WILL (as the name implies) format the file system, and you will lose all your Hadoop data.

From $HADOOP_HOME

bin/hadoop namenode -format

You will get a bunch of messages… just prior to the shutdown message, you should see your storage directory listed with a "has been successfully formatted" message.

Start the “cluster”

$HADOOP_HOME/bin/start-all.sh

(enter your root password for all three prompts)
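To check that the daemons actually came up, jps is handy (it is part of the JDK; on CentOS you may need the -devel Java package if you only installed the JRE). On a healthy Hadoop 1.x single-node setup you should see all five daemons; if one is missing, the logs under $HADOOP_HOME/logs usually say why:

jps
# expect something like (PIDs will differ):
# NameNode
# DataNode
# SecondaryNameNode
# JobTracker
# TaskTracker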

Verify

http://<IP of host>:50030/      (Map/Reduce admin)

http://<IP of host>:50060/      (task tracker status)

http://<IP of host>:50070/     (namenode status)

Run the example map-reduce

bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount

bin/hadoop dfs -ls   (should show 1 item)

bin/hadoop jar hadoop-examples-1.1.1.jar wordcount testWordCount testWordCount-output

bin/hadoop dfs -ls (will show multiple items now…)… the testWordCount-output directory is the one used in the next command

bin/hadoop dfs -cat testWordCount-output/part-r-00000 |more
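If you want to run the example again, note that the wordcount job will refuse to write into an output directory that already exists, so clear it out first (-rmr is the Hadoop 1.x recursive remove):

bin/hadoop dfs -rmr testWordCount-output
bin/hadoop jar hadoop-examples-1.1.1.jar wordcount testWordCount testWordCount-output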

Stop the cluster

If you want to shut-down…

$HADOOP_HOME/bin/stop-all.sh
