Before I get into the installation… what is Hadoop anyway? Normally I don’t like the “cut-and-paste” approach of blogging, but in this case I will make an exception and quote the Apache docs: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.” There is quite a lot of information on the Apache site that goes into great detail on Hadoop; at some point I will boil that down into something simpler to digest, but for now, let’s get a node up and running. (Image credit: Apache documentation)
There are quite a few of these “how to install” posts out there, so I am sure there is nothing ground-breaking here; however, I did have to cobble together a number of different posts (with my configuration) to get things working. This post is as much for me to document (and be able to re-create) the install as it is to help others make it work. It should have your Hadoop services up and running in (hopefully) less than an hour on a single-node cluster.
The first step is to build a Linux box, which really is not a big deal, even if you have never done it before. I started with VirtualBox, which runs on Windows; it is a straightforward install to get the VirtualBox software running.
Install Linux
From there, you need to download an .iso file for some version of Linux. My choice, for compatibility with other development environments at work, was CentOS 6. I went with the minimal install; if you want the desktop environment and so on, you can pull down the full install. There is no real magic to the base installation. When configuring the host, you will need more than 512 MB of RAM (you can scale back later if necessary), and I chose to create a 30 GB drive as well. The drive is larger than the install needs, but the hope is to get Hadoop running, which will need some space to put things. Depending on the hardware you are hosting this on, you may also need to enable PAE in the VM settings: if the installation fails because of PAE support, go back to the host settings and select the PAE-enabled checkbox.
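The GUI works fine for all of this, but if you prefer to script the VM creation, a rough VBoxManage sketch looks like the following (the VM name, disk file name, and sizes are just example values, adjust them to taste):

VBoxManage createvm --name centos6 --register
VBoxManage modifyvm centos6 --memory 1024 --pae on            # > 512 MB RAM, PAE enabled
VBoxManage createhd --filename centos6.vdi --size 30720       # ~30 GB virtual disk
VBoxManage storagectl centos6 --name SATA --add sata
VBoxManage storageattach centos6 --storagectl SATA --port 0 --device 0 --type hdd --medium centos6.vdi

From there, attach the CentOS ISO, boot the VM, and run through the installer as usual.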
Since I used the minimal install ISO, there are a few things to add/enable after the install (a quick one-shot version follows this list); any of the yum install commands will tell you there is “nothing to do” if you already have the components installed.
- bridge the network adapter (in the VM manager software)
- set ONBOOT=yes in /etc/sysconfig/network-scripts/ifcfg-eth0 (enables the network interface on boot)
- restart VM
- ifconfig (to verify a valid IP has been obtained)
- yum install make
- yum install vim
- yum install ant
- yum install java
- yum install svn
- yum install wget
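For reference, the network edit and the package installs boil down to this (assuming the interface really is eth0, the CentOS 6 default):

# in /etc/sysconfig/network-scripts/ifcfg-eth0, make sure the interface comes up at boot
ONBOOT=yes
# then install the tools in one pass
yum install -y make vim ant java svn wget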
At this point, shut down the Linux VM and make a clone… you now have the Linux install built, so there is no need to re-do that part if you mess things up.
Hadoop
Get the version of Hadoop you want… I used 1.1.1:
In the /bin directory (or /usr/local/bin if you prefer, but then adjust the later references; as this is not a production machine, I am just dropping it in /bin for simplicity). The fetch-and-unpack steps are repeated as a single sequence after this list:
wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz
- gunzip the file
- unpack (tar -xvf)
- mv hadoop-1.1.1 hadoop
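The same download/unpack/rename sequence, run from /bin in one go:

cd /bin
wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz
tar -xzf hadoop-1.1.1.tar.gz     # gunzip + tar -xvf in a single step
mv hadoop-1.1.1 hadoop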
Once you have Hadoop unpacked, you need to set some environment variables before you can use it… in your shell profile (~/.bash_profile, for example); a consolidated snippet follows this list.
- export JAVA_HOME=/usr/lib/jvm/jre (JAVA_HOME should point at the Java install directory itself, not /usr/bin; the exact path depends on which JDK yum pulled in)
- PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
- export HADOOP_HOME=/bin/hadoop
- #optional
- export PS1="[\u@\h \w]"
- #allow incoming requests for the services
- service iptables stop
- setenforce 0
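Pulled together, the profile additions end up looking roughly like this (the JAVA_HOME path is my assumption for a yum-installed OpenJDK; adjust it to wherever Java landed on your box):

# ~/.bash_profile additions for Hadoop
export JAVA_HOME=/usr/lib/jvm/jre       # point at your actual Java install
export HADOOP_HOME=/bin/hadoop
PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
export PATH
export PS1="[\u@\h \w]"                 # optional prompt tweak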
SSH public key authentication
Hadoop uses SSH with public key authentication to manage its nodes (remote machines); in this case we are setting up the local host with the public key.
In your ~/.ssh directory (create it if you need to):
ssh-keygen -t rsa
Press <enter> to accept the default id_rsa file name, then enter a passphrase (no passphrase is easier, since the Hadoop startup scripts will not prompt you for it).
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
Type ‘yes’ to accept the host fingerprint and press <enter>.
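If you want to skip the prompts entirely, the same key setup can be done non-interactively; a quick sketch (the chmod is only needed if sshd complains about file permissions):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # empty passphrase, default key file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                   # accept the fingerprint once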
Disable IPv6
Edit /etc/modprobe.d/blacklist.conf and add:
#disable IPv6
blacklist ipv6
Increase file limits
Edit /etc/security/limits.conf
root - nofile 100000
root - locks 100000
In the shell, run: ulimit -n 1000000
Hadoop config
In /bin/hadoop/conf/hadoop-env.sh, un-comment the JAVA_HOME line and set it to the same Java home value you used above.
In /bin/hadoop/conf/hdfs-site.xml, set the Hadoop temp directory to wherever you want HDFS to store its data:
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/datastore/hadoop-${user.name}</value>
<description>Temp directories</description>
</property>
In /bin/hadoop/conf/core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>
In /bin/hadoop/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>
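Note that the element names must be lowercase, and each of these property blocks goes inside the <configuration> element that is already in the file; core-site.xml, for example, ends up looking like this:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>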
Almost there… Format the namenode
The first thing you need is someplace to put data, which is the HDFS file system, so you need to initialize that space. In this case our cluster is ONLY on the local machine… please do not do this on a running Hadoop host; it WILL (as the name implies) format the file system, and you will lose all your Hadoop data.
From $HADOOP_HOME
bin/hadoop namenode -format
You will get a bunch of messages… just prior to the shutdown message, you should see the storage directory listed with a “has been successfully formatted” message.
Start the “cluster”
$HADOOP_HOME/bin/start-all.sh
(if passwordless SSH did not take, you will be asked for the root password, or your key passphrase, at each of the three prompts)
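You can also sanity-check from the shell that the daemons came up. jps ships with the JDK rather than the bare JRE, so this assumes you installed the full OpenJDK development package:

jps    # should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker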
Verify
http://<IP of host>:50030/ (Map/Reduce admin)
http://<IP of host>:50060/ (task tracker status)
http://<IP of host>:50070/ (namenode status)
Run the example map-reduce
bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
bin/hadoop dfs -ls (should show 1 item)
bin/hadoop jar hadoop-examples-1.1.1.jar wordcount testWordCount testWordCount-output
bin/hadoop dfs -ls (will show multiple items now)… list the testWordCount-output directory and pick the part-r-0000x file name for the next command
bin/hadoop dfs -cat testWordCount-output/part-r-00000 | more
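If you want the results back out of HDFS onto the local disk, something like this works (the local file name here is just an example):

bin/hadoop dfs -copyToLocal testWordCount-output/part-r-00000 wordcount-results.txt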
Stop the cluster
If you want to shut down…
$HADOOP_HOME/bin/stop-all.sh