Before I get into the installation… what is Hadoop anyway? Normally I don’t like the “cut-and-paste” approach of blogging, but in this case I will make an exception and quote the Apache docs: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.” There is quite a lot of information on the Apache site that goes into great detail on Hadoop; at some point I will boil that down into something simpler to digest, but for now, let’s get a node up and running. (Image credit: Apache documentation)
There are quite a few of these “how to install” posts out there, so I am sure there is nothing ground-breaking here; however, I did have to cobble together a number of different posts (with my configuration) to get things working. This post is as much for me to document (and be able to re-create) the install as it is to help others make it work. It should have your Hadoop services up and running in (hopefully) less than an hour on a single-node cluster.
The first step is to build a Linux box, which really is not a big deal, even if you have never done it before. I started with VirtualBox, which runs on Windows; it is a straightforward install to get the VirtualBox software running.
Install Linux
From there, you need to download an .iso file for some version of Linux. My choice, for compatibility with other development environments at work, was CentOS 6. I went with the minimal install; if you want the desktop environment and so on, you can pull down the full install. There is no real magic to the base installation. When configuring the host, you will need more than 512 MB of RAM (you can scale back later if necessary), and I chose to create a 30 GB drive as well. The drive is larger than the install needs, but the hope is to get Hadoop running, which will need some space to put things. Depending on the hardware you are hosting this on, you may also need to enable PAE in the VM settings: if the installation fails because of PAE support, go back to the host settings and select the PAE-enabled checkbox.
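The GUI works fine for all of this, but if you prefer to script the VM creation, a rough VBoxManage sketch looks like the following (the VM name, disk file name, and sizes are just example values, adjust them to taste):

VBoxManage createvm --name centos6 --register
VBoxManage modifyvm centos6 --memory 1024 --pae on            # > 512 MB RAM, PAE enabled
VBoxManage createhd --filename centos6.vdi --size 30720       # ~30 GB virtual disk
VBoxManage storagectl centos6 --name SATA --add sata
VBoxManage storageattach centos6 --storagectl SATA --port 0 --device 0 --type hdd --medium centos6.vdi

From there, attach the CentOS ISO, boot the VM, and run through the installer as usual.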
Since I used the minimal install ISO, there are a few things to add/enable after the install (a quick one-shot version follows this list); any of the yum install commands will tell you there is “nothing to do” if you already have the components installed.
- bridge the network adapter (in the VM manager software)
- set ONBOOT=yes in /etc/sysconfig/network-scripts/ifcfg-eth0 (enables the network interface on boot)
- restart VM
- ifconfig (to verify a valid IP has been obtained)
- yum install make
- yum install vim
- yum install ant
- yum install java
- yum install svn
- yum install wget
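For reference, the network edit and the package installs boil down to this (assuming the interface really is eth0, the CentOS 6 default):

# in /etc/sysconfig/network-scripts/ifcfg-eth0, make sure the interface comes up at boot
ONBOOT=yes
# then install the tools in one pass
yum install -y make vim ant java svn wget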
At this point, shut down the Linux VM and make a clone… you now have the Linux install built, so there is no need to re-do that part if you mess things up.
Hadoop
Get the version of Hadoop you want… I used 1.1.1:
In the /bin directory (or /usr/local/bin if you prefer, but then adjust the later references; as this is not a production machine, I am just dropping it in /bin for simplicity). The fetch-and-unpack steps are repeated as a single sequence after this list:
wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz
- gunzip the file
- unpack (tar -xvf)
- mv hadoop-1.1.1 hadoop
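The same download/unpack/rename sequence, run from /bin in one go:

cd /bin
wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz
tar -xzf hadoop-1.1.1.tar.gz     # gunzip + tar -xvf in a single step
mv hadoop-1.1.1 hadoop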
Once you have Hadoop unpacked, you need to set some environment variables before you can use it… in your shell profile (~/.bash_profile, for example); a consolidated snippet follows this list.
- export JAVA_HOME=/usr/lib/jvm/jre (JAVA_HOME should point at the Java install directory itself, not /usr/bin; the exact path depends on which JDK yum pulled in)
- PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
- export HADOOP_HOME=/bin/hadoop
- #optional
- export PS1="[\u@\h \w]"
- #allow incoming requests for the services
- service iptables stop
- setenforce 0
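Pulled together, the profile additions end up looking roughly like this (the JAVA_HOME path is my assumption for a yum-installed OpenJDK; adjust it to wherever Java landed on your box):

# ~/.bash_profile additions for Hadoop
export JAVA_HOME=/usr/lib/jvm/jre       # point at your actual Java install
export HADOOP_HOME=/bin/hadoop
PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
export PATH
export PS1="[\u@\h \w]"                 # optional prompt tweak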
SSH public key authentication
Hadoop uses SSH with public key authentication to manage its nodes (remote machines); in this case we are setting up the local host with the public key.
In your ~/.ssh directory (create it if you need to):
ssh-keygen -t rsa
Press <enter> to accept the default id_rsa file name, then enter a passphrase (no passphrase is easier, since the Hadoop startup scripts will not prompt you for it).
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
Type ‘yes’ to accept the host fingerprint and press <enter>.
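If you want to skip the prompts entirely, the same key setup can be done non-interactively; a quick sketch (the chmod is only needed if sshd complains about file permissions):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # empty passphrase, default key file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                   # accept the fingerprint once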
Disable IPv6
Edit /etc/modprobe.d/blacklist.conf and add:
#disable IPv6
blacklist ipv6
Increase file limits
Edit /etc/security/limits.conf
root - nofile 100000
root - locks 100000
In the shell, run: ulimit -n 1000000
Hadoop config
In /bin/hadoop/conf/hadoop-env.sh, un-comment the JAVA_HOME line and set it to the same Java home value you used above.
In /bin/hadoop/conf/hdfs-site.xml, set the Hadoop temp directory to wherever you want HDFS to store its data:
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/datastore/hadoop-${user.name}</value>
<description>Temp directories</description>
</property>
In /bin/hadoop/conf/core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>
In /bin/hadoop/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>
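Note that the element names must be lowercase, and each of these property blocks goes inside the <configuration> element that is already in the file; core-site.xml, for example, ends up looking like this:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>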
Almost there… Format the namenode
The first thing you need is someplace to put data, which is the HDFS file system, so you need to initialize that space. In this case our cluster is ONLY on the local machine… please do not do this on a running Hadoop host; it WILL (as the name implies) format the file system, and you will lose all your Hadoop data.
From $HADOOP_HOME
bin/hadoop namenode -format
You will get a bunch of messages… just prior to the shutdown message, you should see the storage directory listed with a “has been successfully formatted” message.
Start the “cluster”
$HADOOP_HOME/bin/start-all.sh
(if passwordless SSH did not take, you will be asked for the root password, or your key passphrase, at each of the three prompts)
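You can also sanity-check from the shell that the daemons came up. jps ships with the JDK rather than the bare JRE, so this assumes you installed the full OpenJDK development package:

jps    # should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker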
Verify
http://<IP of host>:50030/ (Map/Reduce admin)
http://<IP of host>:50060/ (task tracker status)
http://<IP of host>:50070/ (namenode status)
Run the example map-reduce
bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
bin/hadoop dfs -ls (should show 1 item)
bin/hadoop jar hadoop-examples-1.1.1.jar wordcount testWordCount testWordCount-output
bin/hadoop dfs -ls (will show multiple items now)… list the testWordCount-output directory and pick the part-r-0000x file name for the next command
bin/hadoop dfs -cat testWordCount-output/part-r-00000 | more
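If you want the results back out of HDFS onto the local disk, something like this works (the local file name here is just an example):

bin/hadoop dfs -copyToLocal testWordCount-output/part-r-00000 wordcount-results.txt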
Stop the cluster
If you want to shut down…
$HADOOP_HOME/bin/stop-all.sh