This one gets a little finicky depending on your configuration, and how much horsepower you have available to you. If you started off with my first post and built a VM … ideally … you made a clone of the host once you had Hadoop running, which will make this “easier”. You are quickly getting into the realm of things that really will require multiple machines and multiple hosts… The second part of this series got us into Hive… this one is back to Hadoop and getting data spread out a bit. So here goes…
On the clone (which will be your slave), go to the networking section in the VM software and obtain a new MAC address. Both guests will be bridging to the host’s adapter, and you need separate addresses in your network config to make it work.
Update the /etc/sysconfig/network-scripts/ifcfg-eth0 file with the new MAC address
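For reference, the file ends up looking something like this — the MAC below is a made-up example, and your other settings (BOOTPROTO, etc.) should stay whatever they already are:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
# The NEW MAC address the VM software assigned to the clone (example value):
HWADDR=00:0C:29:AA:BB:CC
```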
You should now have a new IP for the slave… if you don’t have eth0, or your IP is not different from your master’s… you MUST resolve that before you continue.
After you have the new host name and IP… go through the .ssh setup (just like you did on the first host) to create a new id_rsa.pub and authorized_keys.
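If you need a refresher, the .ssh setup can be sketched like this. In this sketch SSH_DIR is a temp directory so it is safe to run anywhere; on the real slave you would do this in ~/.ssh itself:

```shell
# Sketch of the slave's .ssh setup; SSH_DIR stands in for ~/.ssh here.
SSH_DIR=$(mktemp -d)
chmod 700 "$SSH_DIR"

# Generate a fresh RSA key pair with no passphrase -- the clone inherited
# the master's key, and you want a NEW one on the slave.
ssh-keygen -t rsa -N "" -q -f "$SSH_DIR/id_rsa"

# Authorize the new public key for password-less logins.
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"
```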
Once you have that, fire-up the clone, and add an entry in the .bash_profile to separate the hosts this will be my slave:
On the slave, edit /etc/hosts and create the IP entries for BOTH hosts
On the master, edit /etc/hosts, create the IP entries for BOTH hosts, and COMMENT OUT the 127.0.0.1 entry
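The hosts files end up looking something like this — the IPs and host names are examples, so substitute your own:

```
# /etc/hosts -- same two entries on BOTH hosts (example IPs/names)
192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave

# On the master ONLY, also comment out the loopback entry:
# 127.0.0.1   localhost
```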
On the slave, in $HADOOP_HOME/conf/:
Put the slave IP into the slaves file
put the master IP into the masters file
put the IP for the master into the mapred-site.xml for the mapred.job.tracker value
put the IP for the master into the core-site.xml for the fs.default.name value
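The two XML edits look something like this — 192.168.1.10 is an example master IP, and the 9001/9000 ports are just common choices; keep whatever ports you set up when you first built the host:

```xml
<!-- $HADOOP_HOME/conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>192.168.1.10:9001</value>
</property>

<!-- $HADOOP_HOME/conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.10:9000</value>
</property>
```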
On the master, in $HADOOP_HOME/conf/:
ADD the IP address for the slave into the slaves file (there will be 2 lines here)
Check the masters file; it should have 1 entry, for JUST the master host
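So, with the same example IPs as above, the master’s two files end up as:

```
# $HADOOP_HOME/conf/slaves on the master -- 2 lines, because the master
# will run the worker daemons (DataNode/TaskTracker) too
192.168.1.10
192.168.1.11

# $HADOOP_HOME/conf/masters on the master -- just the master itself
192.168.1.10
```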
Format the namenode
First … you need to remove the namenode format you did when you originally built this… (this is, after all, a clone of the other host)
rm -fr hadoop-root
$HADOOP_HOME/bin/hadoop namenode -format
Restart both hosts… start clean with no processes running….
On the Master $HADOOP_HOME/bin/start-all.sh
It will ask you for the slave password, it will add the ssh key to the master on the first execution, etc… but as a result… you should have all the processes running on the master that you would expect (jps to check)… in addition, the TaskTracker and DataNode processes on the slave were automatically started as well (jps on the slave to check)
You have already done jps on both hosts, and should see the expected processes.
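For reference, on a 2-node setup like this one (where the master is also listed in the slaves file, so it runs the worker daemons too), the jps output looks roughly like this — the process IDs here are made up:

```
# On the master:
$ jps
2481 NameNode
2594 SecondaryNameNode
2688 JobTracker
2760 DataNode
2841 TaskTracker
2990 Jps

# On the slave:
$ jps
1902 DataNode
1988 TaskTracker
2077 Jps
```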
for the Master http://_ip_address_:50070/
You should see something like this, with “Live Nodes” showing 2:
You should also see 2 entries in the namenode UI when you click on “Live Nodes”
Also… on the master:
$HADOOP_HOME/bin/hadoop fsck /
will give you the health of your data nodes, the status of replication, etc… you will see the 2 data nodes listed. Do not be alarmed that your data is “missing replicas” and that you are under-replicated… those will start making more sense once you go beyond 2 nodes….
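Trimmed down, the fsck report looks something like this — the sizes and counts are placeholders; the lines to pay attention to are the data-node and rack counts, plus the replication warnings mentioned above:

```
$ $HADOOP_HOME/bin/hadoop fsck /
...
Status: HEALTHY
 Total size:                    ... B
 Total blocks (validated):      ...
 Under-replicated blocks:       ...
 Missing replicas:              ...
 Number of data-nodes:          2
 Number of racks:               1
```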
If you have made it this far… you now have 2 nodes… in 1 rack… and (hopefully) a much better working understanding of how Hadoop and Hive each work, and how they work together.
As I mentioned in the other posts, I will be digging through some of the available documentation for both Hadoop and Hive to distill a few things. There is a lot of documentation; between this series of posts and the ones I have planned, hopefully there is something useful in what I have put together to get you up and running with relative ease.