Hadoop/Hive: a few lessons learned

It has been a few days since the last set of posts, and quite honestly I did not want to leave things hanging even this long, so here is a brief update that should (at least help) cut off some of the frustrations I have faced for those of you attempting the same path through the big-data wilderness that I am. As you have seen in my earlier posts, I started off with virtual hosts on my Windows machine. It was a fantastic learning experience, and I do not recommend short-cutting straight to the endpoint, but I quickly realized that 3 hosts virtualized on a Windows machine that was underpowered in the first place were not going to get me anything I would be able to use.

That sent me off to take advantage of some of the fantastic (free) resources at Amazon that I talked about in a prior post. The free tier actually gives you quite a bit of computing time. For reasons that will become clear (probably not in this post), I chose to do my next install with Cloudera Manager (4.1) rather than the manual build, valuable as that experience was. So I spun up 5 reserved RedHat instances at Amazon, and was up and running really quite quickly. However, RedHat (a prerequisite for Cloudera) was not entirely free, so beware: 55 hours of CPU time on the reserved Micros (my 5 machines for a little more than 10 hours each) plus some data transfer costs got me to about $5 for the day. (OK, not a huge cost; I just don't want someone yelling at me because I said it was free.)

Those 5 Micros are... well... really... micro. If you want performance, Micro is not it. In fact, I had to restart the main host node after pretty much every step of the install because it became unresponsive and would no longer answer on the ports for the website. So, $5 well spent on understanding the installation, but not exactly useful as an end result.

What to do? Lucky for me, I have some supportive folks at work who really want me to figure some of this stuff out, and we already have a development account set up and ready to go with Amazon. So, with a small budget, I am now off to the next level…

I have started 5 spot instances. For those of you not familiar with spot instances: basically, you are buying resources from the available pool in the availability zone of your choice. The only catch to the low price you are paying is that once the going price reaches your max, they will (and did to me, when I picked the wrong data center the first time around) unceremoniously and with no warning simply take your machines back, and you lose everything. So be careful which availability zone you pick, and be careful to set a max price that you can live with (above the averages for the data center). So for somewhere between $0.03 and $0.30 per hour per machine, I now have 5 large instances with 2 cores, 7 GB of memory and 8 GB of data space each. Really not a bad deal. (Actually, tonight I added 3 more to a second rack, moved the secondary namenode and rebalanced the whole thing.) My total cluster capacity is now at 6.3 TB, with a running cost of something less than $12 per day.
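If you want to sanity-check a max price before bidding, the arithmetic is simple enough to script. This is only a rough sketch, not anything Amazon provides: the function names are made up for illustration, and it assumes the hourly figures are in dollars, that the price stays flat for the whole day (real spot prices move around), and that data transfer and storage charges are ignored.

```python
def daily_cost(machines: int, price_per_hour: float, hours: float = 24.0) -> float:
    """Cost in dollars of running `machines` instances for `hours` at a flat hourly price."""
    return machines * price_per_hour * hours


def max_hourly_price(budget_per_day: float, machines: int, hours: float = 24.0) -> float:
    """Highest flat hourly price per machine that still fits within the daily budget."""
    return budget_per_day / (machines * hours)


if __name__ == "__main__":
    machines = 8  # the 5 original spot instances plus the 3 added later

    # Daily cost at the low and high ends of the hourly range (assumed to be dollars).
    for price in (0.03, 0.30):
        print(f"{machines} machines at ${price:.2f}/hour: ${daily_cost(machines, price):.2f} per day")

    # Average hourly price per machine that keeps the cluster under $12 per day.
    print(f"Break-even price for $12/day: ${max_hourly_price(12.0, machines):.4f}/hour")
```

Run against 8 machines, this shows that staying under $12 per day means the average price has to sit near the low end of that range, which is a useful number to have in mind when setting the max.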

Next post, I will get into the specifics of Cloudera, how to use the Amazon AMI instead of the RedHat instance, and some general things I have found about Cloudera. From this short (initial) look at what Cloudera brings to the table, I am impressed.


