Hadoop – some basic setup

For those of you coming along on this journey, I want to take a quick step back. Rather than assuming what you do and don’t know, here is a little background on how to make this installation go. This post is not really “interesting findings” but rather a how-to for the install process.

As I explained in prior posts, I started with some virtual machines, quickly discovered the limitations there, and moved things to Amazon. Amazon gives you a variety of options for instances of hardware, none of which on the surface give you what you are looking for… but go with spot instances running the Amazon Linux AMI. First off, check the spot pricing history for the different availability zones and see what makes sense. For me, us-east-1d is by far the least expensive and most stable in cost for the m1.large instances (2-core) that I am using. When you make the instance request you will see the current spot price, most likely somewhere around 3 cents per hour. You need to pick a maximum hourly price that you are willing to pay. Unless you want to start over, make sure you set it high enough to cover any short-term spikes in spot pricing. I set my max at $0.30 (30 cents) per hour, which is high enough that I do not expect to have the machines reclaimed. You can see the historical pricing by instance size for each availability zone by selecting “Pricing History”.
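To make the zone comparison concrete, here is a minimal sketch in Python. It assumes you have copied a handful of price samples out of the pricing history by hand. The sample numbers, the function names, and the 10x headroom factor are all made up for illustration, and the sketch only ranks zones by average price, so judging stability is still up to you.

```python
from collections import defaultdict
from statistics import mean


def summarize_spot_history(records):
    """Group (zone, price) samples by zone and return {zone: (mean, peak)}."""
    by_zone = defaultdict(list)
    for zone, price in records:
        by_zone[zone].append(price)
    return {zone: (mean(prices), max(prices)) for zone, prices in by_zone.items()}


def pick_zone_and_bid(records, headroom=10.0):
    """Pick the zone with the lowest average price and bid a multiple of its peak.

    The headroom factor is an assumption, not a rule: it just leaves room
    for short-term spikes above anything seen in the samples.
    """
    summary = summarize_spot_history(records)
    zone = min(summary, key=lambda z: summary[z][0])
    peak = summary[zone][1]
    return zone, round(peak * headroom, 2)


# Made-up samples in USD per hour, for illustration only.
sample = [
    ("us-east-1a", 0.035), ("us-east-1a", 0.120),
    ("us-east-1d", 0.030), ("us-east-1d", 0.032),
]

zone, max_bid = pick_zone_and_bid(sample)
print(zone, max_bid)
```

A generous multiple of the observed peak is one simple way to choose the ceiling. You only pay the going spot price, so a high maximum mostly buys protection against having the machines reclaimed.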

The spot-request wizard is relatively straightforward, and you can create several instances at the same time. I am running 11 instances (1 namenode and 10 datanodes), so pick how many you want. You can always add more later, so two instances to start (one namenode and one datanode) is fine for setup. Accept the defaults for the device IDs and kernel IDs. In the storage configuration, add 2 storage volumes as instance storage (Ephemeral 0 and Ephemeral 1); you will need data drives to put the files on. If you choose 2 volumes, the datanodes will automatically detect both of the mounted volumes and provide redundancy across both. You can choose EBS if you wish, but in my case I am just building test systems that are not really going to live on, and I don’t really need EBS. (Yes, before you post comments… ephemeral storage will go away, and redundancy across ephemeral storage is kinda pointless… but stick with me, you will want to see the features in Cloudera later.) Once you have storage selected, select (or create) your key pair; you will need it to access these instances via SSH. Choose (or create) your security group and you are good to go. It will take a few minutes (something like 10) for your instances to be provisioned, and then you are up and running.
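If you would rather script the request than click through the wizard, the same choices can be written down as data. The sketch below only builds the keyword arguments in the shape that, as far as I know, the boto3 EC2 client's request_spot_instances call accepts; it does not contact AWS. The AMI ID, key pair name, security group name, device names, and price are all placeholders I made up, so substitute your own.

```python
def build_spot_request(ami_id, key_name, security_group, zone, max_price, count):
    """Return keyword arguments for a spot request; nothing is sent to AWS here."""
    return {
        "SpotPrice": f"{max_price:.2f}",  # the API takes the price as a string
        "InstanceCount": count,
        "LaunchSpecification": {
            "ImageId": ami_id,
            "InstanceType": "m1.large",
            "KeyName": key_name,
            "SecurityGroups": [security_group],
            "Placement": {"AvailabilityZone": zone},
            # Two instance-store volumes; the device names are an assumption.
            "BlockDeviceMappings": [
                {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
                {"DeviceName": "/dev/sdc", "VirtualName": "ephemeral1"},
            ],
        },
    }


# Placeholder values, for illustration only.
params = build_spot_request(
    ami_id="ami-xxxxxxxx",
    key_name="hadoop-key",
    security_group="hadoop-cluster",
    zone="us-east-1d",
    max_price=0.30,
    count=2,  # one namenode and one datanode to start
)
print(params["SpotPrice"], params["InstanceCount"])
```

With boto3 installed and credentials configured, these arguments could be passed along as `boto3.client("ec2").request_spot_instances(**params)`, but check the current boto3 documentation before relying on that.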

Now that you have instances, you need to set up the security group so you can access them. First off, allow access to all ports from the security group itself. This will allow each of the machines in the cluster to have full access to the other machines. You can restrict this if you want, but if these are the only instances in the security group, there is not much need to get too creative with the security. Next, for the IP address (or range of addresses) you are coming from (please don’t just open it up to everyone), open the following incoming ports (for the various services and products):

22 (ssh)
3306 (mysql)
7180
7183
8000-8020
8888
11000
44444
50000-50099
60000-60050
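The port list above can also be kept as data, which makes it easy to review or to feed into a script. This is a rough sketch: the rule dictionaries follow the IpPermissions shape that, to my knowledge, boto3's authorize_security_group_ingress expects, the source address is a documentation placeholder, and the check against a wide-open range is my own addition.

```python
import ipaddress

# Ports from the list above; a single port is written as a (p, p) range.
PORT_RANGES = [
    (22, 22), (3306, 3306), (7180, 7180), (7183, 7183), (8000, 8020),
    (8888, 8888), (11000, 11000), (44444, 44444), (50000, 50099),
    (60000, 60050),
]


def build_ingress_rules(cidr, port_ranges=PORT_RANGES):
    """Build one TCP ingress rule per port range for a single source CIDR."""
    network = ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
    if network.prefixlen == 0:
        # Refuse 0.0.0.0/0: do not open the cluster to everyone.
        raise ValueError("refusing to open ports to the whole internet")
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": low,
            "ToPort": high,
            "IpRanges": [{"CidrIp": str(network)}],
        }
        for low, high in port_ranges
    ]


# 203.0.113.10 is a documentation address; use your own public IP.
rules = build_ingress_rules("203.0.113.10/32")
for rule in rules:
    print(rule["FromPort"], rule["ToPort"], rule["IpRanges"][0]["CidrIp"])
```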

Almost there….

Gather the public IPs from the EC2 management console for the machines you created. Edit the /etc/hosts file on your primary node (what will be your namenode) so that you can refer to the nodes via some naming convention and the primary can reach the other machines in the cluster.
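As one possible naming convention, the sketch below turns a list of IPs into /etc/hosts lines, treating the first address as the namenode and numbering the rest as datanodes. The host names, the "cluster.example" domain, and the addresses are all hypothetical; pick whatever convention suits you.

```python
def hosts_entries(ips, domain="cluster.example"):
    """Return /etc/hosts text: the first IP is the namenode, the rest datanode1..N."""
    lines = []
    for index, ip in enumerate(ips):
        short = "namenode" if index == 0 else f"datanode{index}"
        # Format: address, canonical name, short alias.
        lines.append(f"{ip}\t{short}.{domain}\t{short}")
    return "\n".join(lines) + "\n"


# Documentation addresses, for illustration only.
print(hosts_entries(["203.0.113.11", "203.0.113.12", "203.0.113.13"]), end="")
```

The output is meant to be appended to the existing /etc/hosts rather than to replace it, so the localhost entries stay in place.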

Set up SSH access from the primary node (what will be your namenode) so it can reach each of the other machines in your cluster.
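There is more than one way to do this. One option, sketched here, is to copy your EC2 private key to the primary node and generate an ~/.ssh/config entry for every other machine. The host names and the key path are hypothetical, and ec2-user is the default login on Amazon Linux, so adjust if your image differs.

```python
def ssh_config(hosts, user="ec2-user", identity_file="~/.ssh/hadoop-key.pem"):
    """Return ssh_config text with one Host block per cluster machine."""
    blocks = []
    for host in hosts:
        blocks.append(
            f"Host {host}\n"
            f"    HostName {host}\n"
            f"    User {user}\n"
            f"    IdentityFile {identity_file}\n"
        )
    return "\n".join(blocks)


# Hypothetical host names; they must resolve on the primary node.
print(ssh_config(["datanode1", "datanode2"]), end="")
```

After appending the output to ~/.ssh/config on the primary node, a plain `ssh datanode1` should log in without a password prompt, provided the key file is readable only by you.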

Create directories, edit /etc/fstab, and run mount -a to set up the two ephemeral drives you added during instance creation.
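To give a feel for what the /etc/fstab entries might look like, here is a small sketch that prints one line per drive. The device names (/dev/xvdb, /dev/xvdc), the mount points (/data0, /data1), the ext3 filesystem type, and the mount options are all assumptions; check what your instance actually exposes (for example with lsblk) before copying anything.

```python
def fstab_lines(mounts, fstype="ext3", options="defaults,noatime"):
    """Return one six-field fstab line per (device, mount_point) pair."""
    return [
        # Fields: device, mount point, type, options, dump, fsck pass.
        "\t".join([device, mount_point, fstype, options, "0", "0"])
        for device, mount_point in mounts
    ]


# Assumed device names and mount points, for illustration only.
MOUNTS = [("/dev/xvdb", "/data0"), ("/dev/xvdc", "/data1")]

for line in fstab_lines(MOUNTS):
    print(line)
```

The mount points have to exist before mount -a will succeed, so create the directories first (here that would be /data0 and /data1), then append the lines to /etc/fstab.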

Create a file “/etc/redhat-release” containing the single line “CentOS release 6.3 (Final)”. This is required on each machine to “fool” the Cloudera installer into thinking it is installing on a Red Hat release… which is essentially what the Amazon Linux AMI is; it just needs the label for the installer to run.
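Since this has to be repeated on every machine, a tiny script helps. The sketch below writes the line to whatever path you give it and reads it back; the function name is my own, and writing to /etc needs root.

```python
from pathlib import Path

RELEASE_LINE = "CentOS release 6.3 (Final)"


def write_release_file(path="/etc/redhat-release"):
    """Write the release label the installer looks for and return the file contents."""
    target = Path(path)
    target.write_text(RELEASE_LINE + "\n")
    return target.read_text()
```

Run as root on each node, `write_release_file()` creates the file in place. Passing a different path is handy for a dry run.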

Download and execute the installer from Cloudera. The installation really is straightforward, and Cloudera also provides great documentation for it.

I hope this helps you get up and running. It looks a lot worse as a long-winded blog post… don’t be intimidated by the process. Give it a try; I think you will be pleasantly surprised.

Tags: Cloudera, hadoop, HDFS, Linux
