How big is your data?

 

 

I will not take credit for the infographic here; it is something I ran across, but it struck me as very appropriate for the topic I have been beating the drum about since my post about zettabyte storage needs. It is with those “extreme” storage needs in mind that you need to design and build today’s systems. I call them “extreme” because, as things are progressing, in a short time we will all be reminiscing about the days when 347 blog posts per minute was such a small number. (I picked on that one since I am adding to that pile.)

[Infographic: Data in One Minute]

Clearly, not everyone will be capturing all of these data points. However, as you look at your own unique storage needs, build for the future… a BIG DATA future… and keep in mind that your data will be growing at a very fast pace! It takes special attention to scaling, IO needs, and an understanding of your analytical requirements to handle data volumes of this size… or you will be left with difficult decisions about what data is “not important” and will simply be left behind. That is not to say that simply because a byte of data is produced it must be stored forever, but you do need to be in a position to make analytical use (at least short term) of all your data. Be ready, more data is coming your way!

What does that checkbox do? (Redshift Encryption)

**EDIT** I don’t normally go back and edit things in a prior post… but in this case the additional link to Werner Vogels’ blog, which came out after I wrote this, is worthwhile to the topic at hand. **EDIT**

As I alluded to in the prior post, I have been busy… I have actually been busy writing, and now that all of the first-draft copies have been delivered to the publisher and I am in the mode of writing things, I figured I would grab a small topic for a short post. We are not far enough along in the publishing process for me to make formal book announcements yet, but it will become clear relatively soon.

OK, so you are creating a Redshift cluster and you want to enable encryption… not a bad idea, but what does that mean in Redshift, you ask? Not a bad question, as the documentation is relatively sparse on what is going on in this part of the engine. Normally, when you enable encryption on something, the first thing you need to do is provide some sort of strong key… so here you have a checkbox? Well yes, that is in fact what you have. It is really not a bad thing, as the key would have to be stored someplace at Amazon anyway (more on that in a second). Basically, your root key (private key) is generated for you, so it is likely something better than you are going to type in anyway. That generated key is then stored on your Amazon control plane network (as are your other AMI credentials, keys, SSH keys, etc.). That control plane is on the other side of a firewall from your physical instance of whatever you are running (in this case Redshift). Authentication through the firewall happens at startup of your instance: the instance is authenticated before the key is handed over to it. The key is then kept in memory (never written to disk) on the cluster that is running Redshift, which allows for the decryption of your data. The data is encrypted (using hardware-accelerated AES-256 encryption) before it is put back on disk, so any data at rest will be encrypted (much like Microsoft’s Transparent Data Encryption, TDE). Thus, if one of those drives, backups or other data were to somehow become compromised, there is nothing anyone could do with it, as they do not have the root key, which can only be obtained by your valid Redshift cluster through the firewall to the control plane network.

So, the reason you would have to store the key at Amazon (either way) is quite simple: otherwise you would have to be physically involved in the starting of the cluster. After maintenance, after a reboot, after a crash… no matter the reason… the cluster cannot start without first obtaining the key. If that key is not stored in a way that allows for automatic authentication and access to it, the cluster simply could not start on its own. It would require you to access a console, command line, or other interface to provide the key externally for each start-up of the cluster.

More blog-posts to come…

I know in my last post I indicated there would be more blog posts shortly on Amazon Redshift. Those posts are coming… this is not that post… (sorry). My blog posts have taken a short detour that will become more clear in the coming weeks. For now, know that I have not forgotten you!… I will have those how-to sections written soon.

“Big Data” with Amazon Redshift – Intro

If you have been following along on my blog, you have seen the various technical and other ramblings from my recent R&D efforts around “Big Data”. A little while ago I wrote about giving Amazon a try, which I still stand by… and along those lines, I am now giving Amazon Redshift (their new entry into the big data market) a test-drive. Air-finger quotes around “Big Data”… that in itself is another blog post entirely… however, for my purpose here, let’s just call it data that is larger in volume and number of rows than can easily be stored in a single traditional database instance. Basically, I have found that Hadoop will do a fantastic job for large volumes of data that are essentially “at rest”. That is not a bad thing really, particularly if you are looking to keep logs of particular events, or other high-volume activities that will not change or update over time. That all being said, the next stone I am picking up is Redshift. I am not sure how many nodes will be needed in the cluster to make things go, but that really is what this exploration is about… what will it take to make it go… I will have some blog posts about our “adventures with Redshift”, but for now… let’s get set up.

So my initial setup is with the on-demand pricing structure… honestly, I think there should be some better incentive from Amazon to “try before you commit” with some spot pricing… I cannot imagine you would run on-demand for a long-term solution (especially given the effort of implementing a solution). All that said, I am starting a single on-demand XL node (2 TB disk, 2 cores, 15 GB memory).

The setup, as with most of the Amazon products, is relatively straightforward. A few clicks, a few questions, an encryption key and you are pretty much ready. The default install of the product does not allow access to anything from anywhere (which is not a bad thing), so anything you want to let in requires a conscious decision. So do some setup for the Redshift-specific security groups, and you should be good to go (after you install some client tools…). I can say, however, that as you are using a PostgreSQL back-end, there are better tools available than what Amazon suggests. Personally, I have chosen to use the EMS freeware version of their SQL Manager product. You can make your own choice on that… but that is what I ended up with.

Take a look at the step-by-step instructions if you have questions; there is really no need for me to paste all that here. If you have made it this far, you probably have a pretty good understanding of what you are trying to get to.

 

As a tease to my next blog posts… I will get into the loading of data (what we learned),  and how things are performing so far, etc.

Hadoop – some basic setup

For those of you coming along on this journey, I want to take a quick step back. Rather than assuming what you do and don’t know, here is a little background on how to make this installation go. This post is not really “interesting findings” but rather a how-to for the install process.

As I explained in prior posts, I started with some virtual machines, quickly discovered the limitations there, and moved things to Amazon. Amazon gives you a variety of options for instances of hardware, none of which on the surface give you what you are looking for… but go with spot instances with the Amazon AMI. First off, check the spot pricing history for the different availability zones, and see what makes sense… for me, us-east-1d is by far the least expensive, most stable cost for the m1.large instances (2-core) that I am using. At the point when you make the instance request you will see what the current spot price is, most likely somewhere around 3 cents per hour. You need to pick a maximum per-hour cost that you are willing to pay, and (unless you want to start over) make sure you set it high enough to cover any short-term spikes in spot pricing. For me, I set my max at 30 cents ($0.30) per hour, which is high enough that I do not expect to have the machines reclaimed. You can see the historical pricing by instance size for each availability zone by selecting “Pricing History”.

The spot-request wizard is relatively straightforward, and you can create several instances at the same time. I am running 11 instances (1 namenode instance and 10 data nodes)… so pick how many you want. You can always add more later, so two instances to start (one namenode and one data node) is fine for setup. Accept the defaults for the device IDs and kernel IDs. In the storage configuration, add 2 storage volumes as instance storage (Ephemeral 0 and Ephemeral 1); you will need data drives to put the files on. If you choose 2 volumes, the data nodes will automatically detect both of the mounted volumes and provide the redundancy across both volumes. You can choose EBS if you wish… but in my case, I am just building test systems that are not really going to live on, and I don’t really need the EBS. (Yes, before you post comments… ephemeral will go away… redundancy across ephemeral storage is kinda pointless… but stick with me… you will want to see the features in Cloudera later.) Once you have storage selected, select (or create) your key pair. You will need it to access these instances via SSH. Choose (or create) your security group… and you are good to go… it will take a few (something like 10) minutes for your instances to be provisioned, and you are up and running.

Now that you have instances, you need to set up the security group so you can access them. First off, allow access to all ports from the security group itself. This will allow each of the machines in the cluster to have full access to the other machines. You can restrict it if you want… but within the security group, if these are the only instances, there is not much need to get too creative with the security. Next, for the IP address (or range of addresses) you are coming from (please don’t just open it up to everyone), open the following incoming ports (for the various services and products):

22 (ssh)
3306 (mysql)
7180
7183
8000-8020
8888
11000
44444
50000-50099
60000-60050
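
The port list above can also be applied from the command line rather than through the console. The sketch below only prints the AWS CLI calls so they can be reviewed first. It assumes the AWS CLI is installed and configured; the security group ID and source address are placeholders, not values from this setup.

```shell
# Placeholders: substitute your own security group ID and source address.
GROUP_ID="${GROUP_ID:-sg-0123456789abcdef0}"
MY_CIDR="${MY_CIDR:-203.0.113.10/32}"

# Ports and port ranges from the list above.
PORTS="22 3306 7180 7183 8000-8020 8888 11000 44444 50000-50099 60000-60050"

# Print one ingress rule per entry; nothing is changed in AWS.
print_ingress_rules() {
  for port in $PORTS; do
    echo "aws ec2 authorize-security-group-ingress --group-id $GROUP_ID --protocol tcp --port $port --cidr $MY_CIDR"
  done
}

print_ingress_rules
```

Once the printed commands look right, they can be applied with: print_ingress_rules | sh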

Almost there….

Gather the public IPs from the EC2 management console for the machines you created. Edit the /etc/hosts file on your primary node (what will be your namenode) so that you can refer to the nodes by name and the primary can reach the other machines in the cluster.
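
As a rough sketch of the hosts entries, the function below appends one line per node to a hosts file. The addresses and node names are invented for illustration; use the IPs from your own EC2 console and whatever naming convention you prefer.

```shell
# Append name entries for the cluster to a hosts file.
# $1 is the file to edit; try a scratch copy before the real /etc/hosts.
# The addresses and names below are invented for illustration.
add_cluster_hosts() {
  cat >> "$1" <<'EOF'
203.0.113.11  namenode
203.0.113.12  datanode1
203.0.113.13  datanode2
EOF
}
```

Try it on a copy first (cp /etc/hosts /tmp/hosts.test, then add_cluster_hosts /tmp/hosts.test), and run it as root against /etc/hosts once the result looks right.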

Set up SSH access from the primary node (what will be your namenode) so it can reach each of the other machines in your cluster.

Create directories, edit /etc/fstab, and run mount -a to set up the two ephemeral drives you created during the instance creation.
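
A minimal sketch of that step follows. The device names (/dev/xvdb and /dev/xvdc) are assumptions and vary by AMI and instance type, so confirm which devices your ephemeral drives actually appear as before using it. The filesystem type is left as auto so the kernel detects it.

```shell
# Create mount points and append fstab entries for the two ephemeral drives.
# $1 = fstab file to edit, $2 = directory that will hold the mount points.
# /dev/xvdb and /dev/xvdc are assumed device names; confirm yours first.
add_ephemeral_mounts() {
  fstab_file="$1"
  mount_root="$2"
  mkdir -p "$mount_root/1" "$mount_root/2"
  printf '%s\t%s\tauto\tdefaults,noatime\t0\t0\n' /dev/xvdb "$mount_root/1" >> "$fstab_file"
  printf '%s\t%s\tauto\tdefaults,noatime\t0\t0\n' /dev/xvdc "$mount_root/2" >> "$fstab_file"
}
```

As root on each node, something like add_ephemeral_mounts /etc/fstab /data followed by mount -a would do it, where /data is just an example location.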

Create a file “/etc/redhat-release” with the single line “CentOS release 6.3 (Final)”. This is required on each machine… to “fool” the Cloudera installer into thinking it is installing on a Red Hat release… which is what the Amazon AMI really is… it just needs the label for the installer to run.
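
The label file is a one-liner to create. The sketch below wraps it in a small function whose target path can be overridden, so it can be tried on a scratch file before writing to /etc as root.

```shell
# Write the release label the installer looks for.
# $1 defaults to /etc/redhat-release; pass another path to try it safely.
write_release_label() {
  target="${1:-/etc/redhat-release}"
  printf '%s\n' 'CentOS release 6.3 (Final)' > "$target"
}
```

Run write_release_label with no argument, as root, on every machine in the cluster before starting the installer.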

Download and execute the installer from Cloudera… The installation really is straightforward, and Cloudera also provides great documentation for it.

I hope this helps you get up-and-running.  It looks a lot worse as a long-winded blog-post… don’t be intimidated by the process and give it a try, I think you will be pleasantly surprised.

Hadoop/Hive a few lessons learned

It has been a few days since the last set of posts, and quite honestly I did not want to leave it hanging even this long, so I wanted to give a brief update to (at least help) cut off some of the frustrations I have faced, for those of you attempting the same path through the big-data wilderness that I am. As you have seen in my earlier posts, I started off on a virtual host on my Windows machine… a fantastic learning experience, and I do not recommend taking a short-cut past it to the endpoint… but I quickly realized that 3 hosts on a virtual machine, on what was an underpowered Windows host in the first place, was not going to get me anything that I would be able to use. This then sent me off to take advantage of some of the fantastic (free) resources at Amazon that I talked about in a prior post. The free tier actually gives you quite a bit of computing time… for reasons that will become clear (probably not in this post) I have chosen to try my next install with the Cloudera Manager (4.1), rather than the manual build (although that was a valuable experience). So I spun up 5 reserved Red Hat instances at Amazon, and was really quite quickly up and running. However, Red Hat (a prerequisite for Cloudera) was not entirely free… so beware: 55 hours of CPU time on the reserved Micros (which was my 5 machines for a little more than 10 hours) plus some data transfer costs got me to about $5 for the day. (OK… not a huge cost… I just don’t want someone yelling at me about the fact that I said it was free.)

Those 5 micros… are… well… really… Micro. If you want performance, Micro is not it. In fact, I had to pretty much restart the main host node after each step of the install because it became unresponsive and would no longer answer on the ports for the website… so $5 well spent on understanding the installation… but not exactly useful as an end result.

What to do?….  Lucky for me, I have some supportive folks at work that really want me to figure some of this stuff out… and we already have a development account setup and ready to go with Amazon… so with a small budget, I am now off-to the next level…

I have started 5 spot instances. For those of you not familiar with spot instances… basically you are buying resources from the available pool in the availability zone of your choice. The only catch to the low price you are paying is that once the spot price reaches your max, they will (and did to me… when I picked the wrong data center the first time around) unceremoniously and with no warning simply take your machines back, and you lose everything. So be careful with what availability zone you pick… and be careful that you set a max price that you can live with (above the averages for the data center). So for somewhere between $0.03 and $0.30 per hour per machine, I now have 5 large instances with 2 cores, 7 GB memory and 8 GB of data space each… really not a bad deal. (Actually tonight, I added 3 more to a second rack, moved the secondary namenode and rebalanced the whole thing.) My total cluster capacity is now at 6.3 TB with a running cost of something less than $12 per day.
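
To put the running cost in perspective, the arithmetic is simply instances times hourly price times 24. The sketch below assumes the prices quoted above are in dollars ($0.03 typical, $0.30 maximum bid) and ignores data transfer and storage charges.

```shell
# Daily cost in dollars = instances x hourly price x 24 hours.
# $1 = number of instances, $2 = hourly price per instance in dollars.
daily_cost() {
  awk -v n="$1" -v price="$2" 'BEGIN { printf "%.2f\n", n * price * 24 }'
}

daily_cost 8 0.03   # 8 machines at roughly the typical spot price
daily_cost 8 0.30   # 8 machines if every hour were billed at the maximum bid
```

A bill of just under $12 per day across 8 machines works out to roughly 6 cents per machine-hour (12 / (8 x 24) is about 0.0625), which sits between those two bounds.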

Next post… I will get into the specifics of Cloudera, how to use the Amazon AMI instead of the Red Hat instance… and some general things I have found about Cloudera. From the short (initial) look at what Cloudera brings to the table, I am impressed.

Hadoop +1 (add a node that is…)

This one gets a little finicky depending on your configuration and how much horsepower you have available to you. If you started off with my first post and built a VM… ideally… you made a clone of the host once you had Hadoop running, which will make this “easier”. You are quickly getting into the realm of things that really will require multiple machines and multiple hosts… The second part of this series got us into Hive… this one is back to Hadoop and getting data spread out a bit. So here goes…

Networking

On the clone (which will be your slave), in the VM software, go to the networking section and obtain a new MAC address. Both guest hosts will be bridging to the host’s adapter, and you need to have separate addresses in your network config to make it work.

Update the /etc/sysconfig/network-scripts/ifcfg-eth0 file with the new MAC address, then restart networking and check the result:

service network restart
ifconfig

You should now have a new IP for the slave… if you don’t have eth0, or your IP is not different from your master’s… you MUST resolve that before you continue.
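
Editing the MAC address by hand is fine, but it can also be scripted. The sketch below assumes the ifcfg file already contains an HWADDR= line (if it does not, the file is left unchanged); the MAC address in the usage note is a made-up example.

```shell
# Replace the HWADDR line in an ifcfg file with a new MAC address.
# $1 = ifcfg file, $2 = new MAC address as shown by the VM software.
set_hwaddr() {
  cfg_file="$1"
  new_mac="$2"
  # Write to a temporary file first, then swap it into place.
  sed "s/^HWADDR=.*/HWADDR=$new_mac/" "$cfg_file" > "$cfg_file.new" &&
    mv "$cfg_file.new" "$cfg_file"
}
```

For example, as root: set_hwaddr /etc/sysconfig/network-scripts/ifcfg-eth0 00:0C:29:AB:CD:EF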

After you have the new host name and IP… go through the .ssh setup (just like you did on the first host) to create a new id_rsa.pub and authorized_keys.

Once you have that, fire up the clone, and add an entry in the .bash_profile to tell the hosts apart. This will be my slave:
hostname hadoop2

Config

On the slave, edit /etc/hosts and create the IP entries for BOTH hosts.

On the master, edit /etc/hosts, create the IP entries for BOTH hosts, and COMMENT OUT the 127.0.0.1 entry.

On the slave, in $HADOOP_HOME/conf/:
Put the slave IP into the slaves file
Put the master IP into the masters file
Put the IP for the master into mapred-site.xml for the mapred.job.tracker value
Put the IP for the master into core-site.xml for the fs.default.name value

On the master, in $HADOOP_HOME/conf/:
ADD the IP address for the slave into the slaves file (there will be 2 lines here)
Check the masters file; it should have 1 entry for JUST the master host
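
For reference, the two properties named above end up looking something like the files this sketch writes. It generates minimal versions into a directory of your choosing so you can compare them with your real conf files instead of overwriting them. The ports (9000 and 9001) are assumptions; keep whatever ports your existing single-node config already uses.

```shell
# Write minimal core-site.xml and mapred-site.xml into a directory.
# $1 = target directory, $2 = master IP.
# Ports 9000 and 9001 are assumed values, not taken from the original setup.
write_site_files() {
  conf_dir="$1"
  master_ip="$2"
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$master_ip:9000</value>
  </property>
</configuration>
EOF
  cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>$master_ip:9001</value>
  </property>
</configuration>
EOF
}
```

For example: mkdir /tmp/conf-check, then write_site_files /tmp/conf-check 192.168.1.10 (an example address), and diff the results against the files in $HADOOP_HOME/conf/.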

Format the node

First… you need to remove the namenode format you did when you originally built this… (this is, after all, a clone of the other host)
cd /tmp
rm -fr hadoop-root

$HADOOP_HOME/bin/hadoop namenode -format

Start Cluster

Restart both hosts… start clean with no processes running….

On the Master $HADOOP_HOME/bin/start-all.sh

It will ask you for the slave passwords, add the SSH key to the master on the first execution, etc… but as a result you should have all the processes running on the master that you would expect (jps to check)… in addition, the tasktracker and datanode processes on the slave will have been started automatically as well (jps on the slave to check).
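
Checking jps by eye works, but a small helper makes it harder to miss a daemon. This sketch reads jps output on stdin and prints a line for each expected daemon that is not running. The helper name is made up for illustration; the daemon names are the class names jps shows for this generation of Hadoop.

```shell
# Report which expected daemons are missing from `jps` output read on stdin.
# Arguments are the daemon names to look for.
missing_daemons() {
  jps_output=$(cat)
  for daemon in "$@"; do
    # jps prints "<pid> <ClassName>", so match the name as the last field.
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$NF == d { found = 1 } END { exit !found }' ||
      echo "missing: $daemon"
  done
}
```

On the master: jps | missing_daemons NameNode SecondaryNameNode JobTracker (add DataNode and TaskTracker if the master is also listed in the slaves file). On the slave: jps | missing_daemons DataNode TaskTracker. No output means everything expected is running.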

Verify

You have already done jps on both hosts, and see the expected processes.

For the master, browse to http://_ip_address_:50070/
You should see a cluster summary with “live nodes” at 2:

[Screenshot: cluster summary]

You should also see 2 entries in the namenode view when you click on “live nodes”:

[Screenshot: namenode view]

Also… on the master:

$HADOOP_HOME/bin/hadoop fsck /

This will give you the health of your data nodes, status of replication, etc… you will see the 2 data nodes listed. Do not be alarmed that your data is “missing replicas” and that you are under-replicated… most likely HDFS wants more copies of each block (dfs.replication defaults to 3) than a 2-node cluster can hold, so those warnings will start making more sense once you go beyond 2 nodes…
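
The fsck report is long, so it can help to pull out just the replication lines. The sketch below filters fsck output read on stdin. The label text in the pattern is assumed to match the summary section fsck prints; if your version words the labels differently, adjust the pattern.

```shell
# Keep only the replication-related lines from `hadoop fsck /` output on stdin.
# The labels in the pattern are assumed to match your Hadoop version's report.
fsck_summary() {
  grep -E 'Under-replicated blocks|Missing replicas|Default replication factor|Number of data-nodes'
}
```

Usage on the master: $HADOOP_HOME/bin/hadoop fsck / | fsck_summary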

If you have made it this far… you now have 2 nodes… in 1 rack, and (hopefully) a much better working understanding of how Hadoop and Hive both work, and how they work together.

http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html#NameNode+and+DataNodes

As I mentioned in the other posts, I will be digging through some of the available documentation for both Hadoop and Hive to distill a few things. There is a lot of documentation; with this series of posts, as well as the ones I have planned, hopefully there is something useful in what I have put together to get you up and running with relative ease.
