From Hadoop to Hive

This is the next step after you have completed the initial setup to get Hadoop running. It will walk you through the steps to get Hive running; everything in the prior post is a prerequisite to this setup.  Just like my prior post, this is nothing “ground breaking”, but hopefully it will provide a consolidated place to look.  This install takes (really) about 15 minutes, including some time to make a new clone of the system; the hard part is already done.

So you say… what is Hive?? …. “Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”

As with the Hadoop install, there is a great deal of documentation about Hive that I will distill into something in the near future; this post is just about getting it running.

Getting started

In /bin  (or /usr/local/bin … or wherever you want it)

wget http://apache.tradebit.com/pub/hive/stable/hive-0.8.1.tar.gz

gunzip and tar -xvf that file

mv hive-0.8.1 hive

Environmental

Set the HIVE_HOME variable to the installation directory

Add $HIVE_HOME/bin to your PATH
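A minimal sketch of those two settings, assuming Hive was unpacked to /bin/hive as in the steps above (adjust the path if you put it somewhere else). Adding the lines to your shell profile should make them stick between logins.

```shell
# Assumption: Hive lives in /bin/hive (from the "mv hive-0.8.1 hive" step above).
export HIVE_HOME=/bin/hive

# Put the Hive launcher scripts on the PATH.
export PATH="$HIVE_HOME/bin:$PATH"
```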

HDFS setup

There are a couple of HDFS directories that you will need (/tmp is likely already there from running the Hadoop tests)

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

GO!

$HIVE_HOME/bin/hive

If all goes well… you will get a message about what file the history is going to, and you will now have a “hive>” prompt.

Verify (DDL)

hive> create table pokes (foo int, bar string);

hive> show tables;

Load some data…

If you were in /bin/hive when you started hive… the path is relative to the starting directory

hive> load data local inpath './examples/files/kv1.txt' overwrite into table pokes;

hive> select * from pokes;

 

Map Reduce… you are there!

hive> select count(*), foo
> from pokes
> where foo > 0
> group by foo;

You will see the map reduce jobs execute… and then the results…
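If you would rather not type at the prompt, the same query can be run straight from the shell. This is a small sketch, assuming the hive launcher is on your PATH and the pokes table from above exists; run_hive_query is just a helper name made up for illustration.

```shell
# The same aggregate as above, kept in a variable so it can be reused.
QUERY='select count(*), foo from pokes where foo > 0 group by foo;'

# Hypothetical helper: run one statement through the Hive CLI.
# -e executes the quoted string, -S (silent) trims the progress chatter.
run_hive_query() {
    if command -v hive >/dev/null 2>&1; then
        hive -S -e "$1"
    else
        echo "hive not found on PATH (is \$HIVE_HOME/bin in your PATH?)" >&2
        return 127
    fi
}

# Usage: run_hive_query "$QUERY"
```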

Single Cluster Hadoop – from Zero to Hadoop

Before I get into the installation… what is Hadoop anyway?  Normally I don’t like the “cut-paste” approach to blogging, but in this case I will make an exception and quote the Apache Docs: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.”  There is quite a lot of information on the Apache site that goes into great detail on hadoop; at some point I will boil that down into something simpler to digest, but for now, let’s get a node up-and-running. (Image credit Apache Documentation)

http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html#NameNode+and+DataNodes

There are quite a few of these “how to install” posts out there… so I am sure there is nothing ground-breaking here; however, I did have to “cobble together” a number of different posts (with my configuration) to get things working.  This post is as much for me to document (and be able to re-create) the install as it is to help others make it work.  This should have your hadoop services up and running in (hopefully) less than an hour on a single node cluster.

The first step… build a Linux box, which really is not a big deal, even if you have never done it before.  I started with VirtualBox, which will run on Windows.  It is a straightforward install to get the VirtualBox software running.

Install Linux

From there, you need to download an .iso file for some version of Linux… my choice, for compatibility with other development environments at work, was CentOS 6.  I went with the minimal install; if you want the desktop environment etc., you can pull down the full install.  There is no real magic to the base installation.  When configuring the virtual machine, you will need more than 512 MB of RAM (you can scale back later if necessary); I chose to create a 30 GB drive as well.  The drive is larger than I need for the install… but the hope is to get hadoop running, which will need some space to put things.  Depending on the hardware that you are hosting this on, you may also need to enable PAE in the settings… if the installation fails because of PAE support, go back to the virtual machine settings and select the PAE checkbox.

Since I did the minimal install iso, there are a few things to add/enable after the install, any of the yum install commands will tell you there is “nothing to do” if you already have the components installed.

  1. bridge the network adapter (in the VM manager software)
  2. in /etc/sysconfig/network-scripts/ifcfg-eth0 set ONBOOT="yes"  (will enable the network interface on boot)
  3. restart VM
  4. ifconfig (to verify a valid IP has been obtained)
  5. yum install make
  6. yum install vim
  7. yum install ant
  8. yum install java
  9. yum install svn
  10. yum install wget
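
A quick way to confirm the packages from the list landed is to look for their commands. This is only a sketch; missing_tools is a made-up helper name, and the command names are simply the ones used in the yum lines above.

```shell
# Hypothetical helper: print every command from the argument list
# that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage on the new VM (prints nothing when everything is installed):
#   missing_tools make vim ant java svn wget
```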

At this point… shutdown the Linux VM, and make a clone… you now have the Linux install built, no need to re-do that part if you mess things up.

Hadoop

Get the version of Hadoop you want… I used 1.1.1:

In the /bin directory (or /usr/local/bin if you prefer… but then adjust the later references to /bin accordingly; as this is not a production machine, I am just dropping it in /bin for simplicity):

wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1.tar.gz

  • gunzip the file
  • unpack (tar -xvf)
  • mv hadoop-1.1.1 hadoop

Once you have Hadoop unpacked, you need to set some environmental things before you can use things… in your .

  • export JAVA_HOME=/usr/bin
  • PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
  • export HADOOP_HOME=/bin/hadoop
  • #optional
  • export PS1="[\u@\h \w]"
  • #allow incoming requests for the services
  • service iptables stop
  • setenforce 0

SSH public key authentication

Hadoop requires SSH public key access to manage nodes (remote machines); in this case we are setting up the local host with the public key.

in your ~/.ssh directory (make it if you need to)

ssh-keygen -t rsa

Then press <enter> to accept the id_rsa file name
Then put in a pass-phrase (no passphrase is easier for startup)

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

ssh localhost

Then type ‘yes’ for the fingerprint then press <enter>(no passphrase… you will have an easier time at startup)
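
If ssh localhost still asks for a password, overly open permissions on the key files are a common culprit, since sshd can ignore authorized_keys when the directory or file is writable by others. Here is a small sketch of tightening them; secure_ssh_dir is a hypothetical helper name, and 700/600 are the conventional modes rather than anything specific to Hadoop.

```shell
# Hypothetical helper: make sure the SSH directory and authorized_keys
# exist and are readable/writable by the owner only.
secure_ssh_dir() {
    dir="$1"
    mkdir -p "$dir" || return 1
    chmod 700 "$dir" || return 1
    touch "$dir/authorized_keys" || return 1
    chmod 600 "$dir/authorized_keys"
}

# Usage: secure_ssh_dir "$HOME/.ssh"
```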

Disable IPv6

Edit /etc/modprobe.d/blacklist.conf and add:

#disable IPv6
blacklist ipv6

Increase file limits

Edit /etc/security/limits.conf

root - nofile 100000
root - locks 100000

In shell run: ulimit -n 1000000

Hadoop config

In /bin/hadoop/conf/hadoop-env.sh un-comment and set the java home value

In /bin/hadoop/conf/hdfs-site.xml set the temp directory to your install directory

<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/datastore/hadoop-${user.name}</value>
<description>Temp directories</description>
</property>

In /bin/hadoop/conf/core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>

In /bin/hadoop/conf/mapred-site.xml

<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>
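
The snippets above only show the property blocks; in the actual files they sit inside a configuration element. As a sketch (write_site_file is a hypothetical helper, and the paths in the usage lines assume the /bin/hadoop install from this post), a whole single-property file could be generated like this:

```shell
# Hypothetical helper: write a minimal Hadoop site file holding one property.
# Arguments: target file, property name, property value.
# Note: the value is written as-is, with no XML escaping.
write_site_file() {
    file="$1"; name="$2"; value="$3"
    cat > "$file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$name</name>
    <value>$value</value>
  </property>
</configuration>
EOF
}

# Usage (this OVERWRITES the target files):
#   write_site_file /bin/hadoop/conf/core-site.xml fs.default.name hdfs://localhost:9100
#   write_site_file /bin/hadoop/conf/mapred-site.xml mapred.job.tracker localhost:9101
```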

Almost there… Format the namenode

The first thing you need is someplace to put data… which is the HDFS file system, so you need to initialize the space.  In this case our cluster is ONLY on the local machine… please do not do this on a running hadoop host… it WILL (as the name implies) format the file system, and you will lose all your hadoop data.

From $HADOOP_HOME

bin/hadoop namenode -format

You will get a bunch of messages… just prior to the shutdown message, you should see a directory with a “has been successfully formatted” message

Start the “cluster”

$HADOOP_HOME/bin/start-all.sh

(enter your root password for all three prompts)

Verify

http://<IP of host>:50030/      (Map/Reduce admin)

http://<IP of host>:50060/      (task tracker status)

http://<IP of host>:50070/     (namenode status)
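
To check the three pages without opening a browser, something like the following can be used. It is a sketch: status_urls and probe_status are made-up helper names, the ports are the ones listed above, and the probe assumes the wget installed earlier.

```shell
# Print the three status URLs for a given host name or IP.
status_urls() {
    for port in 50030 50060 50070; do
        echo "http://$1:$port/"
    done
}

# Probe each URL. --spider asks wget to check the page without saving it;
# -T 5 -t 1 means a five second timeout and a single attempt.
probe_status() {
    status_urls "$1" | while read -r url; do
        if wget -q --spider -T 5 -t 1 "$url"; then
            echo "up:   $url"
        else
            echo "DOWN: $url"
        fi
    done
}

# Usage: probe_status localhost
```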

Run the example map-reduce

bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount

bin/hadoop dfs -ls   (should show 1 item)

bin/hadoop jar hadoop-examples-1.1.1.jar wordcount testWordCount testWordCount-output

bin/hadoop dfs -ls (will show multiple items now…)… note the part-?-00000 file name under testWordCount-output for the next command

bin/hadoop dfs -cat testWordCount-output/part-r-00000 |more
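
For a sanity check on the job's output, the same count can be approximated locally with standard tools. This is a rough sketch, not what Hadoop runs internally: local_wordcount is a made-up name, and it assumes words are simply whitespace-separated tokens, which is how I understand the example job splits its input.

```shell
# Count whitespace-separated words on stdin and print "word<TAB>count",
# sorted by word in plain byte order.
local_wordcount() {
    tr -s '[:space:]' '\n' | awk 'NF' | LC_ALL=C sort | LC_ALL=C uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Usage, from $HADOOP_HOME:
#   local_wordcount < LICENSE.txt | more
```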

Stop the cluster

If you want to shut-down…

$HADOOP_HOME/bin/stop-all.sh

Amazon – Give it a try

A little more than a week ago Amazon made some announcements, near and dear to my heart, warehousing.  Only this time, as you would expect from Amazon, it is moving to the cloud… with the announcement of their “Redshift” product.  I have been intrigued by some of the advantages that cloud computing provides, however, from a data warehouse perspective, the tools, data storage methods, BI integration, user experience etc. just don’t seem like they are a fit with cloud computing (yet).  This is why Redshift may turn out to be some of the initial steps in a direction that begins to solve some of the issues (we shall see).  Redshift or otherwise, it is impossible to deny (at least the desire) to move warehousing in that direction. [Amazon Blog]

After some reading in a variety of places, I decided a test-drive of Redshift might be worthwhile… to fill out the request form, you need an AWS account number… which led me to look at the AWS account levels… and I signed up for the free tier…. which brings me to the title of my post “Amazon – Give it a try”.

This is literally my first experience using these services, so I struck out, remaining in my comfort zone, and started an RDS database instance for SQL Server 2012, and within 15 minutes (including provisioning time) had a 20 GB database instance up-and-running, automatic patching, backups and all….   The alerts and notifications with CloudWatch rival what you can get in most any well-constructed commercial product, ranging from CPU utilization to queue lengths and beyond, at the global as well as instance level.  Overall, I am impressed so far.  I still have much to learn and play with… hopefully I will be able to report some findings on Redshift, which is what kicked me off on this Amazon journey in the first place.

So for now… take my advice… Give it a try… there is something there for everyone (for free).

You need a Zetta what?

Data, data, everywhere!  I have been doing some reading on data scale, and quite honestly it reminds me a lot of the kinds of feelings I had while taking an astronomy class in college… the scale is simply too large to get your head around. Before we get too far into this, try to imagine, in the simplest terms, the amount of digital data you personally create in a day.  In terms of raw numbers, let’s take a look at some simple definitions of terms, starting with the world most of us live in and can relate to in some fashion.  If you think in terms of internet upload/download speeds… we all jump mentally to a Megabyte.  When you think about picture storage, or how much space Google, Apple or another provider gives you for free, you think Gigabytes.  When you think about your local storage device, backup devices and other general storage needs, Terabytes are a normal size that you can imagine what you would do with… how much storage do I need for my pictures, music, applications etc… those are all relatively normal terms for most everyone.

1 kilobyte = 1024^1 = 1,024 bytes
1 megabyte = 1024^2 = 1,048,576 bytes
1 gigabyte = 1024^3 = 1,073,741,824 bytes
1 terabyte = 1024^4 = 1,099,511,627,776 bytes

Not so long ago, there was “news” of petabyte storage systems for large enterprise storage. We are now getting into sizes that most of us can’t quickly work back into the “how much can I store on that” size… 1 petabyte = 1024^5 = 1,125,899,906,842,624 bytes

But how much can you store in 1 Exabyte??  1 exabyte = 1024^6 = 1,152,921,504,606,846,976 bytes.  In 2006 there was an estimated 160 exabytes of available hard-drive storage world-wide… no small amount of storage… and by 2009 one of the hard-drive manufacturers estimated a total of 330 exabytes of available storage…

Now enter the Zettabyte (this is where we get to the astronomy… the “not sure how to relate this thing to me” size). 1 zettabyte = 1024^7 = 1,180,591,620,717,411,303,424 bytes (that is a little over 1 trillion Gigabytes).  It is estimated that available storage crossed the Zettabyte mark in 2010, and reached 1.8 Zettabytes in 2011.  Forget available storage… Cisco is estimating internet IP traffic alone will cross 4 Zettabytes by 2016.
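
The arithmetic behind those numbers is just repeated multiplication by 1024, which is easy to replay in the shell. A small sketch follows (pow1024 is a made-up helper). Note that shell arithmetic is normally 64-bit, so it tops out at the exabyte; the zettabyte figure no longer fits, which says something about the scale.

```shell
# Hypothetical helper: print 1024 raised to the given power (0 through 6).
pow1024() {
    result=1
    i=0
    while [ "$i" -lt "$1" ]; do
        result=$((result * 1024))
        i=$((i + 1))
    done
    echo "$result"
}

pow1024 3   # gigabyte: 1073741824
pow1024 6   # exabyte:  1152921504606846976

# A zettabyte (1024^7) expressed in gigabytes (1024^3) is 1024^4:
pow1024 4   # 1099511627776, a little over one trillion
```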

I can draw the graph for you… but I think you get it!

Where does all this data come from? …. Well… just this blog… has a certain number of bytes that I created by putting my thoughts to digital “ink”.  The twitter post about publishing this blog… a few more bytes… the click on the link you performed to get here… a few more bytes… some information gathered about locations, clicks, and other blog-stats… a few more bytes.  Now obviously this is small scale, but put it on the global scale, and simply browsing about on the internet is generating huge amounts of data (ask Google Analytics).  Add to that digital images, movie streaming, voice-over-IP phone calls, cell phones, cloud based backup services, you name it… it all copies, replicates, generates, aggregates, analyzes…. data…. more data today than yesterday, and there will be more data tomorrow than today. We are hurtling forward at a truly amazing pace.


Change is in the air

Fall has come (and almost gone) … turkeys have been eaten, and we are heading headlong into the holiday season, so part of this post is an early “resolution” of sorts.  For a variety of reasons (I am going to skip the exact specifics for now, but they will become clear over the coming weeks) change is in the air.  All good things, all exciting, and hopefully they will provide me an opportunity to explore warehousing, data analytics and BI in a whole new way.  Before anyone gets too excited…. I am still with the same company, still working with the same great group of people…

As I look forward to some new challenges and new learning opportunities, with both uncertainty about the “how” and the excitement of building something new, it is not too early to plan.  One of the commitments to myself (and anyone that cares to come along for the ride) will be to blog more… share my journey through what will be an interesting minefield, and hopefully provide some interesting, educational, (maybe thought provoking) information.  Traditionally my blog has been mostly technical; I expect (knowing who I am) this will remain mostly true. However, as part of this blog-more-often commitment, there may very well be some non-technical stuff along the way… we will see how it goes.  So for now… that is all…

Denali Dependency Services

Introduction

I have been on the “What’s New In CTP1 for Denali” speaking “tour” (if you count 2 as a tour) for the past week. What I can say from my interactions, and the feedback from the presentation, there is some real excitement for the new release (although dauntingly far away for some people). In my presentation I go over the “what’s new” for SSIS… and finish up with a look at the new dependency services piece. It is clearly the dependency services that has the BI folks hooked! To support the interest that people have expressed, I wanted to write a quick post to cover some of the dependency services parts, and get you rolling with what will be a very cool, very important part of the BI infrastructure in Denali.

Install

The installation is not complicated; however, it is not available from within SSMS. The command line documentation from MSDN is easy to follow, and will get you running with a few simple commands.

Once you have it installed, you need to create extraction points for the things you want to look at. There is a wizard (right-click) that will get you started with the configuration.

In my example, I created a “test” extraction point for SSIS that is pulling from a project that I have already deployed to the server. By right-clicking on that extraction point, you now have two steps to take:

First, you need to select “update catalogs”, then right-click again and select “Sync Now”. Once you have done these two steps, double-click the “default” under views, and you will be able to navigate through the dependencies that you have defined in your extraction points.

Conclusion

I believe you will very quickly feel comfortable navigating the structures, viewing dependencies, and gaining an appreciation for how quick you will be able to do impact analysis with this new tool.

PASS – Part 2

Introduction

 I have to say that this has not turned into the “series” that I projected so optimistically as I departed for the PASS summit. It may still be a series, who knows, but for now, as I sit in the airport waiting for my flight back home I figure I would get a few thoughts together on the sessions that I attended.

 I tried to hit a cross section of things I understand, and would like to do better, things I really need to learn, and a few things that I am not sure how they apply to my life (education for the sake of it). I think I hit that goal…

Tuesday

AD371S – Grant Fritchey (one of the smarter people you will run into) presented “Identifying and Fixing performance problems using execution plans.” This one fits into the area of things I know; I just wanted to get a smart person’s perspective and validate some of the things I do. An excellent session, filled with great discussion, well done, and well worth the time!

BID275M – Pej Javaheri, Lynn Langit, Donald Farmer (actually had a stand-in for Donald… unfortunately, I don’t remember his name right now), and a few others… presented “Business Intelligence Power Hour” (should be named the comedy hour). This was fun (and educational): short, witty presentations on the different tools in the BI space, what is new in Denali, and why BI can help solve problems like why people with higher taxes that drink more are happier. This fit a few of my criteria of things I know, and things I would like to know better… and besides, it was fun!

DBA237 Aaron Nelson PACKED the room with over 380 people for “The Dirty Dozen: PowerShell scripts for the Busy DBA” This one is in the area of things I need to learn how to do, and Aaron presented some scripts, and REALLY made the whole thing easy to understand. I honestly think we scared him a bit, this was his first PASS presentation, and a (literally) standing room only crowd was a bit intimidating, but he hung in there, and did really well with it.

DBA391S Kevin Kline presented “End-to-End Troubleshooting for SQL server” Kevin is always good to listen to, and this was no exception. My head was rather full by this time, so I decided to go with another topic that I am comfortable with, and just validate some of my methods and assumptions about how things are.

Wednesday

DBA388S Grant Fritchey presented “DMV’s as a shortcut to Procedure tuning” Since Grant’s presentation on Tuesday was so good, I figured I would jump back into learning mode with the DMV’s. Something I need to do better, and Grant is just the guy to get you there.

AD311 Rob Farley had an absolutely OUT OF CONTROL funny educational BLAST of a session, “The Incredible Shrinking Execution Plan”. I really did not know what I was in for with this one… but Rob had us rolling the whole time! He literally broke about every “rule” of presenting (including the “never type in a presentation” rule)… he started off with a blank SSMS window, no slides, and a comedy routine that was some of the funnier stuff of the conference. With all that said, by the end of the presentation, my head HURT with the number of things I need to re-evaluate in how I look at queries, views, and how the optimizer views joins in general.

BIA380M Matt Masson presented “What’s Coming Next in SSIS” This one fit simply because SSIS is a core part of what I do, and Denali is changing, fixing, enhancing etc a huge portion of how SSIS works. After this session, I was truly (really) upset that we are only on ctp-1, and it will be sometime next year before we can actually push what is a drastic change to production.

PD163 Christine Valdes presented (with some help from Brent and others) “SQL Image Wardrobe Governor: The Newest Feature in R2” This one fit in the… I can’t think anymore, Brent is funny, and why not do something for the sake of education. Fun interactive session talking about why you dress the way you do (or should).

Thursday

Keynote address from David DeWitt. I have to say he is likely the smartest person that I have ever been in the same room with. I suddenly have more faith in the query optimizer than just about any technology we use on a daily basis. He really tried to talk to us in a way we could understand… and left us ALL in the dust so many times it was crazy. I wish we could have more presentations from folks like him; even if I don’t totally understand, something might just rub off and stick in my pea-brain. Truly an impressive keynote that I will be watching (a few times) on the DVD set.

BIA379S Marco Russo presented “Monitoring Cube Performance and Usage”

This starts getting into the area of things I need to learn: Analysis Services, performance of cubes etc. A good presentation, a percentage of which I need to re-watch when I get the DVDs. This session came right after the keynote address from David DeWitt, and my mind was a bit liquefied when we started this session.

BIA206 Stacia Misner presented “Real World Analysis Services Stored Procedures” This was jumping neck deep into things that I don’t understand enough to be able to use (at this level). This one was mostly education for the sake of education. Great presentation, MDX, and stored procedures in analysis services, DEEP stuff.

DBA247 Ken Simmons presented “Enforcing Compliance with Policy-Based Management”

This one I went to for a few reasons. I use policy based management, and wanted to see Ken’s take. I needed something to make me feel like I actually know something (the keynote, followed by cube performance, followed by SSAS procedures… I needed something to boost me back up) and not the least of the reasons, is I really like Ken, and he was stressed about the session not having enough questions, so I came to support him (and only had to ask one question, there were PLENTY from the audience) and Ken handled the session like the pro that he is.

BID216 Andreas Wolter presented “Report Builder 3 What’s in it For You” At this point in the conference I have to say my tank was empty. Late nights (that might be another post… maybe), early mornings, and more information crammed into my head than I have had in a LONG time. So I figured I would wrap up the day with a session that I understand, pick up a few pointers (which I did), and think about how we might get some of the spatial data included in the work that we currently do.

Conclusion

I can honestly say this was a great (understatement of the week) conference. I learned a ton, learned that I still have a ton to learn, and met some of the nicest, smartest people you will ever meet. It was great to put faces to many of those that I talk with regularly, and meet some great new friends. Not to mention the fact that I now have a bunch of work to do to get the presentations ready that I have promised to a variety of folks, including my new friends in Australia. One heck of a week…
