Warehouse Design Tip / Partitioning

Introduction

This topic seems so simple, and so obvious to me, and do not mean to insult anyone’s intelligence.  However, it is shocking to me the number of people that I have talked to (running large databases) that are not taking advantage  of one of the best performance, and manageability enhancements out there!  Think about the classic phone-book example everyone uses for index’s…. but broaden the example a little.  Consider you have a phone book from NY City, LA, Miami, and Chicago all combined into a single book…. even well indexed it is going to a little bit of effort to find all of the records for a given name in a given city.  Now if you partition that giant book on city, you are not even going to go look in the books that you’re not interested in. …OR… lets say you wanted to find a given name in all four cities, if you have one giant book… you will start looking… if you have the book partitioned into separate books, you could hand one book to three of your friends (threads) and each of you start looking in parallel… clearly divide and concur will get you the answer much more quickly.  

 In a warehouse environment I suggest finding some sort of logical annual partition.  It may be a calendar year, it may be a fiscal year, it may be some annual business cycle.  Make sure what you pick will make sense to your users (to make use if this power they need to include it as filter/join criteria in their queries).  If you pick an annual partition based on calendar year, however the business data is always looked at based on a fiscal year ending in July, even if the queries are well written to filter on the year, you will always use 2 partitions to answer a question because the fiscal year splits 2 calendar years.  

Example

 You can use the wizard in SSMS, or scripts either way you will need to create a partition function  

CREATE PARTITION FUNCTION [year_PF1](INT)
   AS range LEFT
   FOR VALUES (2005, 2006, 2007, 2008,
               2009, 2010, 2011, 2012,
               2013, 2014)  

Once you have the partition function, you will need the partition scheme to be associated to the files groups using a partition scheme  

CREATE PARTITION SCHEME [year_PS1]
   AS PARTITION [year_PF1] TO
   ([data_2005], [data_2006],
    [data_2007], [data_2008],
    [data_2009], [data_2010],
    [data_2011], [data_2012],
    [data_2013], [data_2014],
    [data_2014])  

Now you can create a table on the partition scheme allowing all data to be grouped together by year. The key portion of this statement is the “on” at the end of the table create statement.  

CREATE TABLE partition_example
   (      
      example_year  INT,
       fname         VARCHAR(100),
       lname         VARCHAR(100),
       rec_create_dt DATETIME,
       rec_mod_dt    DATETIME,
       audit_key     INT
   )  ON year_ps1(example_year)  

   

Conclusion

This is an overly simple example, however, please do yourself the favor and understand how partitioning works.  See my prior discussions on disk management to see how you can simply move data files from one drive to another, as well as the great optimization, and parallelism performance improvements you will see.

Presenting Data Warehouse Disk Configuration for PASS BI

I am honored to present Data Warehouse Disk Configuration to the PASS BI Virtual Chapter Thursday June 10th at Noon EDT via LiveMeeting! Details here.

Speaker Rate Me!

This is a FULL hour, normally a 75 minute presentation, so be prepared for a lot of information at a quick pace!  I hope you can join me for some layout, recommendations, and ways to look at your disk IO with something other than “it depends”.

My Favorite New(ish) 2008 Feature (Resource Governor)

#TSQLTuesday

I have written in the past about setting up the resource governor, which I am counting as my favorite new(ish) feature, so I am taking this TSQL2sday post opportunity do my promised followup for monitoring of the resource governor.

Introduction

To start with, my advice about the resource governor is to start small.  Have some idea of what you want when you are done… do you need three pools, five pools… how may groups do you envision etc.  Once you “think” you know where you want to go, start small.  Start with a basic A/B split (user interactive Vs Batch process for example) and work your way up from there, monitoring the whole way.  Well balanced, the resource governor works very well… all your resources being consumed in a single pool gets you nowhere (and depending on your max settings can set you back quite a bit)…. so to achieve the “well balanced” lets look at monitoring.

The primary method for monitoring the internals of the resource governor is through the Performance Monitor statistics made available.

Through these metrics you will be able to see both the pool and group level information in much the same way that you would normally see sql server performance statistics, however they are now broken down at a lower level (so you can see how your allocations are performing).  Keep in mind an important point about resource governor, if you allocate 10% of the CPU as a max to a pool, that pool can still get 100% of the CPU if nothing else is running on the machine.  The resource governor ONLY “kicks in”, holds back resources, splits resources among pools (however you wish to think of it) if there is competition between pools.  This makes it important to determine how many, and which resources you are going to assign to the pools.  Pool Stats:

I have setup

  • Default – PLEASE don’t get rid of the default pool… I had discussion of this in my prior resource governor post.
  • Internal – Internal DB processes run here
  • adhoc_pool_high – This is where I have my “normal” interactive user queries, reports.
  • adhoc_pool_mde – This is where I have the “High Utilization” tools such as SPSS.  There are tools that will select large amounts of data for mining or other analysis where runtime is not measured in seconds, and can be separated to prevent those very large queries from dominating the machine while still running in reasonable times.
  • load_pool_high – This is where I have batch processes.  There is a scheduled job that changes priorities for this pool to a higher-level overnight for my batch-load processes.
  • maintenance_pool_low – This is where I run long-running maintenance processes that I just need to get done, but don’t care how long it takes… compressing historical data etc.  Things I want to keep out of the way of the rest of the system.

As you can see in the screenshot there is little else running right at this moment, other than the adhoc_pool_high, so in reality that pool has access to 100% of the cpu until there is something that shows up in one of the other pools that needs resources.

Workgroup setups

Assigned to each of the pools contains one or more workgroups, in this case you can see that the 74% of the adhoc_pool_high is all being used by the “special_user” workgroup.  This is where starting simple will help you greatly, there is no association (other than your knowledge… or looking in the database) between workgroups and pools that you can see in performance monitor.

My associations are

Default pool
   Default workgroup
Internal pool
   internal workgroup
Adhoc_pool_high
   Special_user (specific ID’s that I wish to GUARANTEE performance to)
   dashboard_user (Interactive BI dashboard that NEEDS to perform well)
   User Interactive (interactive reports, management studio, etc_
Adhoc_pool_med
   report_user (scheduled reports)
   user_tools (General bucket for tools other than interactive queries)
   user_spss (SPSS gets it own resource allocation to prevent it from overrunning the machine)
Load_Pool_High
   load_batch
Maintenance_Pool_low
   maintenance

It is at this level where you will be monitoring the resource allocations to see that they are meeting your expectations.  Resource governor is really where art and science meet, you will need to determine which workgroups/pools you wish to give priority to, and which ones will take the back-seat.  Once things are running that cross between workgroups that are assigned to different pools you will see the cpu control % numbers take effect.

In addition to the performance monitor stat, which I have personally found most useful for immediate feedback on the performance of the pools there are a few DMV’s that collect information along the way.  It is through these DMV’s that you can understand the associations between the workload group and the associated pool, as well as what your configurations are.

dm_resource_governor_workload_groups  Where you can see what your allocations/configurations are.
dm_resource_governor_resource_pools Where you can see what your allocations/configurations are.
dm_resource_governor_resource_configuration Is it enabled?

Follow

Get every new post delivered to your Inbox.

Join 386 other followers