Warehouse Design Tip / Partitioning

Introduction

This topic seems so simple, and so obvious to me, and do not mean to insult anyone’s intelligence.  However, it is shocking to me the number of people that I have talked to (running large databases) that are not taking advantage  of one of the best performance, and manageability enhancements out there!  Think about the classic phone-book example everyone uses for index’s…. but broaden the example a little.  Consider you have a phone book from NY City, LA, Miami, and Chicago all combined into a single book…. even well indexed it is going to a little bit of effort to find all of the records for a given name in a given city.  Now if you partition that giant book on city, you are not even going to go look in the books that you’re not interested in. …OR… lets say you wanted to find a given name in all four cities, if you have one giant book… you will start looking… if you have the book partitioned into separate books, you could hand one book to three of your friends (threads) and each of you start looking in parallel… clearly divide and concur will get you the answer much more quickly.  

 In a warehouse environment I suggest finding some sort of logical annual partition.  It may be a calendar year, it may be a fiscal year, it may be some annual business cycle.  Make sure what you pick will make sense to your users (to make use if this power they need to include it as filter/join criteria in their queries).  If you pick an annual partition based on calendar year, however the business data is always looked at based on a fiscal year ending in July, even if the queries are well written to filter on the year, you will always use 2 partitions to answer a question because the fiscal year splits 2 calendar years.  

Example

 You can use the wizard in SSMS, or scripts either way you will need to create a partition function  

CREATE PARTITION FUNCTION [year_PF1](INT)
   AS range LEFT
   FOR VALUES (2005, 2006, 2007, 2008,
               2009, 2010, 2011, 2012,
               2013, 2014)  

Once you have the partition function, you will need the partition scheme to be associated to the files groups using a partition scheme  

CREATE PARTITION SCHEME [year_PS1]
   AS PARTITION [year_PF1] TO
   ([data_2005], [data_2006],
    [data_2007], [data_2008],
    [data_2009], [data_2010],
    [data_2011], [data_2012],
    [data_2013], [data_2014],
    [data_2014])  

Now you can create a table on the partition scheme allowing all data to be grouped together by year. The key portion of this statement is the “on” at the end of the table create statement.  

CREATE TABLE partition_example
   (      
      example_year  INT,
       fname         VARCHAR(100),
       lname         VARCHAR(100),
       rec_create_dt DATETIME,
       rec_mod_dt    DATETIME,
       audit_key     INT
   )  ON year_ps1(example_year)  

   

Conclusion

This is an overly simple example, however, please do yourself the favor and understand how partitioning works.  See my prior discussions on disk management to see how you can simply move data files from one drive to another, as well as the great optimization, and parallelism performance improvements you will see.

2008 Resource Governor – Does it work?

Clip to Evernote

Yes….it works….

 

Introduction

I am not going to pretend to be the expert on this topic, so please feel free to comment.  I am running resource governor in production (in a Data Warehouse environment), and simply want to share some of my observations, tips that worked, and “words of wisdom”.  This post is not a detailed instruction on how to setup the resource governor, there is a GREAT doc “Using the Resource Governor” Sql Server Technical Article by Aaron Bertrand, and Boris Baryshnikov.

Things to note:

  1. Read the doc!  There are lots of practical hints, monitoring recommendations etc.  Don’t jump in before you understand it.
  2. Keep the proc simple – especially to start off with.  I over-engineered by first attempt at the classifier proc with lookup tables for the different groups.  My aim was to make the process dynamic, and by adding/removing things from the tables at different times I could control how the groups behaved.  Not a great result…. this proc is called A LOT… make sure what you build is well optimized.
  3. Create a failsafe “out” for your account.  The FIRST thing I check in the proc is a test for my user-name and assign myself to a group.  May be paranoid here, but, if this proc goes badly, you may not get a connection to be able to alter it easily…. the last thing you want is to have to shutdown a production instance because you are trying out code that just simply did not work….. I KNOW we test, but things happen that shouldn’t.
  4. Keep the group setup simple – even if you have grand plans, start simple.  Start with your worst offending application, create the group for it, and let everything else fall into another group.  This will basically get you default (system processes), BigApp group (for me that is SPSS),  User Group (everything else).
  5. Understand the utilization you currently have.  I am not being smart… you need to do some monitoring before you start.  What does your CPU/memory utilization look like before you start.  How much parallelism are your large queries getting (the ones you’re going to put into the BigApp group).  If you don’t know where you are starting from, it is hard to know if things are getting better (or worse).
  6. Understand your target, nothing precise, this is “art”.  For me, I had a few applications that needed “special” treatment to allow everything to live together and get consistent results. So I wanted to reduce the parallelism on SPSS so it would not take-over (for long periods of time) and reduce performance for everything else.
  7. Set reasonable limits.  Especially at first, and even after running, never cut your default pool too short, you will be in for a surprise (not the good kind) if something drops into default, and you have allocated all the available resources to the other pools.
  8. Keep it simple (now I am being smart, but really… start slow with this thing you will thank me later)
  9. Use the resource governor!  A healthy dose of caution is good, but, it really does work.

Understanding the pools

Assuming you have an idle system, and a query comes into a pool that is allocated a max of 50% of cpu, it will be allowed to use all 100%, the governor does not prevent “over allocation” to a pool if the resources are idle. 

There are times (untill the machine has been truly idle for a little while) that you will see cpu utilization right at your minimum cpu guarantee.  This caused me concern at first… I had set aside a guarantee of 15% for a adhoc_pool_high, and then saw a jump (to 15%) in cpu utilization.  I thought that the resource governor was suddenly using 15% cpu overhead, that is not the case.  The governor was holding onto 15%, and anything submitted to that pool was automatically given that 15%.  You will reach an idle cpu condition, and the 15% will drop off, but if you see it, don’t assume the worst…. it is likely working exactly as you intend.

 

Monitoring

This will be another topic for at least one (if not more) posts…. but…. because I like DMV’s so much, here are a few to go play with:

  • dm_exec_sessions  (memory/cpu usage) either individually for each session…. or by group, the group_id is here.
  • dm_exec_requests (waits and times) too much cxpacket you may want to lower DOP for the group.
  • dm_exec_cached_plans  (plans cached, for the pool)  freeproccache also takes a new argument to free cache for a given pool-id
  • dm_exec_query_memory_grants (bunches of memory request info) it even gives an ideal_memory_kb for the ideal amount of memory for a request.  Particularly if you don’t have lots of memory to play with, this will give you good tuning advice.
  • dm_exec_query_resource_semaphores  – are you getting the memory you specified in the config of the pool.
  • dm_os_memory_brokers – How is memory being utilized over time.
  • dm_os_memory_cache_entries – detailed info about cache usage buffer vs non buffer.

In addition to the DMV’s there are changes to perf-mon data that you can see at the pool level.  All of this goes back to my earlier point about understanding your system before you start playing with the resource governor.  You will be quickly overwhelmed by stats, and the more pools you setup, the more stats you will be overwhelmed with.

What I ended up with

This is one of those things that changes, gets tweaked over time… but right now here is where I am:

  1. adhoc_pool_high – min cpu 10% (I set this down to 0% via a job-step during load times), but during the day this guarantees the interactive report-user (from SSRS) available CPU.
    1. Report_user group – the common account we use from SSRS lands here
    2. special_user group – the “failsafe” group for my account
  2. adhoc_pool_med – This is where all of the “normal” query tools, and users connect
    1. user_interactive group – has the highest priority in the pool, and 50% of the memory, this is basically all the users that are not otherwise specified
    2. user_spss group – has the lowest priority in the pool, 25% of the memory, and reduced parallelism to 4.  These are LARGE long running queries, this help prevent them from taking over.
    3. user_tools group – has medium priority in the pool, and contains things like excel, office services, tableau.  Reduced parallelism to 7 for this group, not a big reduction, but allows for even execution among the tools.
  3. Default – Default pool …. things land here that are not classified
  4. internal – Internal engine pool…. anything running as SA, or the account the sql engine is running as lands here
  5. load_pool_high – pool for batch-load jobs. (controled by user-account and machine they are connecting from)
  6. maintenance_pool_low – pool with reduced CPU max, this is intended for maintenance tasks that can run for a long time, and reduce the impact on the overall performance of the system. 

I hope this helps you take the steps to begin working with the resource governor!  It will take some work to get everything playing together just the way you want, and there will always be some queries that break the mold, but it is well worth the efforts.

Help – My disk performance III

Introduction
This is the third post in a series on disk performance
Help — My disk performance       (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)

 Now that we have covered a large portion of the physical disk (leaving aside physically aligned partitions for a moment), I am going to jump over the next layer in the stack which would be filegroups and files to talk about partitioning  and compression first.  The logic of the filegroups and files will be easier to cover after this discussion.

Sticking with the basic layout that we arrived at earlier with two data drives, and understanding that we have a range partition scheme available to us (I wish we could subpartition as in Oracle… maybe someday), but given the range you need to decide what is a logical split.  The split needs to be logical to you… as well as your users… partition elimination in queries is a powerful thing (throw in compression, and start scanning all partitions because you picked something not logical to your users it is equally powerful in a bad way).  Making a general statement, in a data warehouse you can generally find a way that annual partitions make sense to everyone.  It might be calendar year, it might be fiscal year, it might be a sales year ending  in August… but generally every business will have some sort of logical annual cycle.  Given the fact that your building a warehouse, your likely interested in looking at comparing data between last year, and how performance is this year…. multi year trends etc.

Jumping right in, to create a partitioned table, you need to have the partition function, which defines your ranges, and a partition scheme which creates the connection between the file-group and the partition values that are to go into that range.  For your own sake, come up with a naming convention that makes sense.  This seems like an obvious point, but if you have an annual range for 2005 data and the filegroup it is associated with is named database_name_filegroup_6, (which then has files associated) you will (look at my picture) pull out your hair.  Once we get to filegroups, number of files etc I will go over this in more detail.  For now, it is sufficient to say that there are file-groups for each of the years, and each filegroup is associated with a given range.

CREATE PARTITION FUNCTION [Annual_Partition_PF1](int) AS RANGE LEFT FOR VALUES (2005, 2006, 2007, 2008, 2009, 2010);

CREATE PARTITION SCHEME [Annual_Partition_PS1] AS PARTITION [Annual_Partition_PF1] TO ([data_2005], [data_2006], [data_2007], [data_2008], [data_2009], [data_2010], [data_3]);

 

Tying this partitioning discussion back to the disks that we talked about previously, each partition (by way of file-group) is tied to a drive letter, which is one of the two data luns from the SAN.

Thus, if you are requesting data to compare 2010 and 2009 data you are accessing data (in parallel) from two different drives, which are connected to two different storage processors, which are connected to 2 different LUNS…load balanced across two switches connected to your dual-port HBA. I hope you’re starting to see this relatively clearly.

  

   

Compression: I  (at some point) I will go into greater depth on this topic alone, but for our purposes here, keeping it in the world of disk IO performance… proceed (with measured caution) but please do yourself the favor of working through the process of getting things compressed.  There are some gotchas that can tank your query performance, but the upside is HUGE.   Imagine each read of an 8k page brings back 16k of data, now we are getting disk IO performance!

 ALTER TABLE [responder] REBUILD PARTITION = 5 WITH(DATA_COMPRESSION = PAGE )

  

  

The space savings are obvious, and as we discussed with RAID-10 configuration, every MB of actual storage requires 2 MB of physical disk to allow for mirroring. If you can save 300Gb you have realized quite a substantial savings in disk.

The downside… DO NOT (at least not with regularity) update/delete/insert to the compressed partitions.  You WILL pay a penalty large enough that you will notice.  (I have seen cases on large tables with compressed index’s, etc, where the compression can cause a 4x slower response for update/inserts/deletes).

Remembering that we are talking about a warehouse layout, it is relatively safe to compress the data that is a few years old.  IF on the outside chance there is a historical update to the data it will run longer… but the space savings, and read rates are well worth the tradeoff.

Help – My disk performance II

 

 

OK, Post number two in two days on a rather weighty topic… don’t expect this forever! I am prepping for a presentation, and using this to collect my thoughts, as well as provide a place to point to during the presentation… there is no way I can get to all this detail in a presentation, so my hope is to provide a place for some self-paced study of my rambling thoughts.

Introduction
This is the second post in a series on disk performance
Help — My disk performance   (RAID how/why / what to put where)

 Now that we have covered some of the basics about disk, raid configurations, number and size of drives, today’s discussion will move up the stack (just a little) to the connections, and layout of those disks.  This discussion is more theoretical than specific to a particular SAN vendor, the logic applies regardless of the hardware provider, be it EMC, Hitachi, HP, etc.

Some basic assumptions (Your SAN admin folks are going to have done a good job protecting you at the san/network levels with redundancy.)

  1. SAN fabric has a minimum of two switches (This provides failover and redundancy in the network layer) should a switch fail, someone unplug something, break a cable, etc, the two switches are independent and can provide complete access to the SAN.
  2. There are a minimum of two storage processors internally in the SAN.  This provides load-balancing internally, as well as failover internally within the SAN should a processor fail.

I have seen quite good performance from a single dual-port card connected at 2gbps. My recommendation is to connect at 4gbps to effectively eliminate discussions about HBA latency (more on this another time).  A single dual-port card does little for redundancy and failover on the server, but it does provide load-balancing across the two ports, and does provide for connections to both halves of the SAN, which does have redundancy in the fabric as well as internally in the SAN.

 

As you can see in the illustration, the dual port card has a connection to each switch and each switch is connected to a storage processor.  Working with the setup of a minimum of two data drives, each drive is a 1tb lun spanning 40 physical drives(Previous post).  Each LUN gets assigned a primary storage processor, given two drives, and two storage processors (each with their own read-ahead and buffer caches) each of the two luns should be split between the storage processors.

Monitoring, is topic for another day, but here are some basic numbers to live by (if they don’t make sense… stay tuned, I promise I will get there)

  • IOPS, when calculating IO, use 180 – 200 per drive in the LUN
  • Access time 5ms – 8ms (as measured on the SAN)
  • Disk queue 2 per drive in the LUN
  • 1tb per LUN

The filegroups, numbers of files, file sizes are all going to be part of a later post.

Help – My disk performance

First of all, let me start by saying this topic (for those who care) seems to generate quite a bit of conversation.  This is the first installment of what will be many on this topic.  Just in writing this one, you will see my struggle to stick to the topic.  The context of this ramble is based on a few assumptions:

  • We are talking about a Data Warehouse
  • You are running SQL Server (hopefully 2008)
  • You are connected to a SAN (Storage Area Network)

OK, with that out of the way, we are going to start at the bottom of the stack with some basics.

  • RAID-0   Please don’t use this…. stripe set built across multiple disks, performs well, but if you lose 1 disk, you have lost the whole stripe set.
  • RAID-5 Data is striped across 3 or more disks.  This configuration tolerates 1 disk failure in the set (performance will be degraded until the disk is replaced).  This is an effective layout for database log-files.  First,  I would recommend running a warehouse in Simple-mode so the log storage is only important during the length of a transaction.  Secondly, performance is good for sequential write activity (ideal for simple-mode logging).  Third save your money (and SAN slots) and get more pysical storage from a raid-5 group due to the lack of mirroring.
  • RAID-10 (1+0) has 4 or more drives.  Mirrored data (raid-1) between two sets of drives which are then striped across multiple disks (raid-0).  This will give you the performance gains of raid-0, the data security of allowing any drive in the set to fail, as well as leveraging some serious SAN technology with read-ahead, buffering, etc, as the SAN has 2 copies of the data to work with.

Basics of RAID covered, so how many drives (LUNS) should I present to the server?  (Number of data files and many other things will be in a later post)

  1. Log-drive (RAID-5)  150Gb or larger.  There should be NOTHING on these drives in the SAN other than log data.  If your sharing the SAN with other applications, DO NOT put other data on these drives.  Other Logs from SQL server databases…. sure…. other application data…. don’t you dare.  The SAN will leverage quite a bit of technology to optimize the read-write activities for a given set of drives in the array, log activity is all similar, and thus buffer allocation to those drives will be optimized for that activity.  If you mix the activity you will blow the ability of the SAN to help optimize the kind of activity you’re sending.
  2. TempDB (RAID-10) 150GB or larger.  This should be a separate drive from the other drives you present to the server.  If you can… put this on its own set of drives that shares other tempdb activity from other servers.  This drive is also the drive for the other system databases.
  3. SSAS (RAID-10) 400Gb or larger.  I also have a user work database that I allocate on this drive.  Basically allocate all of the user-related “uncontrollable” …. ok uncontrollable is not the right word, but the fact that users can use data-mining structures in excel that create back-end cubes, or users that can dump rows into a work table is a little uncontrollable from an admin perspective.  (more on the controls later)…. but you get the point…. if you put all of this stuff on its own logical drive, you will limit both the performance impact as well as space growth of the adhoc activity.
  4. Data Drive (x)  (RAID-10)  1Tb each(minimum 2 drives), across 40 physical drives.  These should be 146Gb (or smaller) … FAST drives.  Given that we are in a raid 10 configuration, 146Gb drives format to 133Gb, gives you (5320/2)=2660GB of “allocateable” space.  Thus, in those 40 drives you can allocate 2 1tb luns , as well as a 600Gb lun for SSAS/User space.  The reason for a 1tb cap on the drives and how to lay datafiles onto these 2 (or more) drives are all later topics.  (This ramble will go on for a while — stay tuned).
  5. Historical data drive (RAID-10) - 1Tb drive of larger – slower drives. Feel free to deploy 300Gb drives, allocated across fewer (20 maybe even 10) physical drives.  I am trying desperately to stay away from “it depends” in all of this, and give as many actual numbers that I can… but this depends on how often the old data that you put on this drive is accessed.  Utilizing compression (more on this later…. really, hopefully this post will turn into something useful someday) you can get very good read-rates.  300Gb drives in smaller number will NEVER perform like the smaller counter-parts, so long as your users can tolerate historical reporting, and analysis to perform slower… this should be fine.  In addition, the day-to-day analysis of this older data will likely reside in a SSAS partition that will not normally be refreshed.
Follow

Get every new post delivered to your Inbox.

Join 373 other followers