My Favorite New(ish) 2008 Feature (Resource Governor)

#TSQLTuesday

I have written in the past about setting up the resource governor, which I am counting as my favorite new(ish) feature, so I am taking this TSQL2sday post opportunity do my promised followup for monitoring of the resource governor.

Introduction

To start with, my advice about the resource governor is to start small.  Have some idea of what you want when you are done… do you need three pools, five pools… how may groups do you envision etc.  Once you “think” you know where you want to go, start small.  Start with a basic A/B split (user interactive Vs Batch process for example) and work your way up from there, monitoring the whole way.  Well balanced, the resource governor works very well… all your resources being consumed in a single pool gets you nowhere (and depending on your max settings can set you back quite a bit)…. so to achieve the “well balanced” lets look at monitoring.

The primary method for monitoring the internals of the resource governor is through the Performance Monitor statistics made available.

Through these metrics you will be able to see both the pool and group level information in much the same way that you would normally see sql server performance statistics, however they are now broken down at a lower level (so you can see how your allocations are performing).  Keep in mind an important point about resource governor, if you allocate 10% of the CPU as a max to a pool, that pool can still get 100% of the CPU if nothing else is running on the machine.  The resource governor ONLY “kicks in”, holds back resources, splits resources among pools (however you wish to think of it) if there is competition between pools.  This makes it important to determine how many, and which resources you are going to assign to the pools.  Pool Stats:

I have setup

  • Default – PLEASE don’t get rid of the default pool… I had discussion of this in my prior resource governor post.
  • Internal – Internal DB processes run here
  • adhoc_pool_high – This is where I have my “normal” interactive user queries, reports.
  • adhoc_pool_mde – This is where I have the “High Utilization” tools such as SPSS.  There are tools that will select large amounts of data for mining or other analysis where runtime is not measured in seconds, and can be separated to prevent those very large queries from dominating the machine while still running in reasonable times.
  • load_pool_high – This is where I have batch processes.  There is a scheduled job that changes priorities for this pool to a higher-level overnight for my batch-load processes.
  • maintenance_pool_low – This is where I run long-running maintenance processes that I just need to get done, but don’t care how long it takes… compressing historical data etc.  Things I want to keep out of the way of the rest of the system.

As you can see in the screenshot there is little else running right at this moment, other than the adhoc_pool_high, so in reality that pool has access to 100% of the cpu until there is something that shows up in one of the other pools that needs resources.

Workgroup setups

Assigned to each of the pools contains one or more workgroups, in this case you can see that the 74% of the adhoc_pool_high is all being used by the “special_user” workgroup.  This is where starting simple will help you greatly, there is no association (other than your knowledge… or looking in the database) between workgroups and pools that you can see in performance monitor.

My associations are

Default pool
   Default workgroup
Internal pool
   internal workgroup
Adhoc_pool_high
   Special_user (specific ID’s that I wish to GUARANTEE performance to)
   dashboard_user (Interactive BI dashboard that NEEDS to perform well)
   User Interactive (interactive reports, management studio, etc_
Adhoc_pool_med
   report_user (scheduled reports)
   user_tools (General bucket for tools other than interactive queries)
   user_spss (SPSS gets it own resource allocation to prevent it from overrunning the machine)
Load_Pool_High
   load_batch
Maintenance_Pool_low
   maintenance

It is at this level where you will be monitoring the resource allocations to see that they are meeting your expectations.  Resource governor is really where art and science meet, you will need to determine which workgroups/pools you wish to give priority to, and which ones will take the back-seat.  Once things are running that cross between workgroups that are assigned to different pools you will see the cpu control % numbers take effect.

In addition to the performance monitor stat, which I have personally found most useful for immediate feedback on the performance of the pools there are a few DMV’s that collect information along the way.  It is through these DMV’s that you can understand the associations between the workload group and the associated pool, as well as what your configurations are.

dm_resource_governor_workload_groups  Where you can see what your allocations/configurations are.
dm_resource_governor_resource_pools Where you can see what your allocations/configurations are.
dm_resource_governor_resource_configuration Is it enabled?

Help – My disk performance VI

Introduction
Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  
Help — My disk performance V   (Monitoring)  

This post will wrap-up some additional monitoring topics, and a few other odds and ends leftover from the other posts in this series.

Jumping right in, there are a few quick-checks relative to IO utilization that you can see from the database side using DMV’s.

Two Quick-Check DMV’s
sys.dm_io_pending_io_requests — If you have a large quantity of pending IO, this is an early indicator that you will see other metrics above the allowable limits for you disk configuration.  This is by no means a guarantee that you have an issue, however, is a good (quick) sanity check.

sys.dm_io_virtual_file_stats — I started the discussion about hotspots and how files are filled by SQL Server back in part IV of this series.  This view provides lots of insight into how things are being allocated on a per-file basis.

This sql is intended as a starting point … not a final solution, your usage will depend largely on how many databases you want to look at, how to report the data, etc.  The intent here is to show some output from a very useful and versatile DMV.

1) Create a table to grab the results from when done
CREATE TABLE #tmp_file_io
   ( dbname        VARCHAR(256),
       type_desc     VARCHAR(50),
       file_name     VARCHAR(256),
       read_xfer_ms  INT,
       write_xfer_ms INT   )

2) setup some variables
 DECLARE @count     INT,
                     @num_files INT
SET @count = 1

3) How many files do we need to look at?
 SELECT @num_files = (SELECT Count(*)
                     FROM   sys.database_files)

4) Go collect the data
WHILE @COUNT <= @num_files
  BEGIN

      INSERT INTO #tmp_file_io
                  (dbname,
                   type_desc,
                   file_name,
                   read_xfer_ms,
                   write_xfer_ms)

      SELECT Db_name(6)    AS dbname, <–obvious hardcode here, if your looping over databases, etc
             b.type_desc,
             b.name,
             io_stall_read_ms / num_of_reads   AS ‘disk read xfer ms’,
             io_stall_write_ms / num_of_writes AS ‘disk write xver ms’
      FROM   sys.Dm_io_virtual_file_stats(6, @count) a
             INNER JOIN sys.database_files b
                ON @count = b.file_id

      SET @count = @count + 1
  END

As a result of this, you will have a table with the IO rates by file.  You will be happy that you followed some sort of naming convention when you created the files, because this will quickly point out issues within a filegroup caused by autogrowth of files, or uneven allocation to files.  You don’t want to have gone through all the efforts to balance IO across 8 files, and then, by way of uneven allocation, have all the IO concentrated on a single file.  Again, this query (by itself) does not tell you if you have a problem, you need to review this for trends as the data is collected from instance start.  If you know you have queries that regularly access multiple years of data, across a large cross section of data, you would expect to see reasonably even distribution of IO across all the files in the filegroup.

There is certainly more detail, more script and more to write on this topic, and at some point, I am sure that I will continue with this series.  For now, I am going to wrap this one up.  There are a few things that I mentioned as I went through this series, and will be posting on all of these in the not too distant future. 

The “outstanding” topics that I touched upon, but will be covered in the future include:
–Indepth look at compression
–Resource Governor – How to
–Automated index reorg/rebuild
–Query tuning and explain plans
–Memory configuration, and considerations
–SSIS… modular design, dimension loading, etc.

Help – My disk performance V

Introduction
This is the fifth post in a series on disk performance pulled it together quickly to be included in some blogging fun for “IO, IO, It’s off to disk we go!”  on T-SQL Tuesday #004
Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  

We have covered a lot of territory in the previous 4 posts, hopefully you have an understanding of where you want to be.  In this post I start moving away from the physical layout towards monitoring.  I can’t say it strongly enough, you must monitor you servers, don’t believe someone is going to watch for you, and even if you have people with the responsibility of watching… you need to watch anyway.  Several reasons for this, first off, your responsibility is the Data Warehouse.  I am not speaking badly of other groups monitoring systems, but they have their hands FULL, and may very well have a different tolerance for unacceptable performance due to the number of systems they are responsible for and each systems unique needs. Secondly, each Data Warehouse has a “feel” of its own that you will begin to understand with consistent monitoring.  You will get to know your normal performance levels, and at a glance at a graph be able to know someone has a query that needs tuning, or an index opportunity.  The monitoring is where the “art” and “science” begin to intersect. (Query tuning… a topic for another time) is the other place where “art” and “science” cross paths.   

At the most basic, monitoring for disk-IO starts with server monitoring from the windows perf-mon tool.  Start a collection process on the server that writes to a new file every 24 hours, in this collection (at a minimum) collect the avg disk queue (for each drive), avg disk reads per sec (for each drive), avg disk writes per sec (for each drive), avg cpu (total is fine).  There are other measures that are worthwhile – sql memory, kernal memory and quite a few others…. but this topic we are looking a disk-IO… trying to stay on topic (and will talk about monitoring tools that go beyond perfmon shortly).  

Now that you’re collecting data, you also need to know when things are happening on your server.  At the highest level:  

  1. When does your index maintenance run, and for how long? (you have automated index reorganize/rebuilds based on percent fragmentation right?!?… also another topic for later)
  2. When does your backup run, and for how long? (are you using SQL Server 2008 backup compression? … you should)
  3. Do you have other maintenance that is regularly scheduled (you’re doing a weekly dbcc checkdb right?)
  4. When does your normal nightly data-load start, and how long does it normally run?
  5. Do you have scheduled reporting? (monthly reports, thing that run off-hours due to their size etc)
  6. Do you have reporting that is triggered by the completion of a load?
  7. Do you have resource group allocations scheduled to change based on time/load etc? (Are you using SQL Server 2008 resource governor? …also another topic for later, but you should)

Loading up a representative day (don’t start this exercise with a Saturday if that is your least utilized time, nor should you start by looking at Sunday if that is the day you have a checkdb running for 10 hours) into the performance monitor tool to review the results.  Ultimately, you need to understand all of the time periods, including maintenance windows, but for now, lets start with a normal representative day.  

Average disk queue
Assuming you have 40 drives for each of you data drives, then the simple answer is, your avg disk queue of 2 per drive should stay bellow 80 on each of your data drives.  Don’t worry too much at this point about a spike above 80… if you sustain a queue above 80 you will have performance issues.  

It is possible to have a disk-queue reported by perf-mon on the windows side, and the disks be idle, if you are seeing disk queues, you should get your SAN administrator involved to look at the disk statistics for the LUN in question on the SAN.  IF you have no disk pressure in the SAN, and the queue is high in Windows, you have HBA latency that needs to be addressed.  

HBA Latency 
 This is really unlikely – even with a dual-port 2gb card this is not likely the slow part,  with a dual-port 4gb card it is very very unlikely. Most likely if you are seeing latency on the HBA there is something that is set incorrectly, or needs a firmware upgrade.
If you are actually seeing latency:
–Validate your port speed setting for both the card and the SAN switch (I have seen a switch port set to 100mb, you WILL have a performance impact)
–Validate your HBA firmware versions and compatibility with SAN OS versions, your SAN vendor will have very specific recommendations.
–Validate the SAN connections to the switches

Average reads/writes
Assuming you have 40 drives for each of your data drives, the simple answer is, the reads/writes should stay below 180-200 per drive.  This allows for 7200 – 8000 IO’s per second on each drive.  Again, don’t worry about spikes above this level unless they are sustain, or very close together. 

Average CPU
This has less to do with disk-IO, and more to do with overall performance measures.  An important thing to understand, and provides a large amount of insight into the overall responsiveness of your server.  Get to know what your normal CPU profile looks like in the different points of time I mentioned above. 

Examples
This chart shows a 9 drive array that has issues.  A little hard to read here, but the queues (blue) are WELL above 18 and the avg read/write activity is WELL above the 1600 – 2000 level.

This chart shows a 40 drive array that is performing within acceptable tolerances.  These queues show some spikes in activity, still remaining bellow the allowable 80 for this configuration, and going back to an earlier point about knowing what happens when…. those are all during maintenance and backup windows.

Follow

Get every new post delivered to your Inbox.