Richmond CodeCamp Disk IO Presentation

This is a very similar post to the one from the user-group presentation post… and in fact the presentation itself is very similar.  It has been updated, and a few things added since the last presentation, so I did want to post the updated version which was presented at codecamp.   I hope that the presentation left you with a few things to go think about, and gave you some tools to work with.

If you came out to see this presentation, please provide me some feedback through speaker-rate.  The feedback helps me build better presentations… and hopefully provide better, more informative sessions to you in the future.

I enjoyed the whole codecamp experience, and I hope that you got as much out of the day as I did!!  A truly impressive group of presenters with a great breadth of topics, I was glad to be a small part of the great day.

As I mentioned, everything I referenced in the presentation can be found my series on disk performance:

Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  
Help — My disk performance V   (Monitoring)  
Help — My disk performance VI (Additional monitoring and conclusions)

Also, as promised, the presentation is available for reference on my Presentations page

Help – My disk performance VI

Introduction
Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  
Help — My disk performance V   (Monitoring)  

This post will wrap-up some additional monitoring topics, and a few other odds and ends leftover from the other posts in this series.

Jumping right in, there are a few quick-checks relative to IO utilization that you can see from the database side using DMV’s.

Two Quick-Check DMV’s
sys.dm_io_pending_io_requests — If you have a large quantity of pending IO, this is an early indicator that you will see other metrics above the allowable limits for you disk configuration.  This is by no means a guarantee that you have an issue, however, is a good (quick) sanity check.

sys.dm_io_virtual_file_stats — I started the discussion about hotspots and how files are filled by SQL Server back in part IV of this series.  This view provides lots of insight into how things are being allocated on a per-file basis.

This sql is intended as a starting point … not a final solution, your usage will depend largely on how many databases you want to look at, how to report the data, etc.  The intent here is to show some output from a very useful and versatile DMV.

1) Create a table to grab the results from when done
CREATE TABLE #tmp_file_io
   ( dbname        VARCHAR(256),
       type_desc     VARCHAR(50),
       file_name     VARCHAR(256),
       read_xfer_ms  INT,
       write_xfer_ms INT   )

2) setup some variables
 DECLARE @count     INT,
                     @num_files INT
SET @count = 1

3) How many files do we need to look at?
 SELECT @num_files = (SELECT Count(*)
                     FROM   sys.database_files)

4) Go collect the data
WHILE @COUNT <= @num_files
  BEGIN

      INSERT INTO #tmp_file_io
                  (dbname,
                   type_desc,
                   file_name,
                   read_xfer_ms,
                   write_xfer_ms)

      SELECT Db_name(6)    AS dbname, <–obvious hardcode here, if your looping over databases, etc
             b.type_desc,
             b.name,
             io_stall_read_ms / num_of_reads   AS ‘disk read xfer ms’,
             io_stall_write_ms / num_of_writes AS ‘disk write xver ms’
      FROM   sys.Dm_io_virtual_file_stats(6, @count) a
             INNER JOIN sys.database_files b
                ON @count = b.file_id

      SET @count = @count + 1
  END

As a result of this, you will have a table with the IO rates by file.  You will be happy that you followed some sort of naming convention when you created the files, because this will quickly point out issues within a filegroup caused by autogrowth of files, or uneven allocation to files.  You don’t want to have gone through all the efforts to balance IO across 8 files, and then, by way of uneven allocation, have all the IO concentrated on a single file.  Again, this query (by itself) does not tell you if you have a problem, you need to review this for trends as the data is collected from instance start.  If you know you have queries that regularly access multiple years of data, across a large cross section of data, you would expect to see reasonably even distribution of IO across all the files in the filegroup.

There is certainly more detail, more script and more to write on this topic, and at some point, I am sure that I will continue with this series.  For now, I am going to wrap this one up.  There are a few things that I mentioned as I went through this series, and will be posting on all of these in the not too distant future. 

The “outstanding” topics that I touched upon, but will be covered in the future include:
–Indepth look at compression
–Resource Governor – How to
–Automated index reorg/rebuild
–Query tuning and explain plans
–Memory configuration, and considerations
–SSIS… modular design, dimension loading, etc.

Help – My disk performance V

Introduction
This is the fifth post in a series on disk performance pulled it together quickly to be included in some blogging fun for “IO, IO, It’s off to disk we go!”  on T-SQL Tuesday #004
Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  

We have covered a lot of territory in the previous 4 posts, hopefully you have an understanding of where you want to be.  In this post I start moving away from the physical layout towards monitoring.  I can’t say it strongly enough, you must monitor you servers, don’t believe someone is going to watch for you, and even if you have people with the responsibility of watching… you need to watch anyway.  Several reasons for this, first off, your responsibility is the Data Warehouse.  I am not speaking badly of other groups monitoring systems, but they have their hands FULL, and may very well have a different tolerance for unacceptable performance due to the number of systems they are responsible for and each systems unique needs. Secondly, each Data Warehouse has a “feel” of its own that you will begin to understand with consistent monitoring.  You will get to know your normal performance levels, and at a glance at a graph be able to know someone has a query that needs tuning, or an index opportunity.  The monitoring is where the “art” and “science” begin to intersect. (Query tuning… a topic for another time) is the other place where “art” and “science” cross paths.   

At the most basic, monitoring for disk-IO starts with server monitoring from the windows perf-mon tool.  Start a collection process on the server that writes to a new file every 24 hours, in this collection (at a minimum) collect the avg disk queue (for each drive), avg disk reads per sec (for each drive), avg disk writes per sec (for each drive), avg cpu (total is fine).  There are other measures that are worthwhile – sql memory, kernal memory and quite a few others…. but this topic we are looking a disk-IO… trying to stay on topic (and will talk about monitoring tools that go beyond perfmon shortly).  

Now that you’re collecting data, you also need to know when things are happening on your server.  At the highest level:  

  1. When does your index maintenance run, and for how long? (you have automated index reorganize/rebuilds based on percent fragmentation right?!?… also another topic for later)
  2. When does your backup run, and for how long? (are you using SQL Server 2008 backup compression? … you should)
  3. Do you have other maintenance that is regularly scheduled (you’re doing a weekly dbcc checkdb right?)
  4. When does your normal nightly data-load start, and how long does it normally run?
  5. Do you have scheduled reporting? (monthly reports, thing that run off-hours due to their size etc)
  6. Do you have reporting that is triggered by the completion of a load?
  7. Do you have resource group allocations scheduled to change based on time/load etc? (Are you using SQL Server 2008 resource governor? …also another topic for later, but you should)

Loading up a representative day (don’t start this exercise with a Saturday if that is your least utilized time, nor should you start by looking at Sunday if that is the day you have a checkdb running for 10 hours) into the performance monitor tool to review the results.  Ultimately, you need to understand all of the time periods, including maintenance windows, but for now, lets start with a normal representative day.  

Average disk queue
Assuming you have 40 drives for each of you data drives, then the simple answer is, your avg disk queue of 2 per drive should stay bellow 80 on each of your data drives.  Don’t worry too much at this point about a spike above 80… if you sustain a queue above 80 you will have performance issues.  

It is possible to have a disk-queue reported by perf-mon on the windows side, and the disks be idle, if you are seeing disk queues, you should get your SAN administrator involved to look at the disk statistics for the LUN in question on the SAN.  IF you have no disk pressure in the SAN, and the queue is high in Windows, you have HBA latency that needs to be addressed.  

HBA Latency 
 This is really unlikely – even with a dual-port 2gb card this is not likely the slow part,  with a dual-port 4gb card it is very very unlikely. Most likely if you are seeing latency on the HBA there is something that is set incorrectly, or needs a firmware upgrade.
If you are actually seeing latency:
–Validate your port speed setting for both the card and the SAN switch (I have seen a switch port set to 100mb, you WILL have a performance impact)
–Validate your HBA firmware versions and compatibility with SAN OS versions, your SAN vendor will have very specific recommendations.
–Validate the SAN connections to the switches

Average reads/writes
Assuming you have 40 drives for each of your data drives, the simple answer is, the reads/writes should stay below 180-200 per drive.  This allows for 7200 – 8000 IO’s per second on each drive.  Again, don’t worry about spikes above this level unless they are sustain, or very close together. 

Average CPU
This has less to do with disk-IO, and more to do with overall performance measures.  An important thing to understand, and provides a large amount of insight into the overall responsiveness of your server.  Get to know what your normal CPU profile looks like in the different points of time I mentioned above. 

Examples
This chart shows a 9 drive array that has issues.  A little hard to read here, but the queues (blue) are WELL above 18 and the avg read/write activity is WELL above the 1600 – 2000 level.

This chart shows a 40 drive array that is performing within acceptable tolerances.  These queues show some spikes in activity, still remaining bellow the allowable 80 for this configuration, and going back to an earlier point about knowing what happens when…. those are all during maintenance and backup windows.

Help — My disk performance IV

Introduction
This is the fourth post in a series on disk performance
Help — My disk performance       (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help– My disk performance III  (Partitioning and compression)

Now that we have partitioning covered, filegroups should actually be relatively simple because it just becomes a physical discussion about the logical splits we just arrived at.  Just to restate the obvious, PLEASE come up with a naming convention for your filegroups, file-names, and directory structures. 10 minutes (literally) of thought about this up front will make monitoring, scripting, moving, etc MUCH easier later.

Data Files
The Windows OS operates on file-handles, and depending on the number of processors and read/write to more than 1 file, you will get many opinions on this one, and there are some “it depends” sort of answers…. what I have found through testing with 8 processors and the disk configuration discussed to this point in this blog, 1 file per processor for data files +1 file. (Log files in a moment)  More than 8 files in the filegroup seemed to add little measurable value, the 9th file is my insurance policy for unexpected growth.  Understanding how Sql Server writes to the files is also important, the engine will do it’s best to keep the amount of data equal in all the files in a filegroup.  So having a file-group with 7 files at 75% full, if you add a 8th file all the new data will go to the 8th file till it is 75% full, effectively removing the benefit of multiple files and creating a hotspot on that file.  (I will have more specifics, DMV’s, etc, for monitoring in a later post).Some basic rules:

  1. Pre-allocate space to all 8 datafiles
  2. Do not allow autogrowth on the 8 main files in the filegroup.
  3. Set the initial size small on the 9th file, and allow autogrowth to some number your’re comfortable with (I allow the ‘extra’ file to have growth to the same size as the primary files) I don’t like the autogrowth, but I like things failing with space issues less.
  4. Monitor the extra file for growth…if it is growing, you messed up your space projections, and you need to add space to the other 8 files.
  5. It is best to get the calculations correct, and preallocate the 8 files, this will prevent physical disk fragmentation as the files are added to.
  6. Monitor disk fragmentation on the server.

 Primary – personally, nothing against primary, but I put nothing into primary, and monitor what is in that space.  IF something went to primary, someone (probably me) built something without thinking all the way through.  I would allow for 8 data files preallocated to a reasonable size relative to your other dataspaces.  IF something goes into primary, you don’t want things failing or performing poorly, you can move things around in the background if something that goes there needs to be kept.

Expanding upon a diagram we looked at earlier, you can see the number of filegroups, and their relationship to the physical disks

Beyond the annual partitions, you see a filegroup for “Dim Data” (dimension data), “Dim Idx” (dimension indexs) “Data 1/2/3″ (generic data spaces)  You are going to have some data that does not belong to dimensions, and is not partitioned.  Hopefully all of the large stuff is partitioned, but, you need to have a non-partitioned home for some large tables as well.  If you KNOW two of these large tables are always used together, put them in data spaces that are on different drives. “Index” (non-partitioned indexs).  Depending on the quantity of non-partitioned tables, you may want to consider an 2″ space to allow for separation of the index IO from the table IO. 

Log files
Basically the same principles apply to log files (remember, they are going to their own drive).  Through my testing, due to the high level of activity,  logging can benefit from additional files (3 per cpu).  To avoid the over head associated with growing log files set them to their maximum size and do not allow for autogrowth.  Because there is no autogrowth on these files, you need to ensure you have sufficient log space available to manage data loading and queries, as well as monitoring in place to detect logs filling.  In addition to these file management recomendations, logging for a Data Warehouse should be set to simple mode. (Backup and recovery design for a Data Warehouse is another topic).

Naming Conventions
Each data drive should have a sqldata directory followed by a directory for each database (please don’t be one of those people that drop a few hundred files from all the databases on a server in one directory).  Beyond that, each filegroup should get a name.

J:\                                                                      <—- Data Drive
      SQLDATA                                                  <—- SQL Data directory (this will matter when you have ssas on the same drive as data) or sqllog
            Example_DB                                      <—- Each database has a directory
                  Example_db_data_2005_1  <—- Name each file – dbname + file type (data/log) + year + number
                  Example_db_data_2005_2
                  Example_db_data_2005_3
                                ……
                  Example_db_data_2005_7
                  Example_db_data_2005_8
                  Example_db_data_2006_1
                             …….

 As you can well imagine, there are going to be quite a number of files, so maintaining order within the naming will give you the ability to script, and otherwise programatically monitor things.  This will also provide a very clear delineation of which data resides in which file.  When it comes time to move the oldest data to a different drive, it is as simple as issuing the command in the database to provide the new location of the file, shut-down the database, physically copy the file to the new location, restart the database.  This will become important once your primary “fast” drives start to fill, and the old (least used) data can be migrated to larger (slower), less expensive LUNS.

Help – My disk performance III

Introduction
This is the third post in a series on disk performance
Help — My disk performance       (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)

 Now that we have covered a large portion of the physical disk (leaving aside physically aligned partitions for a moment), I am going to jump over the next layer in the stack which would be filegroups and files to talk about partitioning  and compression first.  The logic of the filegroups and files will be easier to cover after this discussion.

Sticking with the basic layout that we arrived at earlier with two data drives, and understanding that we have a range partition scheme available to us (I wish we could subpartition as in Oracle… maybe someday), but given the range you need to decide what is a logical split.  The split needs to be logical to you… as well as your users… partition elimination in queries is a powerful thing (throw in compression, and start scanning all partitions because you picked something not logical to your users it is equally powerful in a bad way).  Making a general statement, in a data warehouse you can generally find a way that annual partitions make sense to everyone.  It might be calendar year, it might be fiscal year, it might be a sales year ending  in August… but generally every business will have some sort of logical annual cycle.  Given the fact that your building a warehouse, your likely interested in looking at comparing data between last year, and how performance is this year…. multi year trends etc.

Jumping right in, to create a partitioned table, you need to have the partition function, which defines your ranges, and a partition scheme which creates the connection between the file-group and the partition values that are to go into that range.  For your own sake, come up with a naming convention that makes sense.  This seems like an obvious point, but if you have an annual range for 2005 data and the filegroup it is associated with is named database_name_filegroup_6, (which then has files associated) you will (look at my picture) pull out your hair.  Once we get to filegroups, number of files etc I will go over this in more detail.  For now, it is sufficient to say that there are file-groups for each of the years, and each filegroup is associated with a given range.

CREATE PARTITION FUNCTION [Annual_Partition_PF1](int) AS RANGE LEFT FOR VALUES (2005, 2006, 2007, 2008, 2009, 2010);

CREATE PARTITION SCHEME [Annual_Partition_PS1] AS PARTITION [Annual_Partition_PF1] TO ([data_2005], [data_2006], [data_2007], [data_2008], [data_2009], [data_2010], [data_3]);

 

Tying this partitioning discussion back to the disks that we talked about previously, each partition (by way of file-group) is tied to a drive letter, which is one of the two data luns from the SAN.

Thus, if you are requesting data to compare 2010 and 2009 data you are accessing data (in parallel) from two different drives, which are connected to two different storage processors, which are connected to 2 different LUNS…load balanced across two switches connected to your dual-port HBA. I hope you’re starting to see this relatively clearly.

  

   

Compression: I  (at some point) I will go into greater depth on this topic alone, but for our purposes here, keeping it in the world of disk IO performance… proceed (with measured caution) but please do yourself the favor of working through the process of getting things compressed.  There are some gotchas that can tank your query performance, but the upside is HUGE.   Imagine each read of an 8k page brings back 16k of data, now we are getting disk IO performance!

 ALTER TABLE [responder] REBUILD PARTITION = 5 WITH(DATA_COMPRESSION = PAGE )

  

  

The space savings are obvious, and as we discussed with RAID-10 configuration, every MB of actual storage requires 2 MB of physical disk to allow for mirroring. If you can save 300Gb you have realized quite a substantial savings in disk.

The downside… DO NOT (at least not with regularity) update/delete/insert to the compressed partitions.  You WILL pay a penalty large enough that you will notice.  (I have seen cases on large tables with compressed index’s, etc, where the compression can cause a 4x slower response for update/inserts/deletes).

Remembering that we are talking about a warehouse layout, it is relatively safe to compress the data that is a few years old.  IF on the outside chance there is a historical update to the data it will run longer… but the space savings, and read rates are well worth the tradeoff.

Help – My disk performance II

 

 

OK, Post number two in two days on a rather weighty topic… don’t expect this forever! I am prepping for a presentation, and using this to collect my thoughts, as well as provide a place to point to during the presentation… there is no way I can get to all this detail in a presentation, so my hope is to provide a place for some self-paced study of my rambling thoughts.

Introduction
This is the second post in a series on disk performance
Help — My disk performance   (RAID how/why / what to put where)

 Now that we have covered some of the basics about disk, raid configurations, number and size of drives, today’s discussion will move up the stack (just a little) to the connections, and layout of those disks.  This discussion is more theoretical than specific to a particular SAN vendor, the logic applies regardless of the hardware provider, be it EMC, Hitachi, HP, etc.

Some basic assumptions (Your SAN admin folks are going to have done a good job protecting you at the san/network levels with redundancy.)

  1. SAN fabric has a minimum of two switches (This provides failover and redundancy in the network layer) should a switch fail, someone unplug something, break a cable, etc, the two switches are independent and can provide complete access to the SAN.
  2. There are a minimum of two storage processors internally in the SAN.  This provides load-balancing internally, as well as failover internally within the SAN should a processor fail.

I have seen quite good performance from a single dual-port card connected at 2gbps. My recommendation is to connect at 4gbps to effectively eliminate discussions about HBA latency (more on this another time).  A single dual-port card does little for redundancy and failover on the server, but it does provide load-balancing across the two ports, and does provide for connections to both halves of the SAN, which does have redundancy in the fabric as well as internally in the SAN.

 

As you can see in the illustration, the dual port card has a connection to each switch and each switch is connected to a storage processor.  Working with the setup of a minimum of two data drives, each drive is a 1tb lun spanning 40 physical drives(Previous post).  Each LUN gets assigned a primary storage processor, given two drives, and two storage processors (each with their own read-ahead and buffer caches) each of the two luns should be split between the storage processors.

Monitoring, is topic for another day, but here are some basic numbers to live by (if they don’t make sense… stay tuned, I promise I will get there)

  • IOPS, when calculating IO, use 180 – 200 per drive in the LUN
  • Access time 5ms – 8ms (as measured on the SAN)
  • Disk queue 2 per drive in the LUN
  • 1tb per LUN

The filegroups, numbers of files, file sizes are all going to be part of a later post.

Help – My disk performance

First of all, let me start by saying this topic (for those who care) seems to generate quite a bit of conversation.  This is the first installment of what will be many on this topic.  Just in writing this one, you will see my struggle to stick to the topic.  The context of this ramble is based on a few assumptions:

  • We are talking about a Data Warehouse
  • You are running SQL Server (hopefully 2008)
  • You are connected to a SAN (Storage Area Network)

OK, with that out of the way, we are going to start at the bottom of the stack with some basics.

  • RAID-0   Please don’t use this…. stripe set built across multiple disks, performs well, but if you lose 1 disk, you have lost the whole stripe set.
  • RAID-5 Data is striped across 3 or more disks.  This configuration tolerates 1 disk failure in the set (performance will be degraded until the disk is replaced).  This is an effective layout for database log-files.  First,  I would recommend running a warehouse in Simple-mode so the log storage is only important during the length of a transaction.  Secondly, performance is good for sequential write activity (ideal for simple-mode logging).  Third save your money (and SAN slots) and get more pysical storage from a raid-5 group due to the lack of mirroring.
  • RAID-10 (1+0) has 4 or more drives.  Mirrored data (raid-1) between two sets of drives which are then striped across multiple disks (raid-0).  This will give you the performance gains of raid-0, the data security of allowing any drive in the set to fail, as well as leveraging some serious SAN technology with read-ahead, buffering, etc, as the SAN has 2 copies of the data to work with.

Basics of RAID covered, so how many drives (LUNS) should I present to the server?  (Number of data files and many other things will be in a later post)

  1. Log-drive (RAID-5)  150Gb or larger.  There should be NOTHING on these drives in the SAN other than log data.  If your sharing the SAN with other applications, DO NOT put other data on these drives.  Other Logs from SQL server databases…. sure…. other application data…. don’t you dare.  The SAN will leverage quite a bit of technology to optimize the read-write activities for a given set of drives in the array, log activity is all similar, and thus buffer allocation to those drives will be optimized for that activity.  If you mix the activity you will blow the ability of the SAN to help optimize the kind of activity you’re sending.
  2. TempDB (RAID-10) 150GB or larger.  This should be a separate drive from the other drives you present to the server.  If you can… put this on its own set of drives that shares other tempdb activity from other servers.  This drive is also the drive for the other system databases.
  3. SSAS (RAID-10) 400Gb or larger.  I also have a user work database that I allocate on this drive.  Basically allocate all of the user-related “uncontrollable” …. ok uncontrollable is not the right word, but the fact that users can use data-mining structures in excel that create back-end cubes, or users that can dump rows into a work table is a little uncontrollable from an admin perspective.  (more on the controls later)…. but you get the point…. if you put all of this stuff on its own logical drive, you will limit both the performance impact as well as space growth of the adhoc activity.
  4. Data Drive (x)  (RAID-10)  1Tb each(minimum 2 drives), across 40 physical drives.  These should be 146Gb (or smaller) … FAST drives.  Given that we are in a raid 10 configuration, 146Gb drives format to 133Gb, gives you (5320/2)=2660GB of “allocateable” space.  Thus, in those 40 drives you can allocate 2 1tb luns , as well as a 600Gb lun for SSAS/User space.  The reason for a 1tb cap on the drives and how to lay datafiles onto these 2 (or more) drives are all later topics.  (This ramble will go on for a while — stay tuned).
  5. Historical data drive (RAID-10) - 1Tb drive of larger – slower drives. Feel free to deploy 300Gb drives, allocated across fewer (20 maybe even 10) physical drives.  I am trying desperately to stay away from “it depends” in all of this, and give as many actual numbers that I can… but this depends on how often the old data that you put on this drive is accessed.  Utilizing compression (more on this later…. really, hopefully this post will turn into something useful someday) you can get very good read-rates.  300Gb drives in smaller number will NEVER perform like the smaller counter-parts, so long as your users can tolerate historical reporting, and analysis to perform slower… this should be fine.  In addition, the day-to-day analysis of this older data will likely reside in a SSAS partition that will not normally be refreshed.
Follow

Get every new post delivered to your Inbox.