This is the fifth post in a series on disk performance. I pulled it together quickly to join the blogging fun for “IO, IO, It’s off to disk we go!” on T-SQL Tuesday #004.
Help — My disk performance (RAID how/why what to put where)
Help — My disk performance II (Physical connections to the SAN)
Help — My disk performance III (Partitioning and compression)
Help — My disk performance IV (filegroups and files)
We have covered a lot of territory in the previous four posts, and hopefully you now have an understanding of where you want to be. In this post I start moving away from the physical layout and toward monitoring. I can’t say it strongly enough: you must monitor your servers. Don’t assume someone else is going to watch for you, and even if you have people with the responsibility of watching… you need to watch anyway. There are several reasons for this. First, your responsibility is the Data Warehouse. I am not speaking badly of other groups monitoring systems, but they have their hands FULL, and they may very well have a different tolerance for unacceptable performance given the number of systems they are responsible for and each system’s unique needs. Second, each Data Warehouse has a “feel” of its own that you will come to understand with consistent monitoring. You will get to know your normal performance levels, and at a glance at a graph you will be able to tell that someone has a query that needs tuning, or that there is an index opportunity. Monitoring is one place where the “art” and the “science” intersect; query tuning (a topic for another time) is the other.
At its most basic, monitoring for disk-IO starts with server monitoring from the Windows perf-mon tool. Start a collection process on the server that writes to a new file every 24 hours, and in this collection capture (at a minimum) the avg disk queue (for each drive), disk reads per sec (for each drive), disk writes per sec (for each drive), and avg CPU (total is fine). There are other measures that are worthwhile – SQL memory, kernel memory, and quite a few others… but this topic is disk-IO, so I am trying to stay on topic (and will talk about monitoring tools that go beyond perf-mon shortly). Before you analyze the numbers, know your schedule:
- When does your index maintenance run, and for how long? (you have automated index reorganizes/rebuilds based on percent fragmentation, right?!?… also another topic for later)
- When does your backup run, and for how long? (are you using SQL Server 2008 backup compression? … you should)
- Do you have other maintenance that is regularly scheduled? (you’re doing a weekly DBCC CHECKDB, right?)
- When does your normal nightly data-load start, and how long does it normally run?
- Do you have scheduled reporting? (monthly reports, things that run off-hours due to their size, etc.)
- Do you have reporting that is triggered by the completion of a load?
- Do you have resource group allocations scheduled to change based on time/load etc? (Are you using SQL Server 2008 resource governor? …also another topic for later, but you should)
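The counter collection described above doesn’t have to be clicked together in the perf-mon UI; it can be scripted. Below is a minimal Python sketch that assembles a logman command line for those counters. The counter paths are standard Windows perf-mon names, but the specific logman flags (-si for sample interval, -cnf for the 24-hour file rollover, -v for the file-name timestamp) are my assumption; verify them against the logman documentation for your version of Windows.

```python
# Counters discussed above -- standard Windows perf-mon counter paths.
counters = [
    r"\PhysicalDisk(*)\Avg. Disk Queue Length",
    r"\PhysicalDisk(*)\Disk Reads/sec",
    r"\PhysicalDisk(*)\Disk Writes/sec",
    r"\Processor(_Total)\% Processor Time",
]

def logman_command(name, counters, sample_secs=15, rollover="24:00:00"):
    """Build a 'logman create counter' command that samples every
    sample_secs seconds and rolls to a new log file every 24 hours.
    (Flag spellings are an assumption -- check logman /? on your server.)"""
    counter_args = " ".join(f'-c "{c}"' for c in counters)
    return (f"logman create counter {name} {counter_args} "
            f"-si {sample_secs} -cnf {rollover} -v mmddhhmm")

print(logman_command("DW_DiskIO", counters))
```

Run the generated command once on the server, then `logman start DW_DiskIO`, and you have the daily files this post assumes.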
Load up a representative day (don’t start this exercise with a Saturday if that is your least-utilized time, nor should you start by looking at Sunday if that is the day you have a checkdb running for 10 hours) into the performance monitor tool and review the results. Ultimately you need to understand all of the time periods, including maintenance windows, but for now, let’s start with a normal representative day.
Average disk queue
Assuming you have 40 physical drives behind each of your data drives, the simple answer is that, at an acceptable avg disk queue of 2 per drive, the queue should stay below 80 on each of your data drives. Don’t worry too much at this point about a spike above 80… but if you sustain a queue above 80 you will have performance issues.
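The arithmetic behind that rule of thumb is simple enough to sketch, assuming the tolerance stated above of an avg disk queue of 2 per physical drive:

```python
# Rule of thumb from above: a sustained avg disk queue of 2 per
# physical drive is acceptable, so the threshold scales with spindles.
def max_disk_queue(physical_drives, queue_per_drive=2):
    """Highest sustained avg disk queue before you should worry."""
    return physical_drives * queue_per_drive

print(max_disk_queue(40))  # 40-drive array -> 80
print(max_disk_queue(9))   # 9-drive array -> 18
```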
It is possible for perf-mon to report a disk queue on the Windows side while the disks themselves are idle. If you are seeing disk queues, get your SAN administrator involved to look at the disk statistics for the LUN in question on the SAN. If there is no disk pressure on the SAN and the queue is high in Windows, you have HBA latency that needs to be addressed.
This is really unlikely – even with a dual-port 2Gb card the HBA is not likely to be the slow part, and with a dual-port 4Gb card it is very unlikely. If you are seeing latency on the HBA, most likely something is set incorrectly or needs a firmware upgrade.
If you are actually seeing latency:
–Validate your port speed setting for both the card and the SAN switch (I have seen a switch port set to 100Mb; you WILL have a performance impact)
–Validate your HBA firmware versions and compatibility with SAN OS versions, your SAN vendor will have very specific recommendations.
–Validate the SAN connections to the switches
Reads/writes per second
Assuming you have 40 physical drives behind each of your data drives, the simple answer is that the reads/writes should stay below 180-200 per drive. This allows for 7200-8000 IOs per second across each array. Again, don’t worry about spikes above this level unless they are sustained, or very close together.
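The same back-of-the-envelope math works for the IO budget, assuming the 180-200 IOs per second per drive figure above:

```python
# Rule of thumb from above: 180-200 IOs/sec per physical drive, so a
# 40-drive array supports roughly 7200-8000 IOs per second in total.
def iops_budget(physical_drives, per_drive_low=180, per_drive_high=200):
    """Return the (low, high) sustained IOs/sec range for an array."""
    return physical_drives * per_drive_low, physical_drives * per_drive_high

print(iops_budget(40))  # -> (7200, 8000)
```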
Average CPU
This has less to do with disk-IO and more to do with overall performance measures, but it is an important thing to understand and provides a large amount of insight into the overall responsiveness of your server. Get to know what your normal CPU profile looks like at the different points in time I mentioned above.
This chart shows a 9-drive array that has issues. It is a little hard to read here, but the queues (blue) are WELL above 18 and the avg read/write activity is WELL above the 1600-2000 level.
This chart shows a 40-drive array that is performing within acceptable tolerances. These queues show some spikes in activity, but they remain below the allowable 80 for this configuration, and, going back to an earlier point about knowing what happens when… those spikes are all during maintenance and backup windows.