Now that we have covered a large portion of the physical disk layer (setting aside physically aligned partitions for the moment), I am going to skip over the next layer in the stack, filegroups and files, and talk about partitioning and compression first. The logic of the filegroups and files will be easier to cover after that discussion.
Sticking with the basic layout we arrived at earlier, with two data drives, we have a range partition scheme available to us (I wish we could subpartition as in Oracle… maybe someday). Given that range, you need to decide what a logical split is. The split needs to be logical to you… as well as to your users… partition elimination in queries is a powerful thing (throw in compression, and if you picked something that is not logical to your users and end up scanning all partitions, it is equally powerful in a bad way). As a general statement, in a data warehouse you can usually find a way that annual partitions make sense to everyone. It might be calendar year, it might be fiscal year, it might be a sales year ending in August… but nearly every business has some sort of logical annual cycle. And given the fact that you are building a warehouse, you are likely interested in comparing last year's data against this year's performance… multi-year trends, and so on.
Jumping right in: to create a partitioned table, you need a partition function, which defines your ranges, and a partition scheme, which maps each range to the filegroup that will hold its data. For your own sake, come up with a naming convention that makes sense. This seems like an obvious point, but if your annual range for 2005 data lands in a filegroup named database_name_filegroup_6 (which then has files associated with it), you will (look at my picture) pull out your hair. Once we get to filegroups, number of files, etc., I will go over this in more detail. For now, it is sufficient to say that there is a filegroup for each year, and each filegroup is associated with a given range.
CREATE PARTITION SCHEME [Annual_Partition_PS1]
AS PARTITION [Annual_Partition_PF1]
TO ([data_2005], [data_2006], [data_2007], [data_2008], [data_2009], [data_2010], [data_3]);
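The scheme references a partition function, [Annual_Partition_PF1], that is not shown here. A minimal sketch of what it might look like, assuming a datetime partitioning column and RANGE RIGHT boundaries on each January 1st (the boundary dates and data type are my assumptions, not taken from the original):

```sql
-- Hypothetical partition function behind [Annual_Partition_PS1].
-- RANGE RIGHT means each boundary value belongs to the partition on its
-- right, so '2006-01-01' starts the 2006 partition, and so on.
CREATE PARTITION FUNCTION [Annual_Partition_PF1] (datetime)
AS RANGE RIGHT FOR VALUES
    ('2006-01-01', '2007-01-01', '2008-01-01',
     '2009-01-01', '2010-01-01', '2011-01-01');
```

Six boundary values produce seven partitions, matching the seven filegroups listed in the scheme: everything before 2006 lands in [data_2005], each year gets its own filegroup, and [data_3] catches everything from 2011 forward.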
Tying this partitioning discussion back to the disks we talked about previously, each partition (by way of its filegroup) is tied to a drive letter, which is one of the two data LUNs from the SAN.
Thus, if you are requesting data to compare 2010 and 2009, you are reading (in parallel) from two different drives, which map to two different LUNs, owned by two different storage processors, load balanced across two switches connected to your dual-port HBA. I hope you're starting to see this relatively clearly.
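To make partition elimination concrete, here is a sketch of the kind of query that benefits; the [responder] table name is carried over from the later compression example, and the sale_date and amount columns are illustrative assumptions:

```sql
-- Assuming sale_date is the partitioning column, this predicate lets the
-- optimizer touch only the 2009 and 2010 partitions -- and therefore only
-- the [data_2009] and [data_2010] filegroups on their respective drives.
SELECT YEAR(sale_date) AS sale_year, SUM(amount) AS total_sales
FROM [responder]
WHERE sale_date >= '2009-01-01'
  AND sale_date <  '2011-01-01'
GROUP BY YEAR(sale_date);
```

If you want to confirm which partition a given value maps to, the built-in $PARTITION function will tell you, e.g. SELECT $PARTITION.Annual_Partition_PF1('2009-06-15');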
Compression: at some point I will go into greater depth on this topic alone, but for our purposes here, keeping it in the world of disk IO performance… proceed (with measured caution), but please do yourself the favor of working through the process of getting things compressed. There are some gotchas that can tank your query performance, but the upside is HUGE. Imagine each read of an 8 KB page bringing back 16 KB worth of data… now we are getting disk IO performance!
ALTER TABLE [responder] REBUILD PARTITION = 5 WITH (DATA_COMPRESSION = PAGE);
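To check which partitions actually ended up compressed after a rebuild, you can query the sys.partitions catalog view; a quick sketch, with the table name carried over from the example above:

```sql
-- data_compression_desc reports NONE, ROW, or PAGE for each partition.
SELECT p.partition_number,
       p.rows,
       p.data_compression_desc
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'responder')
  AND p.index_id IN (0, 1)   -- heap or clustered index only
ORDER BY p.partition_number;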
The space savings are obvious, and as we discussed with the RAID-10 configuration, every MB of usable storage requires 2 MB of physical disk to allow for mirroring. So if compression saves you 300 GB of data, you have actually freed up 600 GB of physical disk… quite a substantial savings.
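Before committing to a rebuild, SQL Server will estimate the savings for you via a stock system procedure; a sketch, with the schema and partition number as illustrative values:

```sql
-- Estimates PAGE-compression savings for partition 5 of [responder]
-- by sampling the data into tempdb; the table itself is not modified.
EXEC sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'responder',
     @index_id         = NULL,    -- NULL = all indexes on the table
     @partition_number = 5,
     @data_compression = N'PAGE';
```

Running this against your oldest partitions first is a cheap way to decide whether the rebuild is worth scheduling.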
The downside… DO NOT (at least not with regularity) update, delete, or insert into the compressed partitions. You WILL pay a penalty large enough to notice. (I have seen cases on large tables with compressed indexes where compression made updates, inserts, and deletes roughly 4x slower.)
Remembering that we are talking about a warehouse layout, it is relatively safe to compress data that is a few years old. If, on the off chance, there is a historical update to that data, it will run longer… but the space savings and read rates are well worth the tradeoff.