Warehouse Design Tip -Schema’s

Introduction

How many times have you wished you could do a table rebuild during the day…. create a new calculated column (in seconds)…alphabetized the column names in a table (yes I really had requests to do this for an external tool)…. hide “unnecessary” columns from users such as checksums, audit_keys etc… code a particular table with (NOLOCK)…. security for specific objects…. etc…??

If you have wanted to do any of those, you should be using views to expose the data in the warehouse.  Because you have built a schema that contains the actual tables, your primary user views can be the default dbo owned objects, and you select the schema that the users utimatly use through the view.

An example:

Recently, due to reporting requirements, it was necessary to alter the description of a group of products, however, the source could not be changed to reflect the change in names.  Rather than build a lookup table, and translation process in the ETL load, I let the undesirable descriptions load, and built an override in the view.  In this case I did need to alter the dimension to add a column (preferred_product_type_desc) that allows nulls. The dimension load process does nothing with the new column, it is not included as comparison for type-1 changes.  Notice that the table is created in a schema.

CREATE TABLE [dw_dbo].[all_products](
   [product_type_key] [int] IDENTITY(1000,1) NOT NULL,
   [product_type_desc] [varchar](50) NOT NULL,
   [preferred_product_type_desc] [varchar](50) NULL,
   [product_category] [varchar](50) NULL,
   [product_type_id] [int] NOT NULL,
   [audit_key] [int] NOT NULL,
   [rec_mod_dt] [datetime] NULL,
   [rec_create_dt] [datetime] NOT NULL
) ON [Primary_DIM_IDX]

Next, create the dbo view (so when the users select product_type_key… from all_products they are actually selecting from the underlying schema object).You will also see a few key things in this view:

  1. There is a case statement that exposes the altered description (if there is one) for this particular field.
  2. The load audit information (that the users don’t need) is not exposed.
  3. The nolock hint is applied in the select.  You don’t want a long-running user query to prevent the load from processing data, and you certainly should not hope (or require) the user to code the hint into the query.
CREATE VIEW [dbo].[all_products]
AS
   SELECT product_type_key,
          CASE
             WHEN preferred_product_type_desc IS NOT NULL
                THEN preferred_product_type_desc
             ELSE product_type_desc
          END AS product_type_desc,
          product_category,
          product_type_id
   FROM   dw_dbo.all_products WITH(nolock)

The schema approach will also allow you to make a temporary copy of the existing table, point the view to the temporary copy while the existing table is altered, repartitioned, moved whatever the maintenance is…. once the maintenance work is complete, the view is pointed back to the permanent table.

Conclusion

Do yourself a favor, and build a layer of insulation between your data  and the outside world. User  and tools have unique requirements, and creating this separations will allow for flexibility, customization, and tool-specific needs that you will be thankful you can implement with such ease!

SSIS – Package Configurations

Introduction

SSIS provides a robust method for making packages portable between environments such as Development –> Testing –> Production.  This post will hopefully get you started with the incredible number of things you can control, scale, and make your packages portable.  I am focusing mainly on password management, however, you will also see a variety of other items that can be set in configuration files.  There are two places you can store your configuration data, first being in the database, the second being an xml file.  There are advantages to each, I have chosen to use an XML file to isolate all of the password info into the file, and not rely on either windows authentication, or keeping a password in the package to obtain the remaining information from the database.

I have chosen to have a config file for each package.  When I began this with SSIS-2005 you could not have connections in an XML file that were not used, so a single all-encompasing file was simply not possible.  With SSIS-2008 (sp1) *prior to sp1 there are still issues with this erroring at runtime* you are able to create a single file with all of the connections, even if they are not used in that particular package.  You will get a warning that there are connections that are not used, but the package will execute.  Even though this functionality is now supported, I would still recommend a file for each package, this will allow for better control over some of the other items you will be setting in a config.  For example, if you want to change the “defaultbuffermaxrows” for a dataflow (which you will), you need to be sure that all of your dataflows have unique names if you use a common xml file, additionally, you can impact other packages if you make a syntax error when you make changes.  One xml file for each package allows for flexibility, isolation, and risk reduction. That is a good trade for the VERY small amount of extra work to create the individual files.

Make it all work (Connections)

If you have read any of my other posts, you will see that I am big on naming conventions, this is no exception.  Do yourself a HUGE favor, and come up with a standard, and PLEASE don’t include the server names in the name of the connection.  You (might) have different server names in development –> test –> production.  You (might) in development host a series of source databases in a single instance that in production are on different machines etc, please keep this generic.

I have used:
[type].[server].[database].[user]
type = Ado/Native/ODBC/File whatever your connecting to
server = logical server name such as dw_db (Data Warehouse Database) dw_etl (Data Warehouse ETL server) these might equate to DB links that you should keep generic as well.
Database = What your connecting to
user = who am I connecting as

This will allow you to understand everything about a connection without making assumptions.  This is good for both the connection manager, but more importantly when you are editing your XML file later.

Package Setup

At the package level, because I have chosen to host ALL of my password information in the xml file, I set the protectionlevel to ‘DontSaveSensitive’.  The default of ‘EncryptSensitivewithuserkey’ gets people into trouble…. if you imbeded the password into the connection with that setting… YOUR windows domain account is the ONLY account that can access that password information.  When you migrate to production, and execute the package from a Job (unless your account is running the database service) the job will fail because the password information is encrypted and cannot be accessed by that account.

Package Configurations

Right-click in the empty space on the control-flow surface, and select package configurations…. you will get a wizard to walk through the rest of the steps to build a config file.

 

This screen you select the XML (or database) configuration, and the file-name (or variable) for the configuration file.  I have selected “N” as I have a local drive on my ETL server that the configurations are stored on.  This allows the configurations to be stored on the server with the production passwords, and the ONLY access to those files would be the network-admin that has access to the server.  Since my environments are consistent, the N drive is available I chose this method.  You can setup an environmental variable on the server that you could set to “N” on one machine and “X” on another, for my environment, the option shown works best for me.

Next, you can select what items are included in the configuration file, I simply select everything for the connection managers, this will allow you to configure your connection strings, passwords, user-id’s, server-names, everything you need for your connections.

Spend some time looking through the list of things you can customize (there is A LOT) and you can get pretty creative if you need to.  The other thing that I do is for each data-flow, I select the “buffertempstoragepath”, “Defaultbuffersmaxrows”, “defaultbuffersize”.  (There are a few others that may make sense….but for now lets keep it simple).  As you move from Development (which in my case is a desktop machine), to a testing environment (which is not sized for speed/performance), to production these values will change.  My default buffer settings are MUCH smaller on my dev machine than they are in production.  Another post, for another day I will talk about the tuning opportunities and how to calculate all this stuff…. but if you include it here, your package can automatically scale to take advantage of the hardware you are deploying it to.

Once you are done with this, you can name the config if you want to support environments for package generation, environment builds etc, you can leave it as configuration 1 to start with.  This will depend on how you deploy packages, and if you need to use the package deployment utilities, or if you manually deploy packages to each environment (which is what I do).

Now that you have an XML file…

  1. Open it in BIDS, select Edit/Advanced/Format Document
  2. Find the configuration strings for your connections (you will see sections that contain the names you have in the connection manager)
    1. Add Password=xxxxxxxx; (whatever your password is for the environment you are going to move this configuration to) for each of the connection strings
    2. Set the User ID=XXXXX; to the appropriate user-id (I have different users in dev/test/prod)
    3. Set the Data Source=xxxxxx; to the appropriate server
  3. In the same section, you will see additional properties
    1. Properties[Password]  (set the configured value)
    2. Properties[ServerName] (set the configured value)
    3. Properties[UserName] (set the configured value)
  4. Set the values you wish for Buffertempstoragepath  (This is where sorts happen when you run out of memory space)
  5. Set DefaultBufferMaxRows (according to which environment you’re moving this file to)
  6. Set DefaultBufferSize (according to which environment you’re moving this file to)

Save the XML file… and you are ready to go…. the location you picked in the second step of the wizard is where this file needs to be, and will be picked up at runtime to obtain the password info (and anything else that you override).

2008 Resource Governor – Does it work?

Clip to Evernote

Yes….it works….

 

Introduction

I am not going to pretend to be the expert on this topic, so please feel free to comment.  I am running resource governor in production (in a Data Warehouse environment), and simply want to share some of my observations, tips that worked, and “words of wisdom”.  This post is not a detailed instruction on how to setup the resource governor, there is a GREAT doc “Using the Resource Governor” Sql Server Technical Article by Aaron Bertrand, and Boris Baryshnikov.

Things to note:

  1. Read the doc!  There are lots of practical hints, monitoring recommendations etc.  Don’t jump in before you understand it.
  2. Keep the proc simple – especially to start off with.  I over-engineered by first attempt at the classifier proc with lookup tables for the different groups.  My aim was to make the process dynamic, and by adding/removing things from the tables at different times I could control how the groups behaved.  Not a great result…. this proc is called A LOT… make sure what you build is well optimized.
  3. Create a failsafe “out” for your account.  The FIRST thing I check in the proc is a test for my user-name and assign myself to a group.  May be paranoid here, but, if this proc goes badly, you may not get a connection to be able to alter it easily…. the last thing you want is to have to shutdown a production instance because you are trying out code that just simply did not work….. I KNOW we test, but things happen that shouldn’t.
  4. Keep the group setup simple – even if you have grand plans, start simple.  Start with your worst offending application, create the group for it, and let everything else fall into another group.  This will basically get you default (system processes), BigApp group (for me that is SPSS),  User Group (everything else).
  5. Understand the utilization you currently have.  I am not being smart… you need to do some monitoring before you start.  What does your CPU/memory utilization look like before you start.  How much parallelism are your large queries getting (the ones you’re going to put into the BigApp group).  If you don’t know where you are starting from, it is hard to know if things are getting better (or worse).
  6. Understand your target, nothing precise, this is “art”.  For me, I had a few applications that needed “special” treatment to allow everything to live together and get consistent results. So I wanted to reduce the parallelism on SPSS so it would not take-over (for long periods of time) and reduce performance for everything else.
  7. Set reasonable limits.  Especially at first, and even after running, never cut your default pool too short, you will be in for a surprise (not the good kind) if something drops into default, and you have allocated all the available resources to the other pools.
  8. Keep it simple (now I am being smart, but really… start slow with this thing you will thank me later)
  9. Use the resource governor!  A healthy dose of caution is good, but, it really does work.

Understanding the pools

Assuming you have an idle system, and a query comes into a pool that is allocated a max of 50% of cpu, it will be allowed to use all 100%, the governor does not prevent “over allocation” to a pool if the resources are idle. 

There are times (untill the machine has been truly idle for a little while) that you will see cpu utilization right at your minimum cpu guarantee.  This caused me concern at first… I had set aside a guarantee of 15% for a adhoc_pool_high, and then saw a jump (to 15%) in cpu utilization.  I thought that the resource governor was suddenly using 15% cpu overhead, that is not the case.  The governor was holding onto 15%, and anything submitted to that pool was automatically given that 15%.  You will reach an idle cpu condition, and the 15% will drop off, but if you see it, don’t assume the worst…. it is likely working exactly as you intend.

 

Monitoring

This will be another topic for at least one (if not more) posts…. but…. because I like DMV’s so much, here are a few to go play with:

  • dm_exec_sessions  (memory/cpu usage) either individually for each session…. or by group, the group_id is here.
  • dm_exec_requests (waits and times) too much cxpacket you may want to lower DOP for the group.
  • dm_exec_cached_plans  (plans cached, for the pool)  freeproccache also takes a new argument to free cache for a given pool-id
  • dm_exec_query_memory_grants (bunches of memory request info) it even gives an ideal_memory_kb for the ideal amount of memory for a request.  Particularly if you don’t have lots of memory to play with, this will give you good tuning advice.
  • dm_exec_query_resource_semaphores  – are you getting the memory you specified in the config of the pool.
  • dm_os_memory_brokers – How is memory being utilized over time.
  • dm_os_memory_cache_entries – detailed info about cache usage buffer vs non buffer.

In addition to the DMV’s there are changes to perf-mon data that you can see at the pool level.  All of this goes back to my earlier point about understanding your system before you start playing with the resource governor.  You will be quickly overwhelmed by stats, and the more pools you setup, the more stats you will be overwhelmed with.

What I ended up with

This is one of those things that changes, gets tweaked over time… but right now here is where I am:

  1. adhoc_pool_high – min cpu 10% (I set this down to 0% via a job-step during load times), but during the day this guarantees the interactive report-user (from SSRS) available CPU.
    1. Report_user group – the common account we use from SSRS lands here
    2. special_user group – the “failsafe” group for my account
  2. adhoc_pool_med – This is where all of the “normal” query tools, and users connect
    1. user_interactive group – has the highest priority in the pool, and 50% of the memory, this is basically all the users that are not otherwise specified
    2. user_spss group – has the lowest priority in the pool, 25% of the memory, and reduced parallelism to 4.  These are LARGE long running queries, this help prevent them from taking over.
    3. user_tools group – has medium priority in the pool, and contains things like excel, office services, tableau.  Reduced parallelism to 7 for this group, not a big reduction, but allows for even execution among the tools.
  3. Default – Default pool …. things land here that are not classified
  4. internal – Internal engine pool…. anything running as SA, or the account the sql engine is running as lands here
  5. load_pool_high – pool for batch-load jobs. (controled by user-account and machine they are connecting from)
  6. maintenance_pool_low – pool with reduced CPU max, this is intended for maintenance tasks that can run for a long time, and reduce the impact on the overall performance of the system. 

I hope this helps you take the steps to begin working with the resource governor!  It will take some work to get everything playing together just the way you want, and there will always be some queries that break the mold, but it is well worth the efforts.

SSIS Checksum – a must have skill

Overview

When loading from source systems that don’t yet have the SQL Server 2008 Change Data Capture (CDC) enabled, using a checksum calculation to determine if data has changed is a critical Data Warehouse load concept to understand.  This will allow you to read the source, that can have dozens of columns, and with a single lookup, and a single compare (of a integer value) and determine if data within that vast source row has changed.  Not to mention speed…. this method is much faster than a traditional compare of many fields. I have build a very simple example (3 columns in the source), will check if the data has changed in 2 of those columns, and perform the lookup/insert logic based on the outcome.         

Setup

You will need to install the checksum transformation[edit] –> as a result of this blog post, there were some comments about another module to create hash-values, also a worthwhile component to look at! Available from codeplex. <– [edit] I have included the sample tables, and data here.         

The package

              

  • OLE DB Source – reads the src table
  • chksum transformation calculates the checksum
  • derived column is setting rec_create_date and rec_mod_date with a getdate()
  • Lookup is using name to find the row in the target, if it does not exist, insert otherwise return checksum from the database
  • conditional split, check to see if the calculated checksum (based on the source) matches the checksum from the lookup
  • OLE DB Command, update the row (the checksum’s did not match)
  • OLE DB Destination, insert a new row

       Now a look a little closer at the main components of this package.          The assumption here is that “fname” is the key value for the row, if that has changed, we want to insert a new row.  If there is any change to the web_address, or email_address for this row we want to update the row.  This will allow us to do a compare without having to worry case sensitivity, null values, varchar comparisons to multiple fields etc.  Also, when you add an additional field to the source, all you need to do is add to the check-box here, and automatically all rows will be tested with the new column.     The algorithm here is the default crc32, there is also a framework checksum (md5), and original.  I would generally use the md5 version (although I forgot to change the default in this example), as, particularly with large calculations it will avoid a possible collision in the number.             The lookup is based on first-name and returns the value for the previously calculated checksum into db_chksum.  Notice that the other fields from the databse are not being returned for the comparisons, we only need the checksum                 The ENTIRE row (in this case 2 fields, but there is no limit to this) is now checked with one simple compare of the newly calculated checksum, and the checksum returned from the database.  If nothing changed… nothing will be done…. if anything changed, update all the data in the row.                  The insert – simple (just note I do not map the rec_mod_date on the insert only the rec_create_Date) to allow us to understand when rows are inserted from the source vs updates.  Normally fact loads also have additional audit info, but that is another topic.                  The Update – also simple (note that we are updating the whole row including the checksum value) The rec_create_date is not touched here, but rather the rec_mod_date is update to indicate the modifications.                          

Putting it together

On the first run, the 1 row runs down the insert leg, and inserts a row with a checksum value of: 1409434363 If you run the package again, without changing data, nothing happens, because the calculated value on the next execution matches the value already in the database.        Then…       

UPDATE sxb_chksum_src SET web_address =
http://StefBauer.wordpress.com/&#8217;  

      
 

Notice the only thing changed is the capitalization of the web address, but that is enough to create a new checksum value.   Run again… and the checksum value from the input  is -252803585 which does not match the database value, and the row runs down the update leg. You will now see the rec-mod-date update, the source values are updated, and the new checksum is stored as well.     

 
 

 
 

         
 

  
 

     
 

  
 

   

Data Quality -Dynamic Business Validation

Overview

As far as business intelligence, and data warehousing is concerned, Data Quality has come to mean many different things, and is one of those buzz words that everyone likes to hear.  It implies that you understand the data in your intelligence systems, and implies that the answers you are getting out are trustworthy and reliable, all of that is true, but you need to identify, monitor, remediate, and communicate the inevitable issues with your data if you are going to move beyond implied quality to a truly trusted source.  Since the enterprise data warehouse comprises information from numerous sources systems throughout an organization dirty and inconsistent data is inevitable.  This article is intended to provide a framework for that monitoring process that is flexible, extensible, and most importantly automatic.

A key component of a successful business validation implementation is a partnership with business so that all the rules are understood.  As long as the customer can identify legitimate values for fields, the query to do the verification can be easily inserted into the process.  As with most  data warehouse load processes, things are done in a logical order:

  1. Dimension tables are loaded to establish any new and changing key values
  2. Business validation is executed against the dimension tables to identify any inconstancies, missing data, or other issues.
  3. Fact tables are loaded
  4. Business validation is executed against the fact tables.  This can be done in parallel for each table that was loaded.
  5. Automated post-load reporting.  These can be business reports based on the data just loaded, as well as listings of any business validations that were triggered due to the introduction of new data.

The Problem

Your company has a key performance indicator (KPI) that is driven by a date that defines the number of days it took to deliver a product.  Imagine what that measurement would look like if someone keys in a valid date, but is next year?  Or transposes the month and keys in 10 instead of 01?  The ETL process has a hard time transforming that data, it is valid, and difficult at load time to determine the intended meaning.  Most often this type of data is loaded, and thus reported on.  Something that simple, if not identified, can significantly undermine the confidence in a key metric, and thus undermine the confidence in an entire source (implied and real) data quality suffers.

The Infrastructure

The bus_val_message table contains the queries that will be looped through by the stored procedure.

Column Description
Query_id Unique identifier for the validation query
Validation_msg The message you want to see on the report when this validation is tripped
Message_cat Logical grouping of the message
Severity Group into Critical/Major/Minor/Informational
Table_name The source table being validated by this query
Zero_rows_ok True/False some validations should find data, others should not.
Query The actual query to be run to validate the data that has been loaded
Column_name The column in the source table being validated
Table_type Fact or Dimension
Group_no Allow the stored procedure to be called for everything in group_no 1 etc.

 

The bus_val_result table contains the queries that will be looped through by the stored procedure.

Column Description
Job_dt Job date that this result row ties to
Validation_msg The message you want to see on the report when this validation is tripped
Message_cat Logical grouping of the message
Severity Group into Critical/Major/Minor/Informational
Table_name The source table being validated by this query
Row Cnt Count result from the query
Query_id Which query does this result row relate to
Customer_key Identifier for the customer that this relates to. Allows reporting by customer
Audit_key Identifier for the process that inserted this row
Rec_create_dt Date and time this record was insert

 

The stored procedure call would look similar to this:

Exec bus_val_execute ‘product_delivery’, <audit_key>, <job_date>, ‘F’, ‘1’

Parameters:

           table_name – the table name
           audit_key  – the audit_key of the package that is executing
           job_date   – the date and time that the package started
           table_type – the type of table
                        F = Fact  (used to append the correct audit information)
                        D = Dimension (Look at the whole dimension)

           group_no – the query group number to be executed during this run
                        0 = do not execute this query
                        1 = queries in group 1 will be processed together
                        2 = queries in group 2 will be processed together
                        3 = queries in group 3 will be processed together

The first thing the stored procedure does is collect all of the available queries to be executed for a given table/table_type/group_no into a temporary table.

The next thing the stored procedure does is (from the audit data stored during load-time) determine which audit_key effected the data for the job-date being processed.  (That audit_key is then added to the where clause of the stored query prior to execution)

The solution

Dynamic business validation is built upon a flexible framework in which the validation queries are stored in the database, selected at execution time by a stored procedure, validation issues are stored as messages in the database that are then reported on.  Assume in our example, that the delivery date cannot be in the future, and must have happened within the last 60 days the query that is stored would look similar to this:

SELECT 230001 as query_id,
 count(*)as validation_count
  FROM product_delivery a
WHERE delivery_dt <= getdate()
  AND delivery_dt >= (getdate() – 60)

Automatically appended by the stored procedure to the query is the audit, and job date information to identify the rows loaded today (load process auditing should be a standard practice in loading datawarehouse data)

This query, if it finds any data, stores the results in the business validation results table.  In our example, we have received a date into the warehouse that is a year in the future, although a valid date, clearly violates the business validation rule, and would be trapped, reported on, and corrected in the source systems responsible for the initial capture of this information.

This achieves a few things

  1. Before a report with an invalid KPI is produced, the issue is identified, reported on, and automatically communicated to the proper people in the organization.
  2. Before business decisions are made, the business rules are automatically applied to the data.
  3. Data Quality is improved, and the trust in all of the underlying data is improved.  If there is a trust that the data is being validated, the trust in the data will increase.
  4. Opportunities to build better edits in source systems is created.  Clearly if a date in the future is not valid, is should never have made it into the originating systems.
  5. Provides a repository of business validation issues.

Database Driven Aggregations

Introduction
Have you ever had a series of aggregate (tsql) queries to run?  Do you want to be sure they all run in the right order?  Do you want to make the changes “dynamic” (no code changes required)?  Certainly you can build an SSIS package that calls sql with the control-flow and control dependencies, however, each new query requires changes to that package.  This solution has a single proc-call (at the right point in the control flow), then hands control to the database.  This allow you to add new queries, change the order, disable, enable new ones, etc.

Setup
A few things to get setup before we start:
1) Table Creation for aggregation process — Optional table creation for logging
2) Proc Creation – This proc will require some modification for your specific environment.  There is logging of the process that takes place (in the optional table script), and some database links that you will need to adjust according to your specific needs.

Usage
Now for the easy part!  Insert into the aggregate driver a row for each query (primary) query you want to run, there may be sub-parts to the query, that is handled in the next table.  This table is the “driver” of the process, has the descriptive info, and allow for “active/inactive” setting.
Aggregate_Query_ID - Something that makes sense to you (I like logical operations together in groups of 1000)
Aggregate_group_ID – What you call the proc with (everything that is active for this group will run – in the order they are in the table)
Aggregate_freq_id – Future enhancement for Weekly/Monthly executions
Aggregate_Desc – What is this query, just a description
dim_audit_pkg_name – audit, what process is this query tied to (optional)
Source_Table_Name – what is the primary table for this query (descriptive, optional)
Source_Column_Name – what is the primary column for this query (descriptive, optional)
active_ind – Bit 0=off 1=on
aggregate_server – target server where the output of process will go
aggregate_database – target database where the output of process will go
aggregate_owner – target schema where the output of process will go
aggregate_table_name – target table where the output of process will go

Next, insert into the aggregate query table.  This is where the actual work goes.
aggregate_query_id – ties this group of queries to the aggregate driver table
aggregate_sequence – for queries that require more than 1 step, this is the order those queries will execute
query_desc – description (optional)
query – tsql valid statements to be executed

exec dw_dbo.aggreage_master_package 3  –this would execute the proc running all queries defined in aggregate_group_id 3 (this is the call in the SSIS package)  From here on out, any addition, change, inactivation to queries running in group 3 will be done in the database.

Example
You have 2 different processes, each process has multiple queries, and each query has multiple parts.

agg_qry_id agg_grp_id agg_freq_id aggregate_desc dim_audit_pkg_nm source_tbl_name   active_ind agg_server agg_database agg_owner agg_tbl_nm
1005 1 1 Example aggregate 1 example_pkg orders   0 maggie dm_reporting dw_dbo dw_agg_order
1006 1 1 Example aggregate 2 example_pkg orders   1 maggie dm_reporting dw_dbo dw_agg_order
1050 1 1 Example aggregate 3 example_pkg orders   1 maggie dm_reporting dw_dbo dw_agg_order
2005 2 1 Example aggregate 4 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer
2006 2 1 Example aggregate 5 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer
2007 2 1 Example aggregate 6 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer

 

You see that when we call the proc with group_id “1″, there are 3 queries, the first of which will be skipped because active_ind is set to false.  When we call the proc for group 2 all group 2 queries will run.

Taking a closer look at the first query that will execute under group 1 (which is actually the second query 1006):

aggregate_query_id aggregate_sequence query_desc query
1006 1 Delete data from staging table on server Bart (stg_dw_agg_order) DELETE FROM bart.dw_stg_db.dw_dbo.stg_dw_agg_order
1006 2 Insert aggregated year 2010 data into staging table INSERT into osprey.dw_work.dw_dbo.stg_dw_agg_order (order_year, customer_id, state, annual_sales) select year, customer_id, state, sum(annual_sales) where year = 2010 from dw_dbo.orders group by year, customer_id, state
1006 3 Insert aggregated search year 2010 data into dw_agg_order BEGIN TRAN; DELETE FROM dm_reporting.dw_dbo.dw_agg_order WHERE year = 2010; insert into dm_reporting.dw_dbo.dw_agg_order with (tablock) (year, cusomter_id, state, annual_sales from bart.dw_staging.dw_dbo.stg_dw_agg_order; COMMIT;

 

There are three parts for this query
1) clears the data from the staging table on the staging server
2) run the query to collect the 2010 data into the staging server table
3) remove the existing aggregate rows for the 2010 customers, and then replace them with the new data (this is done in a transaction, with tablock on purpose, if someone is querying the table this query will wait to obtain the lock, if this query is running, there are no other queries allowed, they will block and wait).  You don’t want someone getting a report of 2010 with no data…
 Once query 1006 is done…(meaning all 3 parts) the next query 1050 (which can have multiple parts as well) is then started.  Separating things into logical groups will allow you to have multiple calls to the proc run from SSIS at the same time, achieving some parallelism.  Once a year is done, you can simply turn it off, and add a new row for the next year, this allow for a re-run of historical aggregations (should there be a data issue that requires updates/changes to the aggregation, or simply an additional field in the aggregations) by setting the active indicator back on.

Help – My disk performance VI

Introduction
Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  
Help — My disk performance V   (Monitoring)  

This post will wrap-up some additional monitoring topics, and a few other odds and ends leftover from the other posts in this series.

Jumping right in, there are a few quick-checks relative to IO utilization that you can see from the database side using DMV’s.

Two Quick-Check DMV’s
sys.dm_io_pending_io_requests — If you have a large quantity of pending IO, this is an early indicator that you will see other metrics above the allowable limits for you disk configuration.  This is by no means a guarantee that you have an issue, however, is a good (quick) sanity check.

sys.dm_io_virtual_file_stats — I started the discussion about hotspots and how files are filled by SQL Server back in part IV of this series.  This view provides lots of insight into how things are being allocated on a per-file basis.

This sql is intended as a starting point … not a final solution, your usage will depend largely on how many databases you want to look at, how to report the data, etc.  The intent here is to show some output from a very useful and versatile DMV.

1) Create a table to grab the results from when done
CREATE TABLE #tmp_file_io
   ( dbname        VARCHAR(256),
       type_desc     VARCHAR(50),
       file_name     VARCHAR(256),
       read_xfer_ms  INT,
       write_xfer_ms INT   )

2) setup some variables
 DECLARE @count     INT,
                     @num_files INT
SET @count = 1

3) How many files do we need to look at?
 SELECT @num_files = (SELECT Count(*)
                     FROM   sys.database_files)

4) Go collect the data
WHILE @COUNT <= @num_files
  BEGIN

      INSERT INTO #tmp_file_io
                  (dbname,
                   type_desc,
                   file_name,
                   read_xfer_ms,
                   write_xfer_ms)

      SELECT Db_name(6)    AS dbname, <–obvious hardcode here, if your looping over databases, etc
             b.type_desc,
             b.name,
             io_stall_read_ms / num_of_reads   AS ‘disk read xfer ms’,
             io_stall_write_ms / num_of_writes AS ‘disk write xver ms’
      FROM   sys.Dm_io_virtual_file_stats(6, @count) a
             INNER JOIN sys.database_files b
                ON @count = b.file_id

      SET @count = @count + 1
  END

As a result of this, you will have a table with the IO rates by file.  You will be happy that you followed some sort of naming convention when you created the files, because this will quickly point out issues within a filegroup caused by autogrowth of files, or uneven allocation to files.  You don’t want to have gone through all the efforts to balance IO across 8 files, and then, by way of uneven allocation, have all the IO concentrated on a single file.  Again, this query (by itself) does not tell you if you have a problem, you need to review this for trends as the data is collected from instance start.  If you know you have queries that regularly access multiple years of data, across a large cross section of data, you would expect to see reasonably even distribution of IO across all the files in the filegroup.

There is certainly more detail, more script and more to write on this topic, and at some point, I am sure that I will continue with this series.  For now, I am going to wrap this one up.  There are a few things that I mentioned as I went through this series, and will be posting on all of these in the not too distant future. 

The “outstanding” topics that I touched upon, but will be covered in the future include:
–Indepth look at compression
–Resource Governor – How to
–Automated index reorg/rebuild
–Query tuning and explain plans
–Memory configuration, and considerations
–SSIS… modular design, dimension loading, etc.

Help – My disk performance V

Introduction
This is the fifth post in a series on disk performance pulled it together quickly to be included in some blogging fun for “IO, IO, It’s off to disk we go!”  on T-SQL Tuesday #004
Help — My disk performance      (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help — My disk performance III  (Partitioning and compression)
Help — My disk performance IV (filegroups and files)  

We have covered a lot of territory in the previous 4 posts, hopefully you have an understanding of where you want to be.  In this post I start moving away from the physical layout towards monitoring.  I can’t say it strongly enough, you must monitor you servers, don’t believe someone is going to watch for you, and even if you have people with the responsibility of watching… you need to watch anyway.  Several reasons for this, first off, your responsibility is the Data Warehouse.  I am not speaking badly of other groups monitoring systems, but they have their hands FULL, and may very well have a different tolerance for unacceptable performance due to the number of systems they are responsible for and each systems unique needs. Secondly, each Data Warehouse has a “feel” of its own that you will begin to understand with consistent monitoring.  You will get to know your normal performance levels, and at a glance at a graph be able to know someone has a query that needs tuning, or an index opportunity.  The monitoring is where the “art” and “science” begin to intersect. (Query tuning… a topic for another time) is the other place where “art” and “science” cross paths.   

At the most basic, monitoring for disk-IO starts with server monitoring from the windows perf-mon tool.  Start a collection process on the server that writes to a new file every 24 hours, in this collection (at a minimum) collect the avg disk queue (for each drive), avg disk reads per sec (for each drive), avg disk writes per sec (for each drive), avg cpu (total is fine).  There are other measures that are worthwhile – sql memory, kernal memory and quite a few others…. but this topic we are looking a disk-IO… trying to stay on topic (and will talk about monitoring tools that go beyond perfmon shortly).  

Now that you’re collecting data, you also need to know when things are happening on your server.  At the highest level:  

  1. When does your index maintenance run, and for how long? (you have automated index reorganize/rebuilds based on percent fragmentation right?!?… also another topic for later)
  2. When does your backup run, and for how long? (are you using SQL Server 2008 backup compression? … you should)
  3. Do you have other maintenance that is regularly scheduled (you’re doing a weekly dbcc checkdb right?)
  4. When does your normal nightly data-load start, and how long does it normally run?
  5. Do you have scheduled reporting? (monthly reports, thing that run off-hours due to their size etc)
  6. Do you have reporting that is triggered by the completion of a load?
  7. Do you have resource group allocations scheduled to change based on time/load etc? (Are you using SQL Server 2008 resource governor? …also another topic for later, but you should)

Loading up a representative day (don’t start this exercise with a Saturday if that is your least utilized time, nor should you start by looking at Sunday if that is the day you have a checkdb running for 10 hours) into the performance monitor tool to review the results.  Ultimately, you need to understand all of the time periods, including maintenance windows, but for now, lets start with a normal representative day.  

Average disk queue
Assuming you have 40 drives for each of you data drives, then the simple answer is, your avg disk queue of 2 per drive should stay bellow 80 on each of your data drives.  Don’t worry too much at this point about a spike above 80… if you sustain a queue above 80 you will have performance issues.  

It is possible to have a disk-queue reported by perf-mon on the windows side, and the disks be idle, if you are seeing disk queues, you should get your SAN administrator involved to look at the disk statistics for the LUN in question on the SAN.  IF you have no disk pressure in the SAN, and the queue is high in Windows, you have HBA latency that needs to be addressed.  

HBA Latency 
 This is really unlikely – even with a dual-port 2gb card this is not likely the slow part,  with a dual-port 4gb card it is very very unlikely. Most likely if you are seeing latency on the HBA there is something that is set incorrectly, or needs a firmware upgrade.
If you are actually seeing latency:
–Validate your port speed setting for both the card and the SAN switch (I have seen a switch port set to 100mb, you WILL have a performance impact)
–Validate your HBA firmware versions and compatibility with SAN OS versions, your SAN vendor will have very specific recommendations.
–Validate the SAN connections to the switches

Average reads/writes
Assuming you have 40 drives for each of your data drives, the simple answer is, the reads/writes should stay below 180-200 per drive.  This allows for 7200 – 8000 IO’s per second on each drive.  Again, don’t worry about spikes above this level unless they are sustain, or very close together. 

Average CPU
This has less to do with disk-IO, and more to do with overall performance measures.  An important thing to understand, and provides a large amount of insight into the overall responsiveness of your server.  Get to know what your normal CPU profile looks like in the different points of time I mentioned above. 

Examples
This chart shows a 9 drive array that has issues.  A little hard to read here, but the queues (blue) are WELL above 18 and the avg read/write activity is WELL above the 1600 – 2000 level.

This chart shows a 40 drive array that is performing within acceptable tolerances.  These queues show some spikes in activity, still remaining bellow the allowable 80 for this configuration, and going back to an earlier point about knowing what happens when…. those are all during maintenance and backup windows.

Help — My disk performance IV

Introduction
This is the fourth post in a series on disk performance
Help — My disk performance       (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help– My disk performance III  (Partitioning and compression)

Now that we have partitioning covered, filegroups should actually be relatively simple because it just becomes a physical discussion about the logical splits we just arrived at.  Just to restate the obvious, PLEASE come up with a naming convention for your filegroups, file-names, and directory structures. 10 minutes (literally) of thought about this up front will make monitoring, scripting, moving, etc MUCH easier later.

Data Files
The Windows OS operates on file-handles, and depending on the number of processors and read/write to more than 1 file, you will get many opinions on this one, and there are some “it depends” sort of answers…. what I have found through testing with 8 processors and the disk configuration discussed to this point in this blog, 1 file per processor for data files +1 file. (Log files in a moment)  More than 8 files in the filegroup seemed to add little measurable value, the 9th file is my insurance policy for unexpected growth.  Understanding how Sql Server writes to the files is also important, the engine will do it’s best to keep the amount of data equal in all the files in a filegroup.  So having a file-group with 7 files at 75% full, if you add a 8th file all the new data will go to the 8th file till it is 75% full, effectively removing the benefit of multiple files and creating a hotspot on that file.  (I will have more specifics, DMV’s, etc, for monitoring in a later post).Some basic rules:

  1. Pre-allocate space to all 8 datafiles
  2. Do not allow autogrowth on the 8 main files in the filegroup.
  3. Set the initial size small on the 9th file, and allow autogrowth to some number your’re comfortable with (I allow the ‘extra’ file to have growth to the same size as the primary files) I don’t like the autogrowth, but I like things failing with space issues less.
  4. Monitor the extra file for growth…if it is growing, you messed up your space projections, and you need to add space to the other 8 files.
  5. It is best to get the calculations correct, and preallocate the 8 files, this will prevent physical disk fragmentation as the files are added to.
  6. Monitor disk fragmentation on the server.

 Primary – personally, nothing against primary, but I put nothing into primary, and monitor what is in that space.  IF something went to primary, someone (probably me) built something without thinking all the way through.  I would allow for 8 data files preallocated to a reasonable size relative to your other dataspaces.  IF something goes into primary, you don’t want things failing or performing poorly, you can move things around in the background if something that goes there needs to be kept.

Expanding upon a diagram we looked at earlier, you can see the number of filegroups, and their relationship to the physical disks

Beyond the annual partitions, you see a filegroup for “Dim Data” (dimension data), “Dim Idx” (dimension indexs) “Data 1/2/3″ (generic data spaces)  You are going to have some data that does not belong to dimensions, and is not partitioned.  Hopefully all of the large stuff is partitioned, but, you need to have a non-partitioned home for some large tables as well.  If you KNOW two of these large tables are always used together, put them in data spaces that are on different drives. “Index” (non-partitioned indexs).  Depending on the quantity of non-partitioned tables, you may want to consider an 2″ space to allow for separation of the index IO from the table IO. 

Log files
Basically the same principles apply to log files (remember, they are going to their own drive).  Through my testing, due to the high level of activity,  logging can benefit from additional files (3 per cpu).  To avoid the over head associated with growing log files set them to their maximum size and do not allow for autogrowth.  Because there is no autogrowth on these files, you need to ensure you have sufficient log space available to manage data loading and queries, as well as monitoring in place to detect logs filling.  In addition to these file management recomendations, logging for a Data Warehouse should be set to simple mode. (Backup and recovery design for a Data Warehouse is another topic).

Naming Conventions
Each data drive should have a sqldata directory followed by a directory for each database (please don’t be one of those people that drop a few hundred files from all the databases on a server in one directory).  Beyond that, each filegroup should get a name.

J:\                                                                      <—- Data Drive
      SQLDATA                                                  <—- SQL Data directory (this will matter when you have ssas on the same drive as data) or sqllog
            Example_DB                                      <—- Each database has a directory
                  Example_db_data_2005_1  <—- Name each file – dbname + file type (data/log) + year + number
                  Example_db_data_2005_2
                  Example_db_data_2005_3
                                ……
                  Example_db_data_2005_7
                  Example_db_data_2005_8
                  Example_db_data_2006_1
                             …….

 As you can well imagine, there are going to be quite a number of files, so maintaining order within the naming will give you the ability to script, and otherwise programatically monitor things.  This will also provide a very clear delineation of which data resides in which file.  When it comes time to move the oldest data to a different drive, it is as simple as issuing the command in the database to provide the new location of the file, shut-down the database, physically copy the file to the new location, restart the database.  This will become important once your primary “fast” drives start to fill, and the old (least used) data can be migrated to larger (slower), less expensive LUNS.

Help – My disk performance III

Introduction
This is the third post in a series on disk performance
Help — My disk performance       (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)

 Now that we have covered a large portion of the physical disk (leaving aside physically aligned partitions for a moment), I am going to jump over the next layer in the stack which would be filegroups and files to talk about partitioning  and compression first.  The logic of the filegroups and files will be easier to cover after this discussion.

Sticking with the basic layout that we arrived at earlier with two data drives, and understanding that we have a range partition scheme available to us (I wish we could subpartition as in Oracle… maybe someday), but given the range you need to decide what is a logical split.  The split needs to be logical to you… as well as your users… partition elimination in queries is a powerful thing (throw in compression, and start scanning all partitions because you picked something not logical to your users it is equally powerful in a bad way).  Making a general statement, in a data warehouse you can generally find a way that annual partitions make sense to everyone.  It might be calendar year, it might be fiscal year, it might be a sales year ending  in August… but generally every business will have some sort of logical annual cycle.  Given the fact that your building a warehouse, your likely interested in looking at comparing data between last year, and how performance is this year…. multi year trends etc.

Jumping right in, to create a partitioned table, you need to have the partition function, which defines your ranges, and a partition scheme which creates the connection between the file-group and the partition values that are to go into that range.  For your own sake, come up with a naming convention that makes sense.  This seems like an obvious point, but if you have an annual range for 2005 data and the filegroup it is associated with is named database_name_filegroup_6, (which then has files associated) you will (look at my picture) pull out your hair.  Once we get to filegroups, number of files etc I will go over this in more detail.  For now, it is sufficient to say that there are file-groups for each of the years, and each filegroup is associated with a given range.

CREATE PARTITION FUNCTION [Annual_Partition_PF1](int) AS RANGE LEFT FOR VALUES (2005, 2006, 2007, 2008, 2009, 2010);

CREATE PARTITION SCHEME [Annual_Partition_PS1] AS PARTITION [Annual_Partition_PF1] TO ([data_2005], [data_2006], [data_2007], [data_2008], [data_2009], [data_2010], [data_3]);

 

Tying this partitioning discussion back to the disks that we talked about previously, each partition (by way of file-group) is tied to a drive letter, which is one of the two data luns from the SAN.

Thus, if you are requesting data to compare 2010 and 2009 data you are accessing data (in parallel) from two different drives, which are connected to two different storage processors, which are connected to 2 different LUNS…load balanced across two switches connected to your dual-port HBA. I hope you’re starting to see this relatively clearly.

  

   

Compression: I  (at some point) I will go into greater depth on this topic alone, but for our purposes here, keeping it in the world of disk IO performance… proceed (with measured caution) but please do yourself the favor of working through the process of getting things compressed.  There are some gotchas that can tank your query performance, but the upside is HUGE.   Imagine each read of an 8k page brings back 16k of data, now we are getting disk IO performance!

 ALTER TABLE [responder] REBUILD PARTITION = 5 WITH(DATA_COMPRESSION = PAGE )

  

  

The space savings are obvious, and as we discussed with RAID-10 configuration, every MB of actual storage requires 2 MB of physical disk to allow for mirroring. If you can save 300Gb you have realized quite a substantial savings in disk.

The downside… DO NOT (at least not with regularity) update/delete/insert to the compressed partitions.  You WILL pay a penalty large enough that you will notice.  (I have seen cases on large tables with compressed index’s, etc, where the compression can cause a 4x slower response for update/inserts/deletes).

Remembering that we are talking about a warehouse layout, it is relatively safe to compress the data that is a few years old.  IF on the outside chance there is a historical update to the data it will run longer… but the space savings, and read rates are well worth the tradeoff.

Follow

Get every new post delivered to your Inbox.

Join 373 other followers