SSIS Design Tip – Build For Restartability

Introduction

One of the most important things that you can do as you begin a development effort is to develop a common, unified set of standards.  In prior posts I have stressed the importance of naming conventions for database files, as well as connections in the packages.  This is less about the naming of the objects, but rather the consistency, supportability, and modular design that SSIS allows which in turn creates a restartable product.

I do not run a package in production that cannot be started “from the top” by itself.  There is quite a bit of thought and effort in some cases to ensure data consistency, proper cleanup of work tables, etc, however, in every case that work has paid off (multiple times over).  There is nothing worse (and more dangerous) than to have to do research to figure out what part of a package needs to be disabled, which work tables need to be manually cleaned up, what data needs to be backed out before a restart…. generally when those things have gone wrong, a package has failed, it is not the time to try to remember what you were thinking when you (or someone else) was thinking when the package was built.

Package Design

Every package, dimension load, fact table, post-load processing, does not matter…. they ALL look exactly the same.  There are 3 parts.  Part 1 consists of establishing the audit-key in the audit table (there will be another post on SSIS load audits), determining some audit (pre-load) information about the table that is about to be affected.  All of part-1 is driven by package variables, so there is no hard-code in this section.  In addition some packages will have specific cleanup processes before part-2 which is the ETL.  In the cleanup, you need to cleanup from the staging table (if you are going to re-stage the data), cleanup from the target if the ETL does not deal with existing data.  This is the section that requires the most thought…. think about what your package needs to do should you run out of space in your target, you loose database connections to remote servers, any series of things that can go wrong…. should you need to start this package from the top, what data needs to be cleaned so the ETL can run cleanly.  Part-3 is the post-load audit wrap-up.  Again, driven by variables, after the load audit information is collected and the audit-rows is closed out with a completion time as well as statistics about that load.

Conclusion

There are many advantages to consistent package design, not the least of which is forcing the issue of restartablility.  It is important that each modular piece of a load process is independent and functions as a standalone process.  True, there are dependencies that need to be met between load processes,however, beyond those dependencies, the processes should each handle their own logical part of work.  1 package = 1 dimension load…. or 1 fact load etc.  resist the urge to combine too much into a single package, restart is important, and if you load does too much in a single package, restart becomes difficult.

Warehouse Design Tip -Schema’s

Introduction

How many times have you wished you could do a table rebuild during the day…. create a new calculated column (in seconds)…alphabetized the column names in a table (yes I really had requests to do this for an external tool)…. hide “unnecessary” columns from users such as checksums, audit_keys etc… code a particular table with (NOLOCK)…. security for specific objects…. etc…??

If you have wanted to do any of those, you should be using views to expose the data in the warehouse.  Because you have built a schema that contains the actual tables, your primary user views can be the default dbo owned objects, and you select the schema that the users utimatly use through the view.

An example:

Recently, due to reporting requirements, it was necessary to alter the description of a group of products, however, the source could not be changed to reflect the change in names.  Rather than build a lookup table, and translation process in the ETL load, I let the undesirable descriptions load, and built an override in the view.  In this case I did need to alter the dimension to add a column (preferred_product_type_desc) that allows nulls. The dimension load process does nothing with the new column, it is not included as comparison for type-1 changes.  Notice that the table is created in a schema.

CREATE TABLE [dw_dbo].[all_products](
   [product_type_key] [int] IDENTITY(1000,1) NOT NULL,
   [product_type_desc] [varchar](50) NOT NULL,
   [preferred_product_type_desc] [varchar](50) NULL,
   [product_category] [varchar](50) NULL,
   [product_type_id] [int] NOT NULL,
   [audit_key] [int] NOT NULL,
   [rec_mod_dt] [datetime] NULL,
   [rec_create_dt] [datetime] NOT NULL
) ON [Primary_DIM_IDX]

Next, create the dbo view (so when the users select product_type_key… from all_products they are actually selecting from the underlying schema object).You will also see a few key things in this view:

  1. There is a case statement that exposes the altered description (if there is one) for this particular field.
  2. The load audit information (that the users don’t need) is not exposed.
  3. The nolock hint is applied in the select.  You don’t want a long-running user query to prevent the load from processing data, and you certainly should not hope (or require) the user to code the hint into the query.
CREATE VIEW [dbo].[all_products]
AS
   SELECT product_type_key,
          CASE
             WHEN preferred_product_type_desc IS NOT NULL
                THEN preferred_product_type_desc
             ELSE product_type_desc
          END AS product_type_desc,
          product_category,
          product_type_id
   FROM   dw_dbo.all_products WITH(nolock)

The schema approach will also allow you to make a temporary copy of the existing table, point the view to the temporary copy while the existing table is altered, repartitioned, moved whatever the maintenance is…. once the maintenance work is complete, the view is pointed back to the permanent table.

Conclusion

Do yourself a favor, and build a layer of insulation between your data  and the outside world. User  and tools have unique requirements, and creating this separations will allow for flexibility, customization, and tool-specific needs that you will be thankful you can implement with such ease!

Follow

Get every new post delivered to your Inbox.

Join 373 other followers