SSIS – Load Audit

Introduction

There are many reasons to put some thought into an audit process for your load jobs.

  1. Tie the insert/update of a row to a specific SSIS package.
  2. Provide consistent date-stamp times for a given load (I will explain this in a moment).
  3. Ensure all the data staged actually made it to the target.
  4. Monitor load times, and easily find packages (or tables) that need tuning.
  5. Provide input to load jobs for selecting data from the source (you know the last time you ran successfully, select data that changed since then in the source… that is a topic for another post)
  6. Provide SLA reporting information on load-times.
  7. Probably a few other things that you will discover as you implement this.

If well thought out up-front, the additional major benefit of a good audit process is a cookie cutter approach to building load jobs.  They ALL do exactly the same thing, in exactly the same way using the same variables, and setup.  (The details of this is for another post, but I will cover the generalities here)

The Setup 

CREATE TABLE [dw_dbo].[ld_control_audit]
   (
       [audit_key]                 [INT] IDENTITY(1, 1) NOT NULL,
       [parent_audit_key]          [INT] NOT NULL,
       [tbl_name]                  [VARCHAR](50) NULL,
       [pkg_name]                  [VARCHAR](50) NULL,
       [exec_start_dt]             [DATETIME] NULL,
       [exec_stop_dt]              [DATETIME] NULL,
       [extract_row_cnt]           [BIGINT] NULL,
       [staged_row_cnt]            [BIGINT] NULL,
       [insert_row_cnt]            [BIGINT] NULL,
       [update_row_cnt]            [BIGINT] NULL,
       [error_row_cnt]             [BIGINT] NULL,
       [tbl_initial_row_cnt]       [BIGINT] NULL,
       [tbl_final_row_cnt]         [BIGINT] NULL,
       [tbl_max_surrogate_key]     [BIGINT] NULL,
       [successful_processing_ind] [CHAR](1) NULL
   )

Once you have the table, the view (in the default dbo schema) also calculates the execution time and executes with a NOLOCK, you want to be sure there is no locking by your monitoring processes on this table.

CREATE VIEW [dbo].[ld_control_audit] AS
   SELECT audit_key,
          parent_audit_key,
          tbl_name,
          pkg_name,
          exec_start_dt,
          exec_stop_dt,
          DATEDIFF(MINUTE, exec_start_dt, exec_stop_dt) AS runtime_minutes,
          staged_row_cnt,
          extract_row_cnt,
          insert_row_cnt,
          update_row_cnt,
          error_row_cnt,
          tbl_initial_row_cnt,
          tbl_final_row_cnt,
          tbl_max_surrogate_key,
          successful_processing_ind
   FROM   dw_dbo.dimaudit AS dimaudit_1 WITH (nolock)

Usage

Now that you have the basic table and view, the first thing your load will do is insert a row for the “Master Package” , which will create a new with a -9999 parent_audit_key to tie all of your “sub” executions to, and creates the exec_start_dt of this particular execution.

You now have an option…. you can select back the identity column you created, and pass it into each of your packages that you execute, or let each package lookup to get the audit-key that they will use as the parent_audit_key.  I DO NOT pass it in, but rather lookup at the top of each sub-package.  Reason 1, I execute all the sub-packages as out-of-process, to obtain a separate memory space for each of the executions, as I am running MANY processes in parallel.  Reason 2, restartability, if each package functions as an independent unit of work (including looking up the parent_audit_key) if a package fails for any reason, it can be re-run by itself without having to invoke the master-package, or manually adjusting variables etc.  If you are using  checkpoint files for restart this still plays well with that as the existing variable states (including audit_key assigned to the package, as well as parent_audit_key, record counts etc) are preserved for restart.

 

 If you have been reading any of my other posts, by now you have figured out I am “stuck” on naming conventions and consistency, but if you open ANY load package, they all look the same, so lets walk through the parts.

  1. Selecting the parent audit key from the audit table (max key for “master” for successful_processing_ind is null)
  2. Insert a new row into the audit table (VARIABLES for everything in the insert)…This is cut-paste for any package.
    There is a second sql statement that then selects the max audit-key for the @tablename that we just inserted a row for (this is what we will stamp on ALL rows affected by this load)
  3. This will select the package start-time from the audit table row.  This will be used for the rec-mod/rec-create dates in ALL rows affected by this load.  We are effectively going to tie all of the data in the warehouse back to a specific audit row, and a specific execution time.  If data needs to be backed out due to source issues or otherwise, all rows affected by the run, all have the same time.
  4. Using a scripting task, (same result as using an expression build in bi-express product if you have that).  This task sets SQL statement variables that will be executed before (and after) the load.
  5. Executes the count sql statement just built by the scripting task by using the table-name variables in the package.
  6. Executes the select max surrogate key sql stmt built by the scripting task using the table-name (or if this is a fact table, we select the max audit-key)… this is to capture with absolute certainty the starting point of the load
  7. This is the ONLY unique thing in this package.  EVERYTHING else is driven by the variables and dynamic SQL.
  8. Executes the count sql statement just built by the scripting task by using the table-name variables in the package.
  9. Update stats (this is for dimension loads only) the fact table stats/index’s are maintained by a nightly script, that is too large to run at load-time
  10. Update the audit row, set the end-time, and success-ind

Conclusion

There are many reasons for consistency… audit and maintainability is a major benefit.  The audit will produce a row in an audit table that you can tie ANY row in the warehouse (dimension or fact) back to the specific execution of a package.  The audit table will give you execution times for packages (which is very useful to monitor ongoing load-times and bottlenecks).   It may seem like a lot of work up-front (it really is not), and since all of the infrastructure is driven by dynamic sql, and variables, there is nothing (other than changing the values in variables) that needs to be created to maintain a consistent audit process.

SSIS Design Tip – Build For Restartability

Introduction

One of the most important things that you can do as you begin a development effort is to develop a common, unified set of standards.  In prior posts I have stressed the importance of naming conventions for database files, as well as connections in the packages.  This is less about the naming of the objects, but rather the consistency, supportability, and modular design that SSIS allows which in turn creates a restartable product.

I do not run a package in production that cannot be started “from the top” by itself.  There is quite a bit of thought and effort in some cases to ensure data consistency, proper cleanup of work tables, etc, however, in every case that work has paid off (multiple times over).  There is nothing worse (and more dangerous) than to have to do research to figure out what part of a package needs to be disabled, which work tables need to be manually cleaned up, what data needs to be backed out before a restart…. generally when those things have gone wrong, a package has failed, it is not the time to try to remember what you were thinking when you (or someone else) was thinking when the package was built.

Package Design

Every package, dimension load, fact table, post-load processing, does not matter…. they ALL look exactly the same.  There are 3 parts.  Part 1 consists of establishing the audit-key in the audit table (there will be another post on SSIS load audits), determining some audit (pre-load) information about the table that is about to be affected.  All of part-1 is driven by package variables, so there is no hard-code in this section.  In addition some packages will have specific cleanup processes before part-2 which is the ETL.  In the cleanup, you need to cleanup from the staging table (if you are going to re-stage the data), cleanup from the target if the ETL does not deal with existing data.  This is the section that requires the most thought…. think about what your package needs to do should you run out of space in your target, you loose database connections to remote servers, any series of things that can go wrong…. should you need to start this package from the top, what data needs to be cleaned so the ETL can run cleanly.  Part-3 is the post-load audit wrap-up.  Again, driven by variables, after the load audit information is collected and the audit-rows is closed out with a completion time as well as statistics about that load.

Conclusion

There are many advantages to consistent package design, not the least of which is forcing the issue of restartablility.  It is important that each modular piece of a load process is independent and functions as a standalone process.  True, there are dependencies that need to be met between load processes,however, beyond those dependencies, the processes should each handle their own logical part of work.  1 package = 1 dimension load…. or 1 fact load etc.  resist the urge to combine too much into a single package, restart is important, and if you load does too much in a single package, restart becomes difficult.

SSIS – Package Configurations

Introduction

SSIS provides a robust method for making packages portable between environments such as Development –> Testing –> Production.  This post will hopefully get you started with the incredible number of things you can control, scale, and make your packages portable.  I am focusing mainly on password management, however, you will also see a variety of other items that can be set in configuration files.  There are two places you can store your configuration data, first being in the database, the second being an xml file.  There are advantages to each, I have chosen to use an XML file to isolate all of the password info into the file, and not rely on either windows authentication, or keeping a password in the package to obtain the remaining information from the database.

I have chosen to have a config file for each package.  When I began this with SSIS-2005 you could not have connections in an XML file that were not used, so a single all-encompasing file was simply not possible.  With SSIS-2008 (sp1) *prior to sp1 there are still issues with this erroring at runtime* you are able to create a single file with all of the connections, even if they are not used in that particular package.  You will get a warning that there are connections that are not used, but the package will execute.  Even though this functionality is now supported, I would still recommend a file for each package, this will allow for better control over some of the other items you will be setting in a config.  For example, if you want to change the “defaultbuffermaxrows” for a dataflow (which you will), you need to be sure that all of your dataflows have unique names if you use a common xml file, additionally, you can impact other packages if you make a syntax error when you make changes.  One xml file for each package allows for flexibility, isolation, and risk reduction. That is a good trade for the VERY small amount of extra work to create the individual files.

Make it all work (Connections)

If you have read any of my other posts, you will see that I am big on naming conventions, this is no exception.  Do yourself a HUGE favor, and come up with a standard, and PLEASE don’t include the server names in the name of the connection.  You (might) have different server names in development –> test –> production.  You (might) in development host a series of source databases in a single instance that in production are on different machines etc, please keep this generic.

I have used:
[type].[server].[database].[user]
type = Ado/Native/ODBC/File whatever your connecting to
server = logical server name such as dw_db (Data Warehouse Database) dw_etl (Data Warehouse ETL server) these might equate to DB links that you should keep generic as well.
Database = What your connecting to
user = who am I connecting as

This will allow you to understand everything about a connection without making assumptions.  This is good for both the connection manager, but more importantly when you are editing your XML file later.

Package Setup

At the package level, because I have chosen to host ALL of my password information in the xml file, I set the protectionlevel to ‘DontSaveSensitive’.  The default of ‘EncryptSensitivewithuserkey’ gets people into trouble…. if you imbeded the password into the connection with that setting… YOUR windows domain account is the ONLY account that can access that password information.  When you migrate to production, and execute the package from a Job (unless your account is running the database service) the job will fail because the password information is encrypted and cannot be accessed by that account.

Package Configurations

Right-click in the empty space on the control-flow surface, and select package configurations…. you will get a wizard to walk through the rest of the steps to build a config file.

 

This screen you select the XML (or database) configuration, and the file-name (or variable) for the configuration file.  I have selected “N” as I have a local drive on my ETL server that the configurations are stored on.  This allows the configurations to be stored on the server with the production passwords, and the ONLY access to those files would be the network-admin that has access to the server.  Since my environments are consistent, the N drive is available I chose this method.  You can setup an environmental variable on the server that you could set to “N” on one machine and “X” on another, for my environment, the option shown works best for me.

Next, you can select what items are included in the configuration file, I simply select everything for the connection managers, this will allow you to configure your connection strings, passwords, user-id’s, server-names, everything you need for your connections.

Spend some time looking through the list of things you can customize (there is A LOT) and you can get pretty creative if you need to.  The other thing that I do is for each data-flow, I select the “buffertempstoragepath”, “Defaultbuffersmaxrows”, “defaultbuffersize”.  (There are a few others that may make sense….but for now lets keep it simple).  As you move from Development (which in my case is a desktop machine), to a testing environment (which is not sized for speed/performance), to production these values will change.  My default buffer settings are MUCH smaller on my dev machine than they are in production.  Another post, for another day I will talk about the tuning opportunities and how to calculate all this stuff…. but if you include it here, your package can automatically scale to take advantage of the hardware you are deploying it to.

Once you are done with this, you can name the config if you want to support environments for package generation, environment builds etc, you can leave it as configuration 1 to start with.  This will depend on how you deploy packages, and if you need to use the package deployment utilities, or if you manually deploy packages to each environment (which is what I do).

Now that you have an XML file…

  1. Open it in BIDS, select Edit/Advanced/Format Document
  2. Find the configuration strings for your connections (you will see sections that contain the names you have in the connection manager)
    1. Add Password=xxxxxxxx; (whatever your password is for the environment you are going to move this configuration to) for each of the connection strings
    2. Set the User ID=XXXXX; to the appropriate user-id (I have different users in dev/test/prod)
    3. Set the Data Source=xxxxxx; to the appropriate server
  3. In the same section, you will see additional properties
    1. Properties[Password]  (set the configured value)
    2. Properties[ServerName] (set the configured value)
    3. Properties[UserName] (set the configured value)
  4. Set the values you wish for Buffertempstoragepath  (This is where sorts happen when you run out of memory space)
  5. Set DefaultBufferMaxRows (according to which environment you’re moving this file to)
  6. Set DefaultBufferSize (according to which environment you’re moving this file to)

Save the XML file… and you are ready to go…. the location you picked in the second step of the wizard is where this file needs to be, and will be picked up at runtime to obtain the password info (and anything else that you override).

SSIS Checksum – a must have skill

Overview

When loading from source systems that don’t yet have the SQL Server 2008 Change Data Capture (CDC) enabled, using a checksum calculation to determine if data has changed is a critical Data Warehouse load concept to understand.  This will allow you to read the source, that can have dozens of columns, and with a single lookup, and a single compare (of a integer value) and determine if data within that vast source row has changed.  Not to mention speed…. this method is much faster than a traditional compare of many fields. I have build a very simple example (3 columns in the source), will check if the data has changed in 2 of those columns, and perform the lookup/insert logic based on the outcome.         

Setup

You will need to install the checksum transformation[edit] –> as a result of this blog post, there were some comments about another module to create hash-values, also a worthwhile component to look at! Available from codeplex. <– [edit] I have included the sample tables, and data here.         

The package

              

  • OLE DB Source – reads the src table
  • chksum transformation calculates the checksum
  • derived column is setting rec_create_date and rec_mod_date with a getdate()
  • Lookup is using name to find the row in the target, if it does not exist, insert otherwise return checksum from the database
  • conditional split, check to see if the calculated checksum (based on the source) matches the checksum from the lookup
  • OLE DB Command, update the row (the checksum’s did not match)
  • OLE DB Destination, insert a new row

       Now a look a little closer at the main components of this package.          The assumption here is that “fname” is the key value for the row, if that has changed, we want to insert a new row.  If there is any change to the web_address, or email_address for this row we want to update the row.  This will allow us to do a compare without having to worry case sensitivity, null values, varchar comparisons to multiple fields etc.  Also, when you add an additional field to the source, all you need to do is add to the check-box here, and automatically all rows will be tested with the new column.     The algorithm here is the default crc32, there is also a framework checksum (md5), and original.  I would generally use the md5 version (although I forgot to change the default in this example), as, particularly with large calculations it will avoid a possible collision in the number.             The lookup is based on first-name and returns the value for the previously calculated checksum into db_chksum.  Notice that the other fields from the databse are not being returned for the comparisons, we only need the checksum                 The ENTIRE row (in this case 2 fields, but there is no limit to this) is now checked with one simple compare of the newly calculated checksum, and the checksum returned from the database.  If nothing changed… nothing will be done…. if anything changed, update all the data in the row.                  The insert – simple (just note I do not map the rec_mod_date on the insert only the rec_create_Date) to allow us to understand when rows are inserted from the source vs updates.  Normally fact loads also have additional audit info, but that is another topic.                  The Update – also simple (note that we are updating the whole row including the checksum value) The rec_create_date is not touched here, but rather the rec_mod_date is update to indicate the modifications.                          

Putting it together

On the first run, the 1 row runs down the insert leg, and inserts a row with a checksum value of: 1409434363 If you run the package again, without changing data, nothing happens, because the calculated value on the next execution matches the value already in the database.        Then…       

UPDATE sxb_chksum_src SET web_address =
http://StefBauer.wordpress.com/&#8217;  

      
 

Notice the only thing changed is the capitalization of the web address, but that is enough to create a new checksum value.   Run again… and the checksum value from the input  is -252803585 which does not match the database value, and the row runs down the update leg. You will now see the rec-mod-date update, the source values are updated, and the new checksum is stored as well.     

 
 

 
 

         
 

  
 

     
 

  
 

   

Follow

Get every new post delivered to your Inbox.

Join 386 other followers