SSIS – Load Audit

Introduction

There are many reasons to put some thought into an audit process for your load jobs.

  1. Tie the insert/update of a row to a specific SSIS package.
  2. Provide consistent date-stamp times for a given load (I will explain this in a moment).
  3. Ensure all the data staged actually made it to the target.
  4. Monitor load times, and easily find packages (or tables) that need tuning.
  5. Provide input to load jobs for selecting data from the source (you know the last time you ran successfully, select data that changed since then in the source… that is a topic for another post)
  6. Provide SLA reporting information on load-times.
  7. Probably a few other things that you will discover as you implement this.

If well thought out up-front, the additional major benefit of a good audit process is a cookie cutter approach to building load jobs.  They ALL do exactly the same thing, in exactly the same way using the same variables, and setup.  (The details of this is for another post, but I will cover the generalities here)

The Setup 

CREATE TABLE [dw_dbo].[ld_control_audit]
   (
       [audit_key]                 [INT] IDENTITY(1, 1) NOT NULL,
       [parent_audit_key]          [INT] NOT NULL,
       [tbl_name]                  [VARCHAR](50) NULL,
       [pkg_name]                  [VARCHAR](50) NULL,
       [exec_start_dt]             [DATETIME] NULL,
       [exec_stop_dt]              [DATETIME] NULL,
       [extract_row_cnt]           [BIGINT] NULL,
       [staged_row_cnt]            [BIGINT] NULL,
       [insert_row_cnt]            [BIGINT] NULL,
       [update_row_cnt]            [BIGINT] NULL,
       [error_row_cnt]             [BIGINT] NULL,
       [tbl_initial_row_cnt]       [BIGINT] NULL,
       [tbl_final_row_cnt]         [BIGINT] NULL,
       [tbl_max_surrogate_key]     [BIGINT] NULL,
       [successful_processing_ind] [CHAR](1) NULL
   )

Once you have the table, the view (in the default dbo schema) also calculates the execution time and executes with a NOLOCK, you want to be sure there is no locking by your monitoring processes on this table.

CREATE VIEW [dbo].[ld_control_audit] AS
   SELECT audit_key,
          parent_audit_key,
          tbl_name,
          pkg_name,
          exec_start_dt,
          exec_stop_dt,
          DATEDIFF(MINUTE, exec_start_dt, exec_stop_dt) AS runtime_minutes,
          staged_row_cnt,
          extract_row_cnt,
          insert_row_cnt,
          update_row_cnt,
          error_row_cnt,
          tbl_initial_row_cnt,
          tbl_final_row_cnt,
          tbl_max_surrogate_key,
          successful_processing_ind
   FROM   dw_dbo.dimaudit AS dimaudit_1 WITH (nolock)

Usage

Now that you have the basic table and view, the first thing your load will do is insert a row for the “Master Package” , which will create a new with a -9999 parent_audit_key to tie all of your “sub” executions to, and creates the exec_start_dt of this particular execution.

You now have an option…. you can select back the identity column you created, and pass it into each of your packages that you execute, or let each package lookup to get the audit-key that they will use as the parent_audit_key.  I DO NOT pass it in, but rather lookup at the top of each sub-package.  Reason 1, I execute all the sub-packages as out-of-process, to obtain a separate memory space for each of the executions, as I am running MANY processes in parallel.  Reason 2, restartability, if each package functions as an independent unit of work (including looking up the parent_audit_key) if a package fails for any reason, it can be re-run by itself without having to invoke the master-package, or manually adjusting variables etc.  If you are using  checkpoint files for restart this still plays well with that as the existing variable states (including audit_key assigned to the package, as well as parent_audit_key, record counts etc) are preserved for restart.

 

 If you have been reading any of my other posts, by now you have figured out I am “stuck” on naming conventions and consistency, but if you open ANY load package, they all look the same, so lets walk through the parts.

  1. Selecting the parent audit key from the audit table (max key for “master” for successful_processing_ind is null)
  2. Insert a new row into the audit table (VARIABLES for everything in the insert)…This is cut-paste for any package.
    There is a second sql statement that then selects the max audit-key for the @tablename that we just inserted a row for (this is what we will stamp on ALL rows affected by this load)
  3. This will select the package start-time from the audit table row.  This will be used for the rec-mod/rec-create dates in ALL rows affected by this load.  We are effectively going to tie all of the data in the warehouse back to a specific audit row, and a specific execution time.  If data needs to be backed out due to source issues or otherwise, all rows affected by the run, all have the same time.
  4. Using a scripting task, (same result as using an expression build in bi-express product if you have that).  This task sets SQL statement variables that will be executed before (and after) the load.
  5. Executes the count sql statement just built by the scripting task by using the table-name variables in the package.
  6. Executes the select max surrogate key sql stmt built by the scripting task using the table-name (or if this is a fact table, we select the max audit-key)… this is to capture with absolute certainty the starting point of the load
  7. This is the ONLY unique thing in this package.  EVERYTHING else is driven by the variables and dynamic SQL.
  8. Executes the count sql statement just built by the scripting task by using the table-name variables in the package.
  9. Update stats (this is for dimension loads only) the fact table stats/index’s are maintained by a nightly script, that is too large to run at load-time
  10. Update the audit row, set the end-time, and success-ind

Conclusion

There are many reasons for consistency… audit and maintainability is a major benefit.  The audit will produce a row in an audit table that you can tie ANY row in the warehouse (dimension or fact) back to the specific execution of a package.  The audit table will give you execution times for packages (which is very useful to monitor ongoing load-times and bottlenecks).   It may seem like a lot of work up-front (it really is not), and since all of the infrastructure is driven by dynamic sql, and variables, there is nothing (other than changing the values in variables) that needs to be created to maintain a consistent audit process.

Tags: , ,

2 Comments on “SSIS – Load Audit”

  1. ivan November 28, 2013 at 10:14 am #

    Great one Stef!

    I am glad I came across this, it was very, very helpful in explaining the process.

  2. cornwall holidays cheap May 25, 2015 at 6:21 pm #

    One of the most widespread locations for
    anyone on holiday in Cornwall is the Eden Project, famous for its giant bio-domes
    nestled in a 50 metre deep crater.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: