SSIS – Package Configurations

Introduction

SSIS provides a robust method for making packages portable between environments such as Development –> Testing –> Production.  This post will hopefully get you started with the incredible number of things you can control, scale, and make your packages portable.  I am focusing mainly on password management, however, you will also see a variety of other items that can be set in configuration files.  There are two places you can store your configuration data, first being in the database, the second being an xml file.  There are advantages to each, I have chosen to use an XML file to isolate all of the password info into the file, and not rely on either windows authentication, or keeping a password in the package to obtain the remaining information from the database.

I have chosen to have a config file for each package.  When I began this with SSIS-2005 you could not have connections in an XML file that were not used, so a single all-encompasing file was simply not possible.  With SSIS-2008 (sp1) *prior to sp1 there are still issues with this erroring at runtime* you are able to create a single file with all of the connections, even if they are not used in that particular package.  You will get a warning that there are connections that are not used, but the package will execute.  Even though this functionality is now supported, I would still recommend a file for each package, this will allow for better control over some of the other items you will be setting in a config.  For example, if you want to change the “defaultbuffermaxrows” for a dataflow (which you will), you need to be sure that all of your dataflows have unique names if you use a common xml file, additionally, you can impact other packages if you make a syntax error when you make changes.  One xml file for each package allows for flexibility, isolation, and risk reduction. That is a good trade for the VERY small amount of extra work to create the individual files.

Make it all work (Connections)

If you have read any of my other posts, you will see that I am big on naming conventions, this is no exception.  Do yourself a HUGE favor, and come up with a standard, and PLEASE don’t include the server names in the name of the connection.  You (might) have different server names in development –> test –> production.  You (might) in development host a series of source databases in a single instance that in production are on different machines etc, please keep this generic.

I have used:
[type].[server].[database].[user]
type = Ado/Native/ODBC/File whatever your connecting to
server = logical server name such as dw_db (Data Warehouse Database) dw_etl (Data Warehouse ETL server) these might equate to DB links that you should keep generic as well.
Database = What your connecting to
user = who am I connecting as

This will allow you to understand everything about a connection without making assumptions.  This is good for both the connection manager, but more importantly when you are editing your XML file later.

Package Setup

At the package level, because I have chosen to host ALL of my password information in the xml file, I set the protectionlevel to ‘DontSaveSensitive’.  The default of ‘EncryptSensitivewithuserkey’ gets people into trouble…. if you imbeded the password into the connection with that setting… YOUR windows domain account is the ONLY account that can access that password information.  When you migrate to production, and execute the package from a Job (unless your account is running the database service) the job will fail because the password information is encrypted and cannot be accessed by that account.

Package Configurations

Right-click in the empty space on the control-flow surface, and select package configurations…. you will get a wizard to walk through the rest of the steps to build a config file.

 

This screen you select the XML (or database) configuration, and the file-name (or variable) for the configuration file.  I have selected “N” as I have a local drive on my ETL server that the configurations are stored on.  This allows the configurations to be stored on the server with the production passwords, and the ONLY access to those files would be the network-admin that has access to the server.  Since my environments are consistent, the N drive is available I chose this method.  You can setup an environmental variable on the server that you could set to “N” on one machine and “X” on another, for my environment, the option shown works best for me.

Next, you can select what items are included in the configuration file, I simply select everything for the connection managers, this will allow you to configure your connection strings, passwords, user-id’s, server-names, everything you need for your connections.

Spend some time looking through the list of things you can customize (there is A LOT) and you can get pretty creative if you need to.  The other thing that I do is for each data-flow, I select the “buffertempstoragepath”, “Defaultbuffersmaxrows”, “defaultbuffersize”.  (There are a few others that may make sense….but for now lets keep it simple).  As you move from Development (which in my case is a desktop machine), to a testing environment (which is not sized for speed/performance), to production these values will change.  My default buffer settings are MUCH smaller on my dev machine than they are in production.  Another post, for another day I will talk about the tuning opportunities and how to calculate all this stuff…. but if you include it here, your package can automatically scale to take advantage of the hardware you are deploying it to.

Once you are done with this, you can name the config if you want to support environments for package generation, environment builds etc, you can leave it as configuration 1 to start with.  This will depend on how you deploy packages, and if you need to use the package deployment utilities, or if you manually deploy packages to each environment (which is what I do).

Now that you have an XML file…

  1. Open it in BIDS, select Edit/Advanced/Format Document
  2. Find the configuration strings for your connections (you will see sections that contain the names you have in the connection manager)
    1. Add Password=xxxxxxxx; (whatever your password is for the environment you are going to move this configuration to) for each of the connection strings
    2. Set the User ID=XXXXX; to the appropriate user-id (I have different users in dev/test/prod)
    3. Set the Data Source=xxxxxx; to the appropriate server
  3. In the same section, you will see additional properties
    1. Properties[Password]  (set the configured value)
    2. Properties[ServerName] (set the configured value)
    3. Properties[UserName] (set the configured value)
  4. Set the values you wish for Buffertempstoragepath  (This is where sorts happen when you run out of memory space)
  5. Set DefaultBufferMaxRows (according to which environment you’re moving this file to)
  6. Set DefaultBufferSize (according to which environment you’re moving this file to)

Save the XML file… and you are ready to go…. the location you picked in the second step of the wizard is where this file needs to be, and will be picked up at runtime to obtain the password info (and anything else that you override).

SSIS Checksum – a must have skill

Overview

When loading from source systems that don’t yet have the SQL Server 2008 Change Data Capture (CDC) enabled, using a checksum calculation to determine if data has changed is a critical Data Warehouse load concept to understand.  This will allow you to read the source, that can have dozens of columns, and with a single lookup, and a single compare (of a integer value) and determine if data within that vast source row has changed.  Not to mention speed…. this method is much faster than a traditional compare of many fields. I have build a very simple example (3 columns in the source), will check if the data has changed in 2 of those columns, and perform the lookup/insert logic based on the outcome.         

Setup

You will need to install the checksum transformation[edit] –> as a result of this blog post, there were some comments about another module to create hash-values, also a worthwhile component to look at! Available from codeplex. <– [edit] I have included the sample tables, and data here.         

The package

              

  • OLE DB Source – reads the src table
  • chksum transformation calculates the checksum
  • derived column is setting rec_create_date and rec_mod_date with a getdate()
  • Lookup is using name to find the row in the target, if it does not exist, insert otherwise return checksum from the database
  • conditional split, check to see if the calculated checksum (based on the source) matches the checksum from the lookup
  • OLE DB Command, update the row (the checksum’s did not match)
  • OLE DB Destination, insert a new row

       Now a look a little closer at the main components of this package.          The assumption here is that “fname” is the key value for the row, if that has changed, we want to insert a new row.  If there is any change to the web_address, or email_address for this row we want to update the row.  This will allow us to do a compare without having to worry case sensitivity, null values, varchar comparisons to multiple fields etc.  Also, when you add an additional field to the source, all you need to do is add to the check-box here, and automatically all rows will be tested with the new column.     The algorithm here is the default crc32, there is also a framework checksum (md5), and original.  I would generally use the md5 version (although I forgot to change the default in this example), as, particularly with large calculations it will avoid a possible collision in the number.             The lookup is based on first-name and returns the value for the previously calculated checksum into db_chksum.  Notice that the other fields from the databse are not being returned for the comparisons, we only need the checksum                 The ENTIRE row (in this case 2 fields, but there is no limit to this) is now checked with one simple compare of the newly calculated checksum, and the checksum returned from the database.  If nothing changed… nothing will be done…. if anything changed, update all the data in the row.                  The insert – simple (just note I do not map the rec_mod_date on the insert only the rec_create_Date) to allow us to understand when rows are inserted from the source vs updates.  Normally fact loads also have additional audit info, but that is another topic.                  The Update – also simple (note that we are updating the whole row including the checksum value) The rec_create_date is not touched here, but rather the rec_mod_date is update to indicate the modifications.                          

Putting it together

On the first run, the 1 row runs down the insert leg, and inserts a row with a checksum value of: 1409434363 If you run the package again, without changing data, nothing happens, because the calculated value on the next execution matches the value already in the database.        Then…       

UPDATE sxb_chksum_src SET web_address =
http://StefBauer.wordpress.com/&#8217;  

      
 

Notice the only thing changed is the capitalization of the web address, but that is enough to create a new checksum value.   Run again… and the checksum value from the input  is -252803585 which does not match the database value, and the row runs down the update leg. You will now see the rec-mod-date update, the source values are updated, and the new checksum is stored as well.     

 
 

 
 

         
 

  
 

     
 

  
 

   

Follow

Get every new post delivered to your Inbox.

Join 373 other followers