Saturday, September 3, 2011

Incremental Load in SSIS

Problem Description: Perform Incremental Load using SSIS package. There is one Source table with ID (may be Primary Key), CreatedDate and ModifiedDate along with other columns. The requirement is to load the destination table with new records and update the existing records (if any updated records are available).

Soultion:
You can use Lookup Transformation where you compare source and destination data based on some id/code and get the new and updated records, and then use Conditoional Split to select the new and updated rows before loading the table.

However, I don't recommend this approach, specially when destination table is very huge and volume of delta is very high.

You can do it in simple steps:
  1. Find the Maximum ID & Last ModifiedDate from destination and store in package variables.
  2. Pull the new and updated records from source and load to a staging table using above variables.
  3. Insert and Update the records using Execute SQL Task
Here is the an step-by-step example to do this:
 
STEP1:
Create a new Package IncrementalLoad.dtsx and add following package variables:








VariableName Description Examle
DestinationConnStr Connection string for
destination server/db
Data Source=(local);
Initial Catalog=TestHN;
Provider=SQLNCLI10.1;
Integrated Security=SSPI;
Auto Translate=False;
MaxID Max ID from
destination table
100
MaxModifiedDate Max Date from
destination table
2010:11:10 11:50:20.003
SourceConnStr Connection string for
source server/db
Data Source=(local);
Initial Catalog=TestHN;
Provider=SQLNCLI10.1;
Integrated Security=SSPI;
Auto Translate=False;



STEP2:
Create Source, Destination, and staging tables using below code:

-- Source Table (create in source DB)
IF OBJECT_ID('dbo.TestSource','U') IS NOT NULL
DROP TABLE dbo.TestSource
GO
CREATE Table dbo.TestSource
(
  [ID] [int] IDENTITY(1,1)
  ,[Code] [varchar](10)
  ,[Description]  [varchar](100)
  ,[CreatedDate]  [datetime] NOT NULL default GETDATE()
  ,[ModifiedDate] [datetime] NOT NULL default GETDATE()
)
GO

-- Destination Table (create in destination DB)
IF OBJECT_ID('dbo.TestDestination','U') IS NOT NULL
DROP TABLE dbo.TestDestination
GO
CREATE Table dbo.TestDestination
(
  [ID] [int]
  ,[Code] [varchar](10)
  ,[Description]  [varchar](100)
  ,[CreatedDate]  [datetime] NOT NULL
  ,[ModifiedDate] [datetime] NOT NULL
)
GO

-- Staging Table (create in destination DB)
IF OBJECT_ID('dbo.TestDestinationSTG','U') IS NOT NULL
DROP TABLE dbo.TestDestinationSTG
GO
CREATE Table dbo.TestDestinationSTG
(
  [ID] [int]
  ,[Code] [varchar](10)
  ,[Description] [varchar](100)
  ,[CreatedDate] [datetime] NOT NULL
  ,[ModifiedDate] [datetime] NOT NULL
)
GO

STEP3:
Create two OLE DB Connection Manager, one for Source Server and another for Destination Server.
In the connection manager properties, set the expression ConnectionString with respective variables as shown below:

STEP4:
Drag and drop Execute SQL Task and name it -
Execute SQL Task - Get Max ID and Last ModifiedDate.
Double click on EST and set  following properties:
ResultSet = Single row
Connection = Destination Server
SQLStatement =
SELECT
   ISNULL(MAX(ID) ,0) MaxID,
   ISNULL(MAX(ModifiedDate),'2000-01-01') MaxModifiedDate
FROM TestDestination (NOLOCK)

Drag and drop another Execute SQL Task to Truncate staging table. Rename the task - Execute SQL Task - Truncate staging Table and set following properties:
ResultSet = None
Connection = Destination Server
SQLStatement = Truncate Table dbo.TestDestinationStg

 
Drag and drop Data Flow Task and name it Data Flow Task - Pull New and Updated rows.
Double click on DFT or click on Data Flow tab.
Now drag and drop OLE DB Source and select Source Server in OLE DB connection manager, select SQL command in Data access mode, and write following T-SQL code in SQL command text:
SELECT [ID],[Code],[Description],
[CreatedDate],[ModifiedDate]
FROM TestSource
WHERE [ID] > ? OR [ModifiedDate] >= ?
 
Click on Columns to generate metadata and set Parameters for above query:
 

Now drag and drop OLE DB Destination task and select Destination Server in OLE DB Connection manager, Table or view - fast load in Data access mode, and dbo.TestDestinationSTG in Name of the table or view.
Now click on Mapping to map the metadat with source columns.

Drag and drop one more Execute SQL Task and rename it to Execute SQL Task - Insert and Updated new rows and set following properties:
ResultSet = None
Connection = Destination Server
SQLStatement =
-- INSERT New Records
INSERT INTO TestDestination
([ID],[Code],[Description],
[CreatedDate],[ModifiedDate])
SELECT
   [ID],[Code],[Description],
   [CreatedDate],[ModifiedDate]
FROM TestDestinationStg
WHERE ID > ?


--UPDATE modified records
UPDATE D
SET D.[ID] = S.[ID]
,D.[Code] = S.[Code]
,D.[Description] = S.[Description]
,D.[CreatedDate] = S.[CreatedDate]
,D.[ModifiedDate] = S.[ModifiedDate]
FROM TestDestination D
JOIN TestDestinationStg S
   ON S.ID = D.ID
WHERE
   S.ID <= ? AND
   S.ModifiedDate > D.ModifiedDate

Click on Parameter Mapping and set the parameters as shown below:

Finally package will look like below snapshot:
 

STEP5:
We are done with package development. Its time to test the package. I will test this package in three steps:
1. Insert few dummy records in Source table while keeping Destination table empty.
--CASE 1: First Time Execution
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('AAA','American Automobile Association')
WAITFOR DELAY '00:00:01.100' -- delay between two rows
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('ABC','Associated Builders & Contractors')
WAITFOR DELAY '00:00:01.150'
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('RR','Road Runner')
GO

Now execute the package and check the destination data. You can add Data viewers in Data Flow to see the result at run time as shown below:

2. Insert few more records in Source table to check whether new records are inserted into Destination table.
--CASE 2: Only New Records
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('TSL','Trina Solar Limited')
WAITFOR DELAY '00:00:01' -- delay between two rows
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('FSLR','First Solar, Inc.')

Now execute the package and check for new records:

3. Insert few more records and update few existing records in Source table and check whether both table are in sync.
--CASE 3 -- New & Updated Records
INSERT INTO dbo.TestSource ([Code],[Description])
VALUES ('LDK','LDK Solar Co., Ltd')


UPDATE dbo.TestSource
SET [Description] = 'Associated Builders and Contractors',
ModifiedDate = GETDATE()
WHERE [Code] = 'ABC'

Finaly check source and destination tables for match. Since I have both Source & Destination tables in one server, I can use EXCEPT command (for this example) which retunrs no records i.e. both the tables are in sync.


Cheers!!!

12 comments:

  1. Great article with nice example... thanks a lot.

    ReplyDelete
  2. Why do we need to take the max(ID) from destination?? Max(ModifiedDate) From destination is fine ..rt? Then using this Max(ModifiedDate) , retrieve records from Source where the modifiedDate > Max(ModifiedDate) .

    Or if you are using 2008 R2 , you can use CDC feature

    ReplyDelete
  3. There are several ways to do that Manuraj :)
    I agree your solution will work for the incremental load, albeit, you would require to handle which record to insert and which to update, which is kind of similar to your another suggestion of using SCD.

    Preference of the solution should depends on the performance and then simplicity. How about you share some statistics on the performance of different approaches so that everybody get benefited out of that!

    ReplyDelete
  4. Nice one really it helps to understand Incremental load.

    Thanks
    Madhu

    ReplyDelete
  5. Thank u for posting the incremental load in detail..If you perform deletion also it may be helpful to differentiate easily incremental load with scd transformation because deletion is the main difference between the scd transformation and incremental.

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. Thanks for the tutorial which makes it easy to understand the incremental load procedure. However in the tutorial you are incrementally loading a table, Can we do an incremental load of a database? Any tutorial or help will be appreciated.

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
    2. A database consists of many objects like stored procedures, functions, views and also tables. Loading is something that happens only with tables, as long as you have the appropriate schema in the destination, you can loop through source tables using INFORMATION_SCHEMA and place the single table load logic in the iteration. Stage some auditing once for a batch you prefer to keep track of load time.

      Delete
  8. A nice article here, i think that people who have grown up with the idea of using computers are showing more responsibility towards writing posts that are thoughtful, do not have grammar mistakes and pertinent to the post..

    MSBI Training in Chennai

    Informatica Training in Chennai

    Dataware Housing Training in Chennai

    ReplyDelete

Here are few FREE resources you may find helpful.