The topic for T-SQL Tuesday #74 is “Be the Change”. This covers data changes, ETLs, data sanitisation, etc. Basically anything that changes your data.
I want to discuss the management of the SQL Server feature “Change Data Capture” (CDC). This is commonly used by ETL structures to incrementally load data warehouses. Typically, there would be an initial load of all the data into the DW. This is then followed by incrementally loading the changes to the live data to keep the DW up-to-date. CDC is the system that captures these changes.
CDC was introduced in SQL Server 2008, it was designed to provide a standardised and easy to roll system for capturing changes to your data. Almost all actions that happen in SQL Server are logged to the transaction log. This includes all INSERT, UPDATE, DELETE and MERGE statements. CDC reads the changes to your data from the transaction log asynchronously. This helps it to be as lightweight as possible as it is not locking any source tables to check the data.
The changes read from the transaction log are stored in change tables. These change tables can be found in a cdc schema that is added when CDC is enabled on the database. One change table is created for each table that is enabled for CDC. These tables can be queried by the ETL process to retrieve the changes to the data that it needs to apply to the DW.
These change tables can get very large very quickly if they are storing changes for tables whose data is altered frequently.
Luckily, a cleanup SQL Server Agent job is created when you enable CDC at the database level. The job will be named cdc.DATABASENAME_cleanup. This means that the data in the change tables has a retention period. By default the clean up job is scheduled to run every day at 2am and the retention period is 3 days.
3 days may sound like a lot but what if our ETL failed during a weekend or a holiday? Could you guarantee that you could get it back up and running before the changes you need to send to the DW meet their retention period? If you couldn’t, then you may need to start over and load all of your data again.
You could set a very long retention period which may be OK for small rarely changing tables, but the bigger the change tables get the longer it can take to load new changes and cleanup the oldest changes. Index maintenace can also be difficult to manage on these tables.
In some cases it may make sense to copy the data from the change tables to another database where it can be managed separatley from the live database. The ETL can pickup the changes from this “Change Database” and initiate the cleanup job to empty the change tables in the source database.
This method can allow you to have sensible retention periods and to persist the data long enough for your ETL to recover from an outage. The process of moving the data from the change tables to this “Change Database” needs to be very robust and heavily tested because if it fails you may find yourself back in the same place you started.
If the source database is in an Availability Group it is not possible to read the change data from a secondary. One very nice bonus to using this method is that the change data loaded to the “Change Database” can be read from a secondary if it is in an Availability Group. This allows the reads from the ETL to be offloaded to a secondary node.
The cleanup job needs to be configured to suit your environment. The cleanup job removes the expired change data from a change table in batches. Once it has removed all of the expired data, it moves onto the next change table. All the while the tables it hasn’t got to (and the ones it has completed) are receiving more changes.
If you have a lot of change tables (or even just 1 change table) with lots of data changing very regularly, then the defaults of the cleanup job may soon start to cause you pain.
I have seen cleanup jobs take over a couple of days to catchup after it was found that they had not been running or a huge amount of data was loaded.
Increasing the threshold setting on the cleanup job can dramatically improve the time it takes to delete expired changes but setting this too high may cause excessive locks on the change tables. Test, test and test!
The current seetings of your CDC jobs can be found with the following query:
SELECT * FROM dbo.cdc_jobs
The stored procedure sys.sp_cdc_change_job can be used to change these settings. I’m setting the threshold to 20,000 in the script below. Experiment with different values to find what’s right for you. See BOL for details.
EXECUTE sys.sp_cdc_change_job @job_type = N'cleanup', @threshold = 20000; GO
For more information read through this post by Kendra Little where she discusses some more of the challenges with using CDC.
Thanks for reading!