Replicating data into Amazon Redshift

One of our associates, Abdul, and I were tasked with implementing a replacement for some existing proprietary software that replicated data from SQL Server into Amazon Redshift.

The existing solution suffered from duplicate records and missing data, and could not keep up with the rate of change on the source system. It also ran poorly alongside other processes running concurrently on Amazon Redshift. We had just over a month to create something that would keep over 2000 source tables in sync on a daily basis.

The problem with the existing solution was that it had originally been designed for replicating to and from a standard, row-based RDBMS. MPP systems such as Amazon Redshift are columnar and do not enforce referential integrity, so they require a different approach.

We decided that the new system had to be:

1) idempotent. It could be scheduled to run once or a hundred times a day and have the same outcome.

2) set-based. Because of their columnar nature, MPP systems do not cope well with large volumes of row-by-row operations: reading every column for each row means scanning far more blocks on disk than necessary.

3) self-sufficient. With over 2000 source tables, the chance of something going awry is high, so the system needed a way to correct itself.

4) able to cope with the potential for duplicate records in a system that does not enforce primary keys.

5) faster than the existing solution.

To make it idempotent, the entire process that performs the complex bit (known as Change Data Capture, or CDC) was written to run inside a single transaction; it either works or it doesn’t.
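
A minimal sketch of what that transaction handling can look like, assuming the per-table statements are applied from Python over a psycopg2 connection to Redshift (the function and parameter names here are illustrative, not the actual implementation):

    import psycopg2

    def run_cdc_batch(dsn, statements):
        """Apply one table's CDC statements as a single transaction:
        either they all commit together or none of them do, so the job
        can be re-run any number of times with the same outcome."""
        conn = psycopg2.connect(dsn)   # psycopg2 opens a transaction implicitly
        try:
            with conn.cursor() as cur:
                for sql in statements:
                    cur.execute(sql)   # delete superseded rows, insert new versions, ...
            conn.commit()              # one atomic commit for the whole batch
        except Exception:
            conn.rollback()            # a failed run leaves the target untouched
            raise
        finally:
            conn.close()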

To get the most out of the MPP architecture, the transaction used a small number of outer joins, which meant entire tables were processed in one hit rather than a row at a time as in the existing solution. Deletes on Amazon Redshift tables that make use of the sort key are blisteringly fast, even across billions of rows. The single transaction had a step at the end to remove duplicates, should they ever occur.
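
For a single table, the set-based statements that run inside that transaction might look something like the sketch below. The table and column names are placeholders, the staged changes are assumed to carry a change_type flag marking deletes, and the join strategy shown is illustrative rather than the original code:

    # Placeholder names throughout; feed this list to run_cdc_batch() above.
    CDC_STATEMENTS = [
        # Remove every target row that has a newer version (or a delete) staged.
        """
        DELETE FROM public.orders
        USING stage.orders_changes
        WHERE public.orders.order_id = stage.orders_changes.order_id;
        """,
        # Re-insert the latest staged version of each row that still exists.
        """
        INSERT INTO public.orders
        SELECT order_id, customer_id, amount, updated_at
        FROM stage.orders_changes
        WHERE change_type <> 'D';
        """,
        # Safety net: collapse any duplicated keys down to their latest row.
        """
        CREATE TEMP TABLE orders_keep AS
        SELECT order_id, customer_id, amount, updated_at
        FROM (
            SELECT *,
                   COUNT(*)     OVER (PARTITION BY order_id) AS copies,
                   ROW_NUMBER() OVER (PARTITION BY order_id
                                      ORDER BY updated_at DESC) AS rn
            FROM public.orders
        ) ranked
        WHERE copies > 1 AND rn = 1;

        DELETE FROM public.orders
        USING orders_keep
        WHERE public.orders.order_id = orders_keep.order_id;

        INSERT INTO public.orders
        SELECT order_id, customer_id, amount, updated_at FROM orders_keep;
        """,
    ]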

The process ran many times a day and kept a high ‘watermark’ recording how far it had got with the data it had replicated. If a run ever failed, the next run would simply catch up.
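
The watermark bookkeeping can be as simple as one row per table in a small control table. A hedged sketch, assuming a control table called etl.watermarks (not the real name) that is updated inside the same transaction as the CDC statements:

    # Assumed control table: etl.watermarks(table_name, last_change_version).
    GET_WATERMARK = "SELECT last_change_version FROM etl.watermarks WHERE table_name = %s;"
    SET_WATERMARK = "UPDATE etl.watermarks SET last_change_version = %s WHERE table_name = %s;"

    def current_watermark(conn, table_name):
        """Read how far replication got on the last successful run."""
        with conn.cursor() as cur:
            cur.execute(GET_WATERMARK, (table_name,))
            return cur.fetchone()[0]

    def advance_watermark(conn, table_name, new_version):
        """Move the watermark forward as part of the CDC transaction, so a
        failed run keeps the old value and the next run catches up from there."""
        with conn.cursor() as cur:
            cur.execute(SET_WATERMARK, (new_version, table_name))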

To make it run faster, it had to do everything in parallel. The source system had 13 SQL Server databases, and data was extracted by a similar number of parallel SSIS jobs. On the Redshift side, over 150 Python processes loaded the data and ran the CDC process.
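
A much-simplified sketch of the fan-out on the Redshift side, using Python’s standard process pool (the worker count and function body are illustrative):

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def replicate_table(table_name):
        """In the real job: COPY this table's extracted files into staging,
        then run its single-transaction CDC batch."""
        return table_name

    def replicate_all(tables, workers=150):
        with ProcessPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(replicate_table, t): t for t in tables}
            for future in as_completed(futures):
                future.result()   # surface any worker failure rather than hiding it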

The original system had a four-hour window to replicate a day’s worth of change but rarely caught up. The new system takes 25 minutes to extract the data from 2000 tables in SQL Server, 15 minutes to load the changes and 15 minutes to process the inserts, updates and deletes.