We created a mechanism that we called "The Federator" for making data processed on one Redshift cluster be available on other Redshift clusters. This post introduces what we did.
Our client had a Redshift cluster with dw1.xlarge magnetic nodes. Using a Python-based framework, it ingested raw data from S3 and then processed a subset of that data into facts and dimensions.
The users were primarily interested in the facts and dimensions, but there were some valid use cases where they would need to access the raw data.
We were using the magnetic nodes because there was such a lot of raw data that the cost of SSD nodes would have been too high.
There was a requirement to increase the amount of query capacity available to the end users, primarily on the facts and dimensions. We were asked to consider options for transferring this data from the primary "transform" Redshift cluster to a "reporting" cluster. This reporting cluster would not contain the raw data, and so the total data volume required would be small enough for us to be able to use the SSD node type.
- Data must be available on the reporting cluster as promptly as reasonably possible - as data was being loaded and processed throughout the day when it became available, an overnight refresh would have been inadequate.
- Operation of the process must be as low effort as possible.
- The mechanism must be robust and self-healing in the event of any problem.
- It must be possible to have more than one reporting cluster.
- The overhead of offloading the data from the primary transform cluster must be kept as low as possible.
- It must be able to deal with 70+ tables, including some fact tables with over 7Bn records.
With the constraint that we must be able to support more than one reporting cluster, and minimise the overhead of offloading from the primary transform cluster, we realised that we needed to decouple the offloading from the primary transform cluster from the loading into the reporting clusters.
We achieved this decoupling through the model in the diagram below, where we published data from the transform cluster onto S3. The reporting clusters then subscribed this data from S3.
This allowed us to publish data once from the transform cluster, and use it in multiple reporting clusters, each of which could subscribe that data independently from S3. Thus each cluster was decoupled from every other, and a failure of one cluster would not affect another.
Selective Publication and Subscription
To minimise the effort required to manage the process, and make it robust and self healing, while also minimising the overhead on the system, we implemented the publish and subscribe mechanisms so that they would run for each table at a time and check if they actually needed to take any action.
When publishing data from the Redshift DB onto S3, our Python script followed the following idempotent process.
We selected the current record count, maximum insert and update timestamps from the Redshift table, which is an efficient process even with large tables because of its architecture. We were able to compare this directly with what is on S3 without loading the data because the filename convention we use on S3 includes the record count and maximum insert and update timestamps.
Because we were checking the current state on Redshift with the current state on S3, we had a robust process that is self healing and can be run at any time.
We had a corresponding process for subscription. It's worth pointing out that this mechanism meant that we can create a new empty cluster at any time and it will bring itself up to date with S3.
Because the reporting clusters were decoupled from the transform clusters, it was a lot easier to make changes such as when we rebuilt the transform cluster - we subscribed the facts and dimensions from S3 onto the new transform cluster, effectively synchronising it with the old transform cluster. Then we cut over to processing and publishing from the new transform cluster. The configuration on the reporting clusters didn't have to be changed - as far as they were concerned, they were still subscribing from the same place on S3.
It is essential that the transform and reporting clusters have the same table definitions in order for this process to work. As discussed in previous posts, we had robust mechanisms for this already. We we were able to deploy code changes to the reporting clusters as a downstream dependency on the deployment to the transform clusters.
We developed a particular approach for dealing with very large fact tables efficiently, which may be the subject of a future post.