Lightweight Completeness Checks

You may find the following useful if you need to set up some checks for completeness of processing.

We got interested in this problem because we needed an alert to be raised if any files or records were going missing in a data warehouse ingest process.

The solution concepts were:

  1. Use the log analysis system to alert on mismatches in volumes processed between measurement points - avoiding new systems saves setup time, and you may already be collecting the necessary data

  2. Count individual log entries within the log analysis system rather than specifying aggregate counts to be collected at particular points - if data volumes are not too high for this approach, it gives more flexibility than comparing pre-aggregated counts

  3. Compare logs in one system with logs in another to find mismatches - if you have the individual log entries you can compare sets to see whether items reported at one measurement point are missing at another (see the first sketch after this list)

  4. Expand the comparisons to compare logs with summaries extracted from the database, on a schedule controlled by the log analysis system - extracting line-level detail from the database might not be practical (though sometimes it is), but you can compare a SELECT COUNT(*) on a table with the appropriately bounded count of log entries and raise an alert if there is a mismatch (see the second sketch after this list)
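
As a sketch of point 3: assuming each file produces a log entry at each stage, and using made-up index, sourcetype and field names (ingest, extract_log, load_log, file_name), a Splunk search along these lines lists files seen at one measurement point but not the other. The real field extractions will depend on your log formats.

    index=ingest (sourcetype=extract_log OR sourcetype=load_log)
    | eval stage=if(sourcetype=="extract_log", "extracted", "loaded")
    | stats dc(stage) AS stages_seen, values(stage) AS stages BY file_name
    | where stages_seen < 2

Any row that survives the final where clause is a file reported at only one stage, and the stages field shows which one.

For point 4, if the Splunk DB Connect app is available and a warehouse connection has been configured (the connection name redshift_dw, the table and the sourcetype below are again made up), a scheduled search can compare a SELECT COUNT(*) with the count of load log entries over the same window and keep only the rows where they differ:

    | dbxquery connection="redshift_dw" query="SELECT COUNT(*) AS db_rows FROM staging.orders WHERE load_date = CURRENT_DATE"
    | appendcols
        [ search index=ingest sourcetype=load_log earliest=@d latest=now
          | stats count AS log_rows ]
    | where tonumber(db_rows) != tonumber(log_rows)

Most of the tuning effort goes into making sure the time bounds on both sides describe the same batch of data.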

[Figure: Alert On Mismatch]

The Solution in Practice

We used Splunk to monitor a Redshift data warehouse and found it was very quick to set up alerts based on log entries and database summaries, once we had got used to the syntax.
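
For reference, an alert created through the UI is stored behind the scenes as a savedsearches.conf stanza along the lines of the sketch below (the stanza name, schedule, time window and email address are made up; the attribute names are worth checking against the savedsearches.conf documentation for your Splunk version):

    [DW ingest - file extracted but not loaded]
    search = index=ingest (sourcetype=extract_log OR sourcetype=load_log) \
        | eval stage=if(sourcetype=="extract_log", "extracted", "loaded") \
        | stats dc(stage) AS stages_seen BY file_name \
        | where stages_seen < 2
    # run hourly over the last day and alert if any rows come back
    enableSched = 1
    cron_schedule = 0 * * * *
    dispatch.earliest_time = -24h
    dispatch.latest_time = now
    alert_type = number of events
    alert_comparator = greater than
    alert_threshold = 0
    actions = email
    action.email.to = dw-alerts@example.com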

Other Considerations and References…

On the project where we first used Splunk like this, we had been heavily involved in building deployment mechanisms that used version control (Bitbucket), automated testing (FitNesse/DbFit), CI/CD (Jenkins) and automated deployment (a custom sequential build mechanism similar to Liquibase). Our first instinct was to treat our controls the same way and invest effort in heavyweight deployment mechanisms for them, but we realised it was so easy to set up the rules in the Splunk UI that any extra machinery would have got in the way.

We could justify this approach because the controls we were applying were passive, on-the-fly analysis of logs and the database, so if we ever got a rule wrong we wouldn't have days of corrupted data to recover, as we would with a faulty ETL process; we could simply correct the rule and rerun it. The better way to add safety was to make sure there was some overlap in coverage between rules.