Automated Testing of an Enterprise Data Warehouse - Best Cost/Benefit Ratio

Overview

We created a lot of different types of automated tests for a recent enterprise data warehouse. We used Nose for the Python-based ELT orchestration framework, and DbFit for business logic testing. Jenkins was used to run all of these.

There is a lot to be said about what we've learned about business logic testing in this context, given the particular challenges of testing data warehouse environments. We would write a book about it if we had time, but pending that happening, you could contact us on info@cloudbilimited.com and we'll help you. 

We found one particular type of test to provide particularly high benefits for very limited effort.

The Challenge

We had several dozen independent processes which would run as part of the ELT process. These had components which were metadata files, and components which were SQL scripts. In addition, part of the codebase specified how the database should be built.

Problems could occur when a team would make a change to the database to suit a modification to an ELT process, and not realise that this would break another ELT process.

For example, we remove an attribute from table A because it is no longer needed for a new version of process 1. This attribute is used by process 4, but we didn't realise the dependency. If we try to run process 4, it will fail with an SQL error because a column is missing from a dimension table.

Breaking a Process

Breaking a Process

 

The Solution

Some of these kinds of problems can be picked up by unit tests, particularly where metadata is used to define the processes. But it can be challenging to get full coverage.

The solution we used was to use a script to automatically generate a very simple DbFit test page for every process. This test would run the process in an empty data warehouse test environment. Because there was no data, the test would run very quickly. It would discover if there were any processes which were broken due to modifications to the database, or syntax errors.

We had the DbFit test page display the log message for the process. Jenkins ran a complete set of these tests periodically, and so if there were ever any broken processes, we would be notified by Jenkins, and we could go and look at the error message in the log on the DbFit test page in the environment controlled by Jenkins. Typically it would show that a syntax error had been introduced into the code, or that a table definition had been changed such that a process had been broken.

Jenkins/DbFit Test Results

Jenkins/DbFit Test Results


Recommendations

There are lots of more sophisticated ways to prevent problems with syntax errors and dependencies in data warehouse code, but this simple approach of automatically checking the code actually runs end to end without error is a valuable, robust and low effort way to find out about problems sooner than is possible when running tests which actually use data.