You may find the following useful if you want to structure the analysis of performance of your data warehouse in order to support tuning as part of improvements work on the same platform, migration to a new platform using the same technology or migration to a new platform on different technology.
We got interested in this problem because we were simultaneously involved in three projects giving us examples ranging from benchmarking to replatforming onto a radically different technology, and the solution concepts were useful to all three.
The solution concepts were
Organise the analysis by each of these categories:
Query Experience and Volume
User Experience - how quickly does it return for an individual query
Workload Volume - how much of the above is running
Other Benefits & Costs
Sometimes this is as simple as the relative cost of running the relative number of nodes when changing cluster size, but it could be a more involved analysis especially when considering replatforming to another technology. It’s a good idea to work out what benefits and costs need to count as being in scope of the benchmarking exercise rather than of a replatforming project itself.
Query Configuration - such as workload management settings
Query Performance - underlying indicators of efficiency at the query level such as queue time
Data Model Configuration - logical and physical details such as aggregate table approach, distribution style
Data Model Performance - underlying indicators of efficiency at the table level such as proportion of unsorted storage
Platform Configuration - system details such as node count
Platform Performance - underlying indicators of efficiency at the system level such as CPU utilisation
The solution in practice …
Focusing on a relatively small number of primary metrics that are valid for comparison between solutions makes it more straightforward to identify benefits. An understanding of the supporting metrics can be key for identifying how to improve performance of a particular solution, but they are only useful if they can drive a change in the primary metrics.
Measuring the user experience and the workload volume separately allows us to see how the user experience changes with workload volume. This can vary over the short term, for example high volume and poor performance on a busy Monday morning versus low volume and good performance on a quiet Sunday afternoon, and also over the long term, for example between January and August the workload volume could have increased greatly which then gives a better understanding of why performance could have decreased. There are some useful specific considerations for metric definition that help with this that we could detail outside the scope of this post.
Making best use of the supporting metrics can require a lot of platform specific understanding, and each platform will have its own best practices. It could be argued that there are some common themes but given the range of different technologies (Amazon Athena versus on premise Teradata for example) we would generally suggest pursuing these independently.
Other Considerations and References
When discussing benchmarking a data warehouse, make the distinction between benchmarking the platform for a generic workload versus analysing performance for the actual workload. When analysing an actual workload we need to discover what is actually running, then we can tune the platform based on an iterative cycle of analysis focused by the workload, modifying the workload (where possible), and modifying the platform. With generic workload analysis we already know what we’re running, the analysis can be equally generic, we don’t modify the workload, and we only need to tune the platform to find the best performance. Actual workload analysis is a more involved problem. It’s also potentially higher value add, because it relates more directly to implementing actual performance improvement.
This analysis does not relate to ingest performance, which has other considerations such as availability latency which need specific treatment.
It also does not include availability of the platform as a whole - these techniques assume the system is available to register failed queries, for example. When providing a SaaS solution, quoting an SLA such as 99.5% uptime that spans the entire platform (see example) is a specification of for how much time the system will be able to provide a separately specified volume of workload of a level of user experience.
It does not relate to SLAs for data quailty as per https://searchbusinessanalytics.techtarget.com/news/2240111188/Service-Level-Agreements-for-Data
It is not an overall general management activity as per http://www.eiminstitute.org/library/eimi-archives/volume-1-issue-5-july-2007-edition/measuring-the-data-warehouse
It is not an analysis of how many people are actually using the platform as a proportion of those who could, i.e. a user engagement mechanism. These techniques look at the total workloads but don’t analyse whether those are the total workloads that should be running to best support the business.