You may find the following useful if you need to benchmark the volume of queries on a data warehouse.
We got interested in this problem because as part of defining a structured approach to benchmarking data warehouse performance we identified the isolated quantification of volume of queries as a key component. See Structuring Data Warehouse Benchmarks for more details.
The solution concepts were
Using a layered model to understand what the data warehouse needs to do in order to support business decisions, the primary metric for quantifying the workload is identified as how many instances are returned for each specified set of query results. For example, it is the sum of the ex VAT sales price for all items sold in all stores in February 2018. If the solution A serves this 1000 times and solution B serves this 4000 times and all other aspects are the same such as time distribution of queries, then solution B is serving four times the volume of solution A.
The above is solution independent - it doesn’t matter if solution B is a completely different architecture from solution A, this approach just needs to know the volume of queries requiring the sum of the ex VAT sales price for all items sold in all stores in February 2018. The underlying layers which occur are useful for analysis of volume aspects such as number of bytes of disk scanned but they don’t necessarily relate directly to the actual primary workload volume. For example, for some technologies we have the option to change the partitioning strategy which could result in radically less disk scanned for the exact same set of results returned.
Any supporting metric which shows higher volume is a actually showing a cost increase which should be justified by an increase in a corresponding primary volume metric. For example, a supporting metric showing higher volume of disk scan should be justified by a primary metric showing higher volume of query results returned. Without this increase in the primary metric, it might just be that the disk scans have increased because of the partitioning being less efficient.
If the ratio of a specific volume supporting metric to the primary volume metric is known to be stable, then the supporting metric can be used as a proxy to primary volume, but it is potentially very misleading to do so without this proof of stability as with the disk scan example above. It is therefore safest to always measure how many instances are returned for each specified set of query results.
The specification of a set of query results is unambiguously given by the actual SQL submitted. The instances of SQL submitted can be used as a proxy to the specification of the set of query results, if they have any solution technology specific aspects normalised (e.g. if SQL extensions used are different between technologies).
To simplify the quantification of volume as a primary metric, the primary metric for how much data is loaded into the solution (e.g. one year of history versus five years of history) is considered part of the “Other Benefits and Costs” shown in Structuring Data Warehouse Benchmarks rather than part of the volume metric itself. This is important because the supporting metrics such as disk scanned may be affected by the amount of data loaded when serving up the same volume of useful results. The purpose of the primary metrics are to show what the user experience is, and how much volume is being served, and what other benefits and costs are relevant. If we were to have two solutions which were identical other than that A had one year of history loaded and B had five, and we made them each serve the same number of queries from the one year of history, then they would have the exact same primary volume specification. This makes it simpler to see that if the query experience as described in Query Experience Benchmarking is worse for solution B versus solution A, then the solutions can be contrasted at a primary metric level by stating that for the same primary volume of queries, solution B gives a worse user experience but has the other benefit that it make five years of data available rather than one.
The solution in practice …
In its simplest form the following total counts of queries are used:
Count timed out or user cancelled
Because the types of queries could be changing, the total count is not informative enough, so it is normally useful to categorise these. When using the submitted SQL as a proxy to the query results specification, the main challenge is in allocating the SQL to categories in order to count them. This is not a problem in a controlled comparative performance test context where the same volumes are being run on each solution, but when it’s necessary to discover the volumes being run, it’s necessary to know how to allocate them. This is a fundamental challenge of production solution benchmarking with real workloads as distinct from performance testing.
This allocation is subjective and can best be worked out by asking the users “what kind of queries do you have” and then drilling into on “total” counts by being able to report on for example “total simple clickstream summary queries”,”total clickstream user journey analysis queries”, “total big sales transactions summary queries” and so on.
If it is practical for the solutions being benchmarked, it may be most productive to create a query benchmarking fact table with one row for each query run, and populate this from the solution’s system tables or logs which specify the query SQL and other information. Then SQL can be run on this query benchmarking table to add columns which specify how to allocate them to categories. Depending on the solution this could be done in the data warehouse itself, in which case it will also be able to see the impact of its own operation, or outside the data warehouse. It may not be desirable to log the individual results in a table for example if there are very large volumes of small queries which are lighter weight than the query logging itself, in which case a counters based solution may be more appropriate, but this gives less flexibility to drill into the results when analysing.
For some types of queries, it may be useful to make a type and a subtype, and there may be further levels of subjectivity with those subtypes especially where there is a greater degree of ad hoc SQL activity occurring. Having the results available as a query benchmarking fact table means the method of summarisation can be chosen after the queries are run, and this flexibility can be useful.
As well as comparing queries by type, this approach of using a query benchmarking fact table allows adding other information that may be of use to the primary metrics analysis. It’s also likely to be useful for supporting metrics analysis such as disk bytes scanned, but those metrics should be discussed elsewhere. When analysing the primary metrics, it should not show system maintenance operations such as Postgres vacuum, but for supporting metrics analysis the query benchmarking fact table can be useful for contrasting the workloads on maintenance operations versus the serving of useful workload.
Further useful information for the query benchmarking table, noting that these may be very different between query categories, and are suggested here as a way to discover if the characteristics of specific query categories are changing.
Total rows returned
If the queries have changed so more rows need to be returned, this is more workload
Date range in days (or other suitable time unit)
If the queries have changed to require more days of data then this is more workload
Look out for queries with an increased range that is outside the range of data included in the system. For example if the system has one year of history data and the queries typically used one year of scope, then if the scope specified in the SQL changes to two years this is unlikely to make any difference, but if it was reduced to six months then it might well make a difference.
Particularly for ad hoc queries there are some interesting characteristics of queries that might work as additional properties on the query benchmarking table, such as the number of objects joined. At the primary metric analysis level, it may be simplest to use such characteristics to define the type or subtype of the query (e.g. “ad hoc query on fact table x joined only to small dimension tables”), rather than adding the count of objects joined, to avoid spending too much time making a distinction between whether something counts as a primary or as a secondary metric. Then this frees the analysis up at the secondary metric level to report for example that queries of subtype “ad hoc query on fact table x joined only to small dimension tables” have gone from 100% hash join to 100% nested loops. There are techniques for analysing secondary metrics in the context of primary metrics that make this process more efficient and the conclusions more reliable.
Other Considerations and References…
Latency of data may be be different between solutions, this is simplest to present as another primary “Benefits and costs” item as described in Structuring Data Warehouse Benchmarks, along with time scope of history.
When analysing workload it’s relevant how the relative timings of queries will affect the user experience of the individual queries. For example if the first query of a particular type requires loading from object storage and the second can then run from a RAM cache if it occurs within x minutes of the first, then it becomes important to understand that time relationship in order to configure improvements. It is the purpose of the primary metrics to show the volume of workload as described here and also the user experience of individual instances of workload as described in Query Experience Benchmarking, but the primary metrics are deliberately kept independent of solution specific supporting metrics such as proportion of queries served from cache, because it is safest to avoid making any assumptions how supporting metrics relate to actual useful volume of workload served and corresponding user experience of individual instances of workload. There are techniques that can be described separately for analysing the primary metrics by time that provide a template approach for understanding how volume and experience vary with time and with each other. Once this is understood then the options for improvement can then be informed by analysing the relevant secondary metrics in relation to the primary metrics.