You may find the following useful if you need to benchmark the user experience of running queries on a data warehouse.
We got interested in this problem because as part of defining a structured approach to benchmarking data warehouse performance we identified the isolated quantification of user experience of running queries as a key component. See structuring data warehouse benchmarks for more details.
The solution concepts were
Focus purely on quantifying what the end user experience is, and don’t attempt to explain anything about why the performance is what is is.
By avoiding dealing with why, these primary metrics are suitable for comparing the user experience between platforms because they are not platform specific
The supporting metrics described in structuring data warehouse benchmarks are platform specific and are useful for explaining why - this needs to be done separately
Metrics are in the first instance in terms of averages and proportions and do not attempt to describe the total volumes, so that they can describe the end user experience for submitting an individual query independently of the total volume of queries run.
Techniques for analysing the total volumes are described separately
Only valid queries are considered. We exclude queries with invalid syntax for example, because users (or autogeneration mechanisms) submitting invalid queries is not something that informs us about cluster performance.
The solution in practice …
There are three main types of metric we would propose:
Proportion of queries submitted with submit to complete within SLA time (higher is better)
SLA may be an informal measure - for dashboarding solutions it’s straightforward to make it formal due to the standard nature of the queries but for ad hoc analysis the workload can vary widely
A perfect situation is to get 100% on this metric and then work out how to reduce the cost of running the cluster
If you have to pick only one metric as an indicator of cluster experience then use this one
So that this can be as comprehensive a measure as possible it should be calculated as a proportion of total valid queries submitted rather than queries completed, so that if there are timeouts or cancellations then the metric goes down, indicating worse state.
If there are changes in the quality of queries submitted then timeouts could increase, or if the users start cancelling more (perhaps they get a new UI that has a bigger cancel button!) then this could mask understanding of changes in the performance of the queries that do complete
If the above happens, then it is useful to show % that completed that completed within SLA, because if the fluctuations in query cancelling performance are larger compared to the size of the queries that run longer than SLA, then the graphs will look like they are driven by performance fluctuations but really they’re driven by user decisions to cancel queries.
It’s not easy to tell if users cancel queries because they realise they wrote the wrong query or because they got frustrated waiting and decided to write a different one. Likewise timeouts can be due to user error. It’s also worth looking out for automatic cancellations from query tools. If there are difficulties getting a stable metric in this area it could be because the type of activity is too variable to support an SLA, and that type of access could best be reported separately.
xample presentation of detail showing a primary metric of 74% submit to complete within a 2 second SLA
Not all types of queries have to have the same SLA times, so you can report them separately and/or report a weighted average if that gives a better indication of true cluster health
You can also report proportions with submit to complete within an informal higher performance SLA, or variants on that idea, such as reporting x% complete within SLA, y% complete “impressively fast” and z% complete “almost instantly”, where x% > y% > z%.
We’d be sceptical about spending too much time on this but possibly it works quite well to have two specific extra categories - could be useful to call out the y% that go noticeably faster than SLA (this could be half SLA by standard for example) and the z% that go almost instantly (this could be fixed at 1 second for example, though it could depend on the context)
If there is some kind of specification of required average and worst case response times such as suggested in Data warehousing service level agreements then the worst case value is the x SLA where you are aiming for 100% compliance and the average value is where you are aiming for 50%, or adjust that if the specification is for the mean rather than the median. If you’re supposed to be just reporting the worst case and the mean (or median), then you should show these directly (see additional metrics below) as well as relate to them in the proportions within SLA levels
In some circumstances this could be very informative - if you’ve replatformed to a technology that has an additional caching layer and you’re keeping to a conservative cluster size to reduce costs then you could keep proportion within SLA constant (indicating customer experience is still as reliably acceptable) but you would notice that z% increased (for example) from close to 0% to 50%, showing that half of the queries are now served in negligible time.
The particularly useful aspect of multi level SLA analysis is that because it uses proportions within SLA rather than mean averages, it better reflects the importance of the queries that are outside SLA and reporting on “nice to have” levels of performance does not corrupt the understanding the basic SLA performance requirement. Why averages suck and percentiles are great has a great explanation of how the mean can be dragged below the median by a proportion of fast queries (or transactions as per that example) which supports the benefits of being wary of using simple averages.
Proportion of submitted queries that are aborted or timed out (lower is better)
Average time submit to complete (completed queries only) (lower is better)
Includes any queuing plus execution times - as this is a primary metric we are ignoring platform specific metrics such as queue times
Plot the percentiles to show performance if specific SLAs are not available
It’s useful to see both the mean and the median
Failed/cancelled queries can’t be included in average time calculations - be cautious about them being accidentally included in queries and corrupting the results perhaps by having zero times to complete. This is where the proportion completed within SLA time is more useful.
Other Considerations and References…
There are some useful techniques for analysing the user experience versus time and and versus workload that should be discussed separately.
It’s not clear whether users get better or worse at doing valid queries when the cluster performance is better.