Benchmarking Query Experience Over Recurring Time Periods

You may find the following useful if you need to analyse how the user experience of a data warehouse varies over recurring time periods such as the working week or financial reporting periods.

We became interested in this problem because, as part of defining a structured approach to benchmarking data warehouse performance, we needed a standard approach for analysing Query Experience over time. See Structuring Data Warehouse Benchmarks for more details.

The solution concepts were:

  1. Plot Query Experience versus time to understand when poorer performance is occurring

  2. Plot Workload Volume versus time to understand where query experience is apparently impacted by workload volume

  3. Plot supporting metrics suitable for the system to understand where query experience is apparently impacted by factors other than workload volume

  4. Make changes and reassess via concepts 1 to 3 above
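The first two concepts both start from the same raw data. As a minimal Python sketch, assuming a hypothetical query log of (start time, duration) pairs and an assumed SLA threshold, the primary metrics of query experience and workload volume can be derived per hour like this:

```python
from datetime import datetime

SLA_SECONDS = 60  # assumed SLA threshold for this sketch

# Hypothetical query log: (start time, duration in seconds).
queries = [
    (datetime(2018, 9, 10, 9, 15), 45),   # Monday morning, within SLA
    (datetime(2018, 9, 10, 9, 30), 120),  # Monday morning, breaches SLA
    (datetime(2018, 9, 10, 14, 5), 30),   # Monday afternoon, within SLA
]

def bucket(ts):
    """Bucket a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Primary metrics per hour: workload volume and query experience.
volume = {}
within_sla = {}
for start, duration in queries:
    hour = bucket(start)
    volume[hour] = volume.get(hour, 0) + 1
    within_sla[hour] = within_sla.get(hour, 0) + (duration <= SLA_SECONDS)

# Query experience as the fraction of queries completing within SLA.
experience = {hour: within_sla[hour] / volume[hour] for hour in volume}
```

Plotting `experience` and `volume` against their hour buckets gives the two time-series charts used below.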

The solution in practice …

For this kind of analysis we normally use a time axis because that makes it easier to understand what other types of effects could be occurring. We don't typically plot Query Experience versus Workload Volume directly: being more formal about calculating correlation doesn't normally add value, and it is usually more useful to be able to add supporting metrics to the plot.

Starting with the query experience plot, we see the following features:

  1. Poor experience on Monday morning

  2. Variable experience through the day during the rest of the week, worse in the mornings

  3. Poor experience early Thursday evening

[Chart: query experience versus time over one week]


The next step is to understand this by analysing workload volume, which shows:

  1. Workload volume is much higher on Monday morning, which may explain the poorer performance at that time

  2. Workload varies throughout the day, with the higher workload occurring at times when poorer experience is seen

  3. There is no workload explanation for the poor experience on Thursday evening

[Chart: workload volume versus time over the same week]

After looking at the two primary metrics of user experience and workload volume, we look at secondary metrics to explain the remaining behaviour. Which secondary metric to look at depends on the technology in use. In this illustration the solution is a Postgres appliance, and for these solutions analysing the maintenance job workload (including vacuum and backups) can be useful. The chart below shows that the majority of maintenance jobs are configured to occur when workload is very low, but there is a Thursday evening maintenance job which occurs at the same time as the anomalous poor user experience.
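One way to flag such a coincidence mechanically is to check the maintenance schedule against the windows of poor query experience. A minimal sketch, using hypothetical window data rather than any particular Postgres catalogue view:

```python
from datetime import datetime

# Hypothetical windows of poor query experience, as (start, end) pairs.
poor_experience = [
    (datetime(2018, 9, 13, 18, 0), datetime(2018, 9, 13, 19, 0)),  # Thursday evening
]

# Hypothetical maintenance schedule: (job name, start, end).
maintenance_jobs = [
    ("weekly backup", datetime(2018, 9, 13, 18, 0), datetime(2018, 9, 13, 20, 0)),
    ("vacuum", datetime(2018, 9, 16, 2, 0), datetime(2018, 9, 16, 3, 0)),
]

def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open time intervals intersect."""
    return a_start < b_end and b_start < a_end

# Flag maintenance jobs that coincide with poor user experience.
suspects = [
    name
    for name, m_start, m_end in maintenance_jobs
    if any(overlaps(m_start, m_end, p_start, p_end)
           for p_start, p_end in poor_experience)
]
```

Here the Sunday-night vacuum is ignored because it runs when workload is very low, while the Thursday backup is flagged as a suspect.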

[Chart: maintenance job activity versus time over the same week]

Now consider contrasting this with running exactly the same workload volume on another solution, which differs from the one above as follows:

  1. It does not require maintenance jobs

  2. It allows additional capacity to be provisioned

  3. Its technology is radically different, so comparing performance between the solutions via supporting metrics such as disk bytes read does not give any useful information

Assuming for simplicity of illustration that the new solution otherwise performs the same as the old one, and that the primary metric Workload Volume is unchanged, we can see that the proportion of queries completing within the SLA time improves versus the original system: the poor experience on Monday mornings is addressed by the provisioning of extra capacity, and the Thursday evening anomaly disappears because maintenance jobs are no longer needed:
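The comparison boils down to the proportion of queries completing within the SLA on each solution for the same period. A toy sketch with hypothetical durations for the two solutions:

```python
SLA_SECONDS = 60  # assumed SLA threshold

# Hypothetical query durations (seconds) for the same period on each solution.
old_solution = [45, 120, 30, 200, 50]  # maintenance stalls and capacity limits
new_solution = [45, 55, 30, 58, 50]    # extra capacity, no maintenance jobs

def pct_within_sla(durations, sla=SLA_SECONDS):
    """Fraction of queries completing within the SLA threshold."""
    return sum(d <= sla for d in durations) / len(durations)

improvement = pct_within_sla(new_solution) - pct_within_sla(old_solution)
```

The same calculation per time bucket gives the comparable experience charts for each solution, without needing any technology-specific supporting metric.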

[Chart: query experience versus time on the new solution]

As explained in Structuring Data Warehouse Benchmarks, different solutions can have different supporting metrics. In this example the new solution has a Provisioned Compute Capacity metric that is relevant for explaining the improved performance.

[Chart: Provisioned Compute Capacity versus time on the new solution]

Other Considerations and References...

The above example is deliberately simple, to illustrate the basic principles behind the technique. Techniques for contrasting relative performance with different workloads can be discussed separately. The key benefit of the technique illustrated here is that it focuses on appropriate use of the primary metrics of query experience and workload volume when contrasting systems, while also making use of supporting metrics.

This simple example also shows just one week. Techniques such as taking averages and minimum / maximum values over the same recurring time period can be useful for identifying persistent versus exceptional effects. Techniques for this, including separating short-term effects from longer-term changes, can be discussed separately.
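The averaging idea can be sketched briefly. Assuming hypothetical per-hour experience samples over several weeks, grouping by the recurring (weekday, hour) slot gives mean / min / max per slot; a low mean flags a persistent effect, while a low minimum with a healthy mean flags an exceptional one:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-hour query experience samples over several weeks:
# timestamp -> fraction of queries completing within SLA.
samples = {
    datetime(2018, 9, 10, 9): 0.60,   # Monday 09:00, week 1
    datetime(2018, 9, 17, 9): 0.70,   # Monday 09:00, week 2
    datetime(2018, 9, 24, 9): 0.65,   # Monday 09:00, week 3
    datetime(2018, 9, 13, 18): 0.95,  # Thursday 18:00, week 1
    datetime(2018, 9, 20, 18): 0.40,  # Thursday 18:00, week 2 (exceptional dip)
}

# Group by recurring slot: (weekday, hour), where Monday is weekday 0.
by_slot = defaultdict(list)
for ts, value in samples.items():
    by_slot[(ts.weekday(), ts.hour)].append(value)

summary = {
    slot: {"mean": sum(values) / len(values),
           "min": min(values),
           "max": max(values)}
    for slot, values in by_slot.items()
}
```

In this toy data, Monday 09:00 shows a persistently mediocre mean across all weeks, whereas Thursday 18:00 shows a wide min / max spread pointing at a one-off event.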