Compound Compromises

You may find the following useful if you need to analyse the impact of adding extra features to a solution which is already approaching its performance limits with its current configuration.


We got interested in this problem because we had a solution which was scheduled to be used with higher data volumes than it had been optimally configured for, and then we were asked to add complexity to the data processing as well. We were curious whether simultaneously pushing the limits of the solution in two dimensions would be feasible. We had some easy options to increase the performance, but we wanted to investigate whether making any changes at all was justified.


The solution concept was to visualise the changes to the solution on a two-dimensional continuum:

We knew that using an existing solution with an increase in volume or an increase in complexity would result in some compromises, which might or might not be acceptable. 

For example, an increase in peak data volume without changing the solution might result in some latency compromises. Of course, sometimes it's possible to scale a design from small volumes to huge volumes without changing the approach: there are plenty of standard patterns where you can scale up the processing resources at short notice to deal with spikes in input, and serverless solutions can be very useful for this. Maybe the solution you're dealing with will need you to add some of that; for this discussion we'll assume it isn't perfectly scalable yet.

Complexity behaviour compromises are a bit more challenging to generalise, but an example could be dealing with unexpected data. If this happens only occasionally, an acceptable compromise could be stopping the processing and applying a manual fix, whereas if there's a lot of unexpected data then the system must handle it automatically.

We wanted to know how to treat the combination of volume and complexity compromises. Could we have both? Could we just add them together?

The solution in practice

Analysing our particular problem, we looked at the volume performance compromises we were making, and these boiled down to the fact that if the data arrived in spikes, the system could violate its latency SLAs, as it would be running at 100% of capacity until it cleared the backlog. We'll illustrate this below with a generic model in which arbitrary input units need to be processed within 10 minutes.

If we have a steady input of 60 units per hour spread evenly throughout the hour, then as long as our capacity is at or above 60 units per hour we'll always process our inputs within 10 minutes and we'll never have any latency violations:
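
To make the model concrete, here's a minimal sketch of it as a minute-by-minute queue simulation in Python. The function name and the fluid-style approximation are our own illustration, not part of the actual solution; it just tracks the backlog and estimates how long the most recently arrived units wait.

```python
def worst_latency(arrivals_per_minute, capacity_per_hour):
    """Worst wait (in minutes) before any unit's backlog is cleared,
    using a simple fluid approximation of the queue."""
    capacity_per_minute = capacity_per_hour / 60.0
    backlog = 0.0
    worst = 0.0
    for arrived in arrivals_per_minute:
        backlog += arrived
        # A unit arriving this minute waits until everything queued
        # ahead of it (including itself) has been processed.
        worst = max(worst, backlog / capacity_per_minute)
        backlog = max(0.0, backlog - capacity_per_minute)
    return worst

# 60 units/hour arriving evenly, capacity 60 units/hour:
steady = [1] * 60
print(worst_latency(steady, 60))  # 1.0 -- comfortably inside the 10-minute SLA
```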

Of course we aren’t so naive as to think that the input will be completely evenly distributed throughout the day, so we specify double the capacity, so that if we get 20 minutes’ worth of input in the first 10 minutes and then nothing, we can still keep up:
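
Reusing the worst_latency() sketch from above, the front-loaded case looks like this (the specific arrival shape is just our illustration):

```python
# 20 minutes' worth of input (20 units) arriving in the first 10 minutes,
# then nothing, against double capacity (120 units/hour = 2 per minute):
front_loaded = [2] * 10 + [0] * 50
print(worst_latency(front_loaded, 120))  # 1.0 -- still within the 10-minute SLA
```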

However, even though we have a steady state capacity of 120 units per hour and the total input is only 60 units per hour, it doesn’t take much of a relative spike to cause a latency violation, for example:
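One way to see this with the same sketch is to keep the hourly total at 60 units but let a chunk of them land in a single minute (this particular spike shape is our own illustration, not the actual traffic profile we were analysing):

```python
# Same hourly total (60 units), but 25 of them arrive in a single minute.
# With 120 units/hour capacity the backlog takes 12.5 minutes to clear,
# so the spike breaches the 10-minute SLA even at 50% average utilisation.
spiky = [25] + [1] * 35 + [0] * 24
print(worst_latency(spiky, 120))  # 12.5
```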

It might be that a violation of a few minutes a few times a day is an acceptable compromise. It absolutely wouldn’t be for some kinds of systems, but for others maybe we’re analysing a theoretical peak traffic value that we’re sceptical about versus a latency requirement that someone invented without real justification and the client would rather we just get on with it.

Given we’ve got double the capacity versus the average input, let’s think about adding some extra complexity to the system. If we imagine this would halve the effective capacity of the system (our real-world example actually reduced it by more), then we go from violations of a few minutes a few times a day to violations affecting most of the traffic:
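
Running the same illustrative spiky hour through the sketch with the halved capacity shows the difference:

```python
# Added complexity halves the effective capacity to 60 units/hour.
# The backlog now drains no faster than new units arrive, so it sits at
# roughly 25 minutes' worth of work for the whole burst window and most
# of the hour's traffic breaches the 10-minute SLA.
print(worst_latency(spiky, 60))  # 25.0
```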

If the traffic becomes more unevenly distributed, the effect of the reduced capacity is even clearer:
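
For example, if the whole hour's input arrives in its first 10 minutes (again, an illustrative shape rather than our real traffic), the gap between full and halved capacity widens considerably:

```python
# All 60 units arrive in the first 10 minutes of the hour.
bursty = [6] * 10 + [0] * 50
print(worst_latency(bursty, 120))  # 21.0 -- already a violation at full capacity
print(worst_latency(bursty, 60))   # 51.0 -- far worse at the halved capacity
```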

If we had not been near the volume performance limit of the design, we could have added the extra complexity and there would have been no problem, but because we were already looking at volume performance compromises, we couldn’t afford it. 

For this type of solution, if we’re already in the volume performance compromises region and we need to satisfy increased processing complexity requirements, we can generalise: we must either increase the volume capacity, or change the way the system works so that the increased complexity is handled within uncompromised system parameters: