This article about the recent S3 slowdown and recovery notes that AWS originally pursued the wrong root cause. There's always a risk of this happening.
AWS are famous for automation in their deployment processes. For environments where there isn't a formal and automated mechanism for making urgent changes, we can get problems like in the flowchart below.
We can expect to pursue the wrong root cause sometimes. When we do, we can very quickly end up making changes that don't fix the issue and can even make it worse. If we don't have automated mechanisms for making and reverting changes, we can then cause further problems when we try to undo the changes we made.
We're in a much better situation if we have a formal and fully automated mechanism for making changes and reverting them. We've used the git flow hotfix mechanism for this extensively in a data warehouse context, as it is convenient and easy to understand even when reverting changes, and the production changes are automatically made available in development environments.
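To make the hotfix mechanism concrete, here is a minimal sketch of the git commands a git flow style hotfix and its revert typically involve. The function names `hotfix_plan` and `revert_plan` are hypothetical, the branch names `master` and `develop` are assumed defaults, and the functions only build the command lists rather than running anything.

```python
from typing import List


def hotfix_plan(name: str, tag: str,
                production: str = "master",
                development: str = "develop") -> List[List[str]]:
    """Return, in order, the git commands for a git flow style hotfix.

    Branch names are assumptions; adjust them to match your repository.
    """
    branch = f"hotfix/{name}"
    return [
        # Start the hotfix from the production branch.
        ["git", "checkout", "-b", branch, production],
        # (The fix itself is committed on the hotfix branch at this point.)
        # Merge the fix into production and tag the release.
        ["git", "checkout", production],
        ["git", "merge", "--no-ff", branch],
        ["git", "tag", "-a", tag, "-m", f"Hotfix {name}"],
        # Merge the same fix into development so it is not lost.
        ["git", "checkout", development],
        ["git", "merge", "--no-ff", branch],
        # Clean up the hotfix branch.
        ["git", "branch", "-d", branch],
    ]


def revert_plan(merge_commit: str,
                production: str = "master") -> List[List[str]]:
    """Return the git commands that undo a hotfix merge on production."""
    return [
        ["git", "checkout", production],
        # -m 1 keeps the production side of the merge as the mainline.
        ["git", "revert", "--no-edit", "-m", "1", merge_commit],
    ]


if __name__ == "__main__":
    for command in hotfix_plan("fix-load", "1.0.1"):
        print(" ".join(command))
```

Because both the merge and the revert are ordinary commits, the repository history doubles as a record of what was changed and when. In practice the revert would also need to reach the development branch, which this sketch leaves out.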
It would be ideal if we never had to make urgent production changes, or if, when we did, we made the correct change the first time. But sometimes we'll need to make changes and won't get them right the first time, so we benefit greatly from automated and revertible processes for making changes that provide a full audit trail.
We are then able to learn from the changes that we had to make urgently. By understanding exactly what they were, we can understand what caused the issue and improve our processes in future.
If you're wondering where testing fits into all of this: with automated deployment mechanisms, it's faster and more consistent than with manual processes to make a change in a test environment, run specific tests, and deploy that exact same change to production. We can make use of that in the normal development cycle and also when making urgent changes.
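The test-then-promote flow described above could be sketched roughly as follows. Everything here is hypothetical: `promote`, and the `deploy`, `revert` and `run_tests` callables passed into it, stand in for whatever your own deployment tooling provides.

```python
from typing import Callable, List, Tuple


def promote(change_id: str,
            deploy: Callable[[str, str], None],
            revert: Callable[[str, str], None],
            run_tests: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Deploy a change to test and, only if tests pass, to production.

    Returns whether the change reached production, plus an audit trail.
    """
    audit: List[str] = []

    deploy("test", change_id)
    audit.append(f"deployed {change_id} to test")

    if not run_tests("test"):
        # Undo the change in test and stop before production is touched.
        revert("test", change_id)
        audit.append(f"tests failed; reverted {change_id} in test")
        return False, audit

    audit.append(f"tests passed for {change_id} in test")

    # The identical change (same identifier) goes to production.
    deploy("production", change_id)
    audit.append(f"deployed {change_id} to production")
    return True, audit
```

The point of the design is that production only ever receives a change that has already been applied and tested elsewhere by the same mechanism, and that every step leaves a record.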
The theme of this post is pragmatism: sometimes things are going to go wrong, and if they can go wrong for AWS then they can go wrong for us too, so we need low-risk, auditable mechanisms for solving these problems. Now that DevOps techniques are no longer just for the pioneers, but are maturing and becoming suitable for the settlers (see Simon Wardley's PST post), it's time for a lot more organisations to benefit from this more advanced architecture.