Part of the Crowd
While we were discussing the recent Crowdstrike incident, we touched on the topic of what it would take to be immune to such problems. I’ll give you the TL;DR — it’s hard.
When we construct complex sites, we often bake resilience into each component’s deployment. Two servers, two containers, even running serverless code across multiple locations. We have odds and evens, A and B, or however you describe it. We can lose one server to random problems and our service continues.
But the fix to the Crowdstrike problem, and its many and diverse bedfellows, is not about creating resilience, but about avoiding homogeneity. There’s massive benefits to choosing one operating system, one security suite, one update mechanism, one automation method, to run across your whole site. Even if you run in containers, you’ll probably choose to prioritise one base image.
Choosing several means multiple skill sets just to deliver one single aspect of your estate. It means splitting various core functions across competing systems - no single monitoring dashboard, no single security dashboard. It means every deployment comes with two rollout plans - one for the A stack, one for the B stack. This is twice the opportunity for failure. It requires more testing. It means you need both stacks in test environments - you can’t halve your non-prod stack for cost savings.
That is quite a price to pay for some protection against a class of fundamentally rare occurrences. Vulnerability to a single vendor, like Crowdstrike, is part of the choices made when you create your environment - amid considerations of efficiency, skill sets and overall sanity. That choice - to trust selected vendors - is one made practically everyone makes, and by and large it is a rational one.
A retired Microsoft engineer’s simple explanation of what happened covers the blurred line between signed drivers and threat data, and how Crowdstrike themselves bear responsibility for quality-checking their product.
So we struggled to fault the various industries struck by outages. They balanced reliability against… well, reliability, and decided that trying to prevent occasional catastrophes isn’t worth making everything else worse, and more expensive. And who wants that?