CrowdStrike Strikes Crowd
2024-07-20
By now you will have heard about the massive BSOD-Fest that temporarily crippled much of the world’s business and, to a lesser extent, government computer systems.
Just in case you were locked in the executive washroom and missed the details, Fireship has a witty summary.
It’s a little bit funny, in a gallows-humor way, when a company that uses militaristic language to sell to airlines and governments – Strike, Falcon! – ends up showing off the fragility of our national infrastructure. At least this time, you have not been eaten by a GRU.
Those of us who’ve worked in the industry were at first surprised that the bosses had gone for such an obvious single point of failure, then – curse you, experience! – not surprised at all.
Even so, we were puzzled that nobody had talked the C-Level into the minuscule engineering spend required to at least test the updates before rolling them out, and maybe consider not rolling them out all at once.
The nerdiverse was abuzz: do the airlines really not have staging?!
No, sadly, they don’t: staging was taken out back during the Great QA Purge, never to be seen again, but worry not: now everyone does DevOps and the quarterly numbers were made!
Polishing Your Attack Surface
At least among actual tech people, a lot of fingers are pointing at the prevalence of the “security audit,” in which a third party (often a government agency) comes with a checklist of “best practices” (some more best, some less worst) and requires your company/agency to tick the expected boxes. If you don’t, you will not have any cake! And since the cake is often made from juicy taxpayer dollars, you really, really want some.
This creates a perverse incentive market for some other third party to offer a solution of some kind that ticks a box for you, or ideally multiple boxes at once. Provided all these solutions add up to less than $CAKE, why not? You get the boxes ticked, and you save yourself the trouble of properly understanding the issue. The people who showed up with the checklist don’t understand it either! Nor do their bosses!
If things go south, you will be in good company: nobody was ever (yet) fired for buying CrowdStrike! You have your third party ready to take the blame, fall on the sword… oh wait, no, not actually fall on the sword. But take the rhetorical blame, oh yes indeed: George Kurtz is deeply sorry.
It’s perhaps too easy to just blame executive sloth and magical thinking, and/or the Administrative Deep State or whatever it’s called now. Because there’s one other thing you need to have done before lying down on the tracks for Mr. Kurtz’s null-pointer locomotive.
You had to ignore all the engineers, former engineers, people who understand computers, and random critical thinkers who tried to explain to you what a terrible idea this was. That, friend, takes a culture.
Fragility is a Choice
Electing Peter Thiel Vice President is not gonna fix this. Nor will more stealth protectionism from Margrethe Vestager – a monocrop of SUSE is not the answer, and folks like SAP are in the same game as CrowdStrike.
The rot runs deeper: this is a direct result of the many, many people who knew better having had the give-a-shit beaten out of them over long years of stock-price maximalism and bureaucratic self-love.
The good news? You can be the change you want to see in the org chart!
Here’s my eight-step path to not getting CrowdStruck (follow me for more life advice):
- No systems monoculture, everything runs on a mix of operating systems.
- That goes for clouds too.
- Everything Windows is inside a VM whose host can automatically roll it back.
- Threat models include your vendors.
- Rolling updates, and don’t roll too fast.
- Stress-test everything first, i.e. bring back Staging.
- Have disaster-recovery protocols, and practice them in real life.
- No updates on Friday!
But really, you already knew this, or someone you work with did.
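For anyone who wants it spelled out, here is a minimal sketch of three items from the list: rolling updates, stopping when a wave looks unhealthy, and no updates on Friday. Everything in it is made up for illustration: the names `plan_waves`, `roll_out` and `WAVE_FRACTIONS`, the wave sizes, and the `apply_update` and `is_healthy` callbacks, which stand in for whatever your deployment tooling actually does. It is not how any particular vendor ships updates.

```python
from datetime import date

# Hypothetical wave sizes: cumulative fraction of the fleet updated after each wave.
WAVE_FRACTIONS = [0.01, 0.10, 0.50, 1.00]


def is_friday(day: date) -> bool:
    # date.weekday() counts from Monday == 0, so Friday == 4.
    return day.weekday() == 4


def plan_waves(hosts, fractions=WAVE_FRACTIONS):
    """Split hosts into successive waves; each host lands in exactly one wave."""
    waves, done = [], 0
    for fraction in fractions:
        # Every wave takes at least one new host, never more than remain.
        target = max(done + 1, round(len(hosts) * fraction))
        target = min(target, len(hosts))
        if target > done:
            waves.append(hosts[done:target])
            done = target
    return waves


def roll_out(hosts, apply_update, is_healthy, today):
    """Update wave by wave; halt after the first wave with an unhealthy host.

    Returns (updated_hosts, ok). When ok is False, the caller is expected
    to roll back updated_hosts.
    """
    if is_friday(today):
        raise RuntimeError("No updates on Friday!")
    updated = []
    for wave in plan_waves(hosts):
        for host in wave:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in wave):
            return updated, False
    return updated, True
```

With 200 hosts and these fractions, the waves are 2, 18, 80 and 100 machines, so a bad update caught in the first wave costs you two boxes instead of the whole fleet.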
The two-step version for the CTO is dirt simple:
- Keep your contentious nerds close.
- Listen to what they say.
Every level of every organization has its own culture. The CrowdStrike-Microsoft debacle required several of them to fail at once, but slowly, over time and sales calls.
Do I think it’s really going to change? Nah, not until the same thinking gives us a proper disaster with explosions and millions dead and maybe a TikTok outage and some zombies – the Agile kind, natch!
But at least some of us can learn from this mess and do our best to prevent the same from happening where we work. Building the future is all well and good, but building a future that doesn’t fall apart on a minor C++ bug? Winning! 🚀
(Until then, please Mr Kurtz sir no more updates until after my next flight please thank you!)