Things go wrong all the time. When they do, the problems can be widespread or localized, and the solutions can be easy or hard.
This week's CrowdStrike debacle appears to be widespread, and its solution hard. And there are lessons in that.
According to current reports, the problem is a single line of code that has bricked Windows machines around the world – as many as a billion, according to some accounts – and that can't be fixed the way it was broken. With the computers blue-screened, you can't just push out another update and fix everything. At present they often have to be booted in safe mode and the offending file deleted by hand. Repeated rebooting sometimes lets the fix that CrowdStrike came up with install and run, but that's also a retail solution, not a wholesale one. A billion computers is a lot of rebooting.
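For the curious, the widely reported manual workaround amounted to booting into Safe Mode and deleting the faulty CrowdStrike channel file from the Windows drivers folder. Here's a rough sketch of that file-deletion step in Python, with the path and filename pattern taken from public reports at the time; treat it as an illustration of why this was a one-machine-at-a-time fix, not as official remediation guidance.

```python
# Rough illustration of the reported manual workaround: from Safe Mode,
# remove the faulty CrowdStrike channel file(s) so Windows can boot normally.
# The path and "C-00000291*.sys" pattern are as publicly reported; this is a
# hypothetical sketch, not official CrowdStrike remediation guidance.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

for path in glob.glob(os.path.join(DRIVER_DIR, "C-00000291*.sys")):
    print(f"Deleting {path}")
    os.remove(path)  # requires administrator rights; the machine then gets rebooted
```

The point isn't the script; it's that someone has to be standing at each machine, in Safe Mode, to do anything like it.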
The outage has brought down airlines, medical centers, the London Stock Exchange, and quite a few other institutions. I went for some routine bloodwork this morning, slightly grumpy because I had to fast until after the draw. When I got there I was told that they couldn't do anything. Apparently the computer tells the (very competent) phlebotomists which tubes to fill, how to label them, etc., and without the system they can't do it. There's nobody to call, and nowhere to look up what to do. All the intelligence has been moved into the system and away from the workers, which is fine as long as the system works.
Oops. (My fasting blood draw having failed, I went next door and got an Egg McMuffin, out of spite. Well, and to keep my blood sugar from making me grumpier.)
There are lots of lessons here. One is that Windows should be harder to kill, but we've known that for a while. Another is that maybe it's a mistake to have a billion PCs using the same software that can cripple them.
And yet another is that we shouldn't depend so much on informatics, which are easily hacked and, worse yet, often fail on their own – as here – without any destructive human agency involved at all. Back when I was studying computer programming, a teacher said that if buildings were constructed like software, a single woodpecker could level an entire city. Still true!
I'm a big fan of Resilience Engineering – an approach in which systems are designed to degrade gracefully under pressure rather than fail catastrophically. Sadly, brittle systems are usually cheaper, and the people who budget and build them generally don't plan enough for disaster. You might see some degree of government regulation helping – some localities prone to power interruptions from things like hurricanes require gas stations to have backup generators to power the pumps – but I can't imagine a government regulation that would deal with something like this CrowdStrike failure. (Except maybe a limit on such companies' ability to contract away liability for failures.)
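For readers who want the flavor of what "degrade gracefully" means in practice, here's a minimal sketch, assuming a hypothetical fetch_live_price service and a local cache: when the upstream service is down, the system serves a stale-but-usable answer instead of falling over.

```python
# Minimal sketch of graceful degradation: fall back to a cached value when
# the live service is unavailable, instead of failing the whole workflow.
# fetch_live_price and the cache are hypothetical names for illustration.
import time

_cache = {}  # item_id -> (price, timestamp of last successful fetch)

def fetch_live_price(item_id: str) -> float:
    """Stand-in for a call to a live pricing service; here it simulates an outage."""
    raise ConnectionError("upstream service unavailable")

def get_price(item_id: str) -> float:
    try:
        price = fetch_live_price(item_id)
        _cache[item_id] = (price, time.time())  # refresh the cache on success
        return price
    except ConnectionError:
        if item_id in _cache:
            stale_price, _ = _cache[item_id]
            return stale_price  # degraded service: stale data beats no data
        raise  # nothing cached; surface the failure
```

The brittle version skips the except branch entirely. It's cheaper to write, right up until the day the upstream service goes down.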
But then there's individual resilience, which is something we can do more about.
On social media I see people stranded in Paris with no working credit cards and dead ATMs, and that leads to another important lesson: Always carry cash! When traveling, I generally carry enough cash to get me through at least a couple of days (often more), and even at home I keep some cash on hand in case things don't work right.
Back during the 2003 New York blackout, Amy Langfield wrote about the value of keeping a stash of small bills that she could use at the bodegas when the credit/debit card machines were down. The cashless society depends on the flawless functioning of networks that aren't really secure or reliable. Cash carries its own information with it – a $20 bill is worth $20 – and you don't need to know anything more to spend or accept it. That's resilient. The same goes for making sure you have plenty of cushion in your supplies of medication, food at home, and the like.
Well, this debacle looks more like an inconvenience than a catastrophe, though I'm sure lives are being lost to delayed or canceled medical procedures, or to people choosing to drive long distances rather than fly. But it should be a wake-up call.
For the big tech companies, and for us all.
[As always, if you like these essays, please sign up for a paid subscription. I’ll thank you, and my family will thank you.]
I wrote code in assembly language for an IBM 360 with a HASP operating system. We took pride in crafting efficient and robust operating systems. Those days are long gone. You don't need nukes, chemical or biological weapons to defeat the free world; you need a few lines of malicious code and enough explosives to knock out a few transmission towers. The rest would take care of itself.
I work in the security industry, and we have a bunch of customers badly impacted by this. The upside, I was able to tell them, is that the computers that CrowdStrike (which is security software itself) made unusable are also not vulnerable to attack right now. I'm sure that is very small consolation to all the businesses losing exorbitant amounts of money to fix CrowdStrike's screw-up.
Your points about resilience are good ones. As a long-time security professional, I can tell you that resilience is one of the most overlooked aspects of securing people and systems. It's difficult and it costs more, at least until the problem occurs and it turns out it would have been cheaper to build in the resilience ahead of time.