The global IT outage was only a matter of time
Friday, 2024-07-19. Large parts of the world come to a screeching halt
when CrowdStrike, a "security and endpoint protection solutions company",
deploys a broken signature update.
8.5 million Windows machines crash and do not recover.
CrowdStrike took the bullet and the blame, but this could have been
any one of the vendors in the same sector. They are all equally bad.
Rootkits, rootkits everywhere
An endpoint detection and response product (EDR, NDR, XDR, or whatever
they are called this week) tries to prevent dodgy and/or malicious
software from compromising the system. It does this by inspecting,
intercepting, blocking or modifying the low-level library and operating
system calls made by all other software running on the system. To do
that, it must itself run with elevated privileges, and often inside the
kernel itself.
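
To make the idea concrete without writing a kernel driver: the toy
Python sketch below interposes on a call that other code in the same
process makes, so it can be inspected or blocked. It is only an analogy
for the mechanism - real agents do this at the driver level, with very
different machinery - and the "policy" in it is made up.

# Toy user-space analogy of interposition - not kernel code, and not how
# any real EDR agent is implemented. It wraps the process's own open() so
# a made-up policy can inspect or block file access made by other code.
import builtins

_real_open = builtins.open

def guarded_open(path, *args, **kwargs):
    if "secrets" in str(path):                # stand-in for a detection rule
        raise PermissionError(f"blocked by policy: {path}")
    print(f"[toy-edr] open({path!r})")        # "inspect" the call
    return _real_open(path, *args, **kwargs)  # otherwise pass it through

builtins.open = guarded_open   # every later open() in this process goes through us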
A piece of third-party software, loaded into the kernel, intended to
modify what other software on the system can see or do. Back in the
1990s we used to call that a rootkit.
The endpoint security products may wear corporate suits and come with
exorbitant marketing budgets, but from a technical perspective they are
just the same: rootkits.
Poor quality code
This particular outage was caused by CrowdStrike's agent (running in
the kernel) parsing a malformed signature data file ... and promptly
crashing. Taking the entire system with it.
On reboot, the agent is one of the first pieces to start, so it gets
early access to its data files before anything else comes up. Which
means it now crashes early in the boot sequence, causing a "boot loop".
Reboot. CRASH! Reboot. CRASH!
This should be error handling 101, and it is crystal clear CrowdStrike's
agent was not doing it properly. But as I said at the top, this could
have been anyone. If this is how the industry giants write kernel
drivers, their userspace code is unlikely to be any better.
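
For the avoidance of doubt about what "error handling 101" means here:
validate every field of an untrusted data file, and reject the whole
file on any mismatch instead of trusting it and dereferencing garbage.
The sketch below uses an invented toy format - it is not CrowdStrike's
channel-file layout, and the real agent is kernel-mode code rather than
Python - but the principle is exactly this small.

# Minimal sketch of defensive parsing for a data file the agent consumes.
# The format (magic + entry count + fixed-size entries) is invented for
# illustration only. Every field is checked, and a malformed file is
# rejected so the caller can keep its previously loaded rules instead of
# crashing.
import struct

MAGIC = 0x43484E4C              # arbitrary magic value for the toy format
ENTRY_SIZE = 16                 # fixed size of one toy entry, in bytes
HEADER = struct.Struct("<II")   # little-endian: magic, entry_count

def load_channel_file(blob: bytes):
    """Return a list of raw entries, or None if the file is malformed."""
    if len(blob) < HEADER.size:
        return None                          # truncated header
    magic, entry_count = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return None                          # wrong or corrupted file
    # Bounds-check the declared count against the actual size *before*
    # reading any entry, so a bad count can never walk off the buffer.
    if entry_count > (len(blob) - HEADER.size) // ENTRY_SIZE:
        return None
    body = blob[HEADER.size:]
    return [body[i * ENTRY_SIZE:(i + 1) * ENTRY_SIZE] for i in range(entry_count)]

entries = load_channel_file(b"\x00" * 7)     # deliberately malformed input
if entries is None:
    print("malformed channel file - keeping the previous rule set")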
This is not a new sentiment. A few months back, Dmitri Alperovitch -
founding member of the CSRB, co-founder and former CTO of CrowdStrike -
said during a podcast interview that some of the worst code he had ever
seen was in security products. Industry professionals have known this
for a long time, so it didn't come as news... but it was admittedly nice
to hear someone finally say the quiet part out loud.
Bad release management practices
Based on the reports heard so far, the signature data update ("channel
file") had passed all their build-time checks. The malformed data was
inserted into the build artifacts at a later stage, before the
now-modified release artifact was finally made available to their
installed client base.
How is that an acceptable release mechanism?
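
One cheap guard against exactly this failure mode: treat the artifact
about to ship as untrusted until it is proven byte-for-byte identical to
the artifact that passed the checks. A minimal sketch, assuming the
build pipeline records a SHA-256 digest at validation time; the
filenames are placeholders, not anything from CrowdStrike's actual
pipeline.

# Minimal release gate: refuse to publish an artifact whose digest no
# longer matches the digest recorded when it passed build-time checks.
import hashlib
import sys

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def gate(artifact: str, recorded_digest_file: str) -> None:
    with open(recorded_digest_file) as f:
        recorded = f.read().strip()
    actual = sha256_of(artifact)
    if actual != recorded:
        sys.exit(f"REFUSING TO SHIP: {artifact} changed after validation "
                 f"({actual} != {recorded})")
    print(f"{artifact} matches the validated build - OK to publish")

gate("channel-update.bin", "channel-update.bin.sha256")  # placeholder names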
As if that wasn't enough, the updated artifacts were made globally
available without (apparently) testing them internally first. Even the
most basic testing scheme should have caught an entire fleet of
machines crashing after the update. So either CrowdStrike did not have
a testing regime in place - or they did, and nobody paid attention to
what the test systems were showing.
How is that an acceptable release strategy?
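
Even a crude internal canary ring would have caught "the entire fleet
crashes after the update". A sketch of what such a gate can look like;
the hostnames, port and soak time are placeholders, and nothing here
claims to describe CrowdStrike's actual tooling.

# Crude canary gate: push the update to a handful of internal machines
# first, give them time to apply it and reboot, then block the global
# rollout if any of them has gone quiet. Everything below is a placeholder.
import socket
import time

CANARIES = ["canary-01.corp.example", "canary-02.corp.example"]

def is_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Deliberately simple liveness probe: can we still open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def canary_gate(soak_seconds: int = 1800) -> bool:
    # (deploying the update to CANARIES goes here - tooling-specific, omitted)
    time.sleep(soak_seconds)        # let the agents load the new file and reboot
    down = [h for h in CANARIES if not is_reachable(h)]
    if down:
        print(f"ABORT rollout - canaries unreachable after update: {down}")
        return False
    print("Canaries healthy - promote to the next ring.")
    return True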
And if CrowdStrike, a supposedly industry-leading giant, is doing it
this badly - how bad is it with the rest of the industry?
Perverse incentives
Based on the grapevine and the rumour mill (i.e. things I cannot
attribute or verify), EDR vendors have contractual SLAs that bind them
to a guaranteed 4-hour turnaround from "new threat detected" to
"signature and prevention mechanism deployed to clients".
High pressure. Complex problems. Arbitrarily tight, unrealistic
deadlines.
That is a recipe for cutting corners and bypassing checks. If running a
pre-release check cycle takes 20 minutes and you're already running out
of time, then sure, by all means, YOLO the release. If you breach the
SLA, your clients are going to get hefty discounts and service credits -
and your boss's boss will miss a rung on their annual bonus ladder.
What's the worst that could happen?
Regulatory capture
A lesser-known misfeature of the security product vendor race is that
once a vendor is big enough and well-known enough, their name gets
"somehow" added to security questionnaire templates. These templates are
used by various regulators, auditors, clients' vendor
diligence/assurance teams, and insurers. Get big enough as a vendor, and
you get added to a list of pre-approved/known-good providers, to a
dropdown menu in a spreadsheet, or to a radio button in a web form.
These questionnaires are everywhere. Every time you, as a party
getting questioned, pick the sane option ("other"), you get to explain
the reason for doing so to non-technical, non-security people. It is no
wonder that for a company going through the same dance for the umpteenth
time, someone high up in the chain will eventually decide that it's
going to be easier to buy a solution from one of the listed vendors just
to cut down on the time and headache.
A tacit moat is still a moat.
Disaster recovery through clicky-clicky
As bad as the security product vendors may be, they are not the only
ones to drop the ball. This disaster took out more than eight million
systems in a couple of hours. The companies impacted will take days, if
not weeks, to recover in full.
In this day and age, system provisioning and recovery should be a solved
problem. Frequently updated, well-maintained golden images with
well-exercised, automated (re)install cycles should be table stakes.
They're not.
Instead we have overburdened IT admin teams who have to go from machine
to machine, clicking buttons in the right order, to get basic
functionality back.
Our industries are running fleets of machines that are capable of doing
the same thing over and over again, blazingly fast, never getting tired.
We supposedly thrive on automation. And yet the actual maintenance of
these same machines is done without taking advantage of the same
automation capabilities. Instead of routinely used scripts taking care
of the mundane activities, we depend on runbooks with screenshots to
explain which button to click at any given step in the sequence.
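
The gap is not hard to close. A runbook step expressed as a script is
just a loop and an error report; the host list and the remediation
command below are placeholders for whatever your own provisioning
tooling exposes, but the shape is the whole point.

# The runbook-as-a-script shape: run one documented remediation step
# across an inventory and report what failed, instead of clicking
# through it machine by machine. Host names and the command are
# placeholders.
import subprocess

HOSTS = ["ws-001", "ws-002", "ws-003"]
REMEDIATION = "sudo /usr/local/sbin/reimage-from-golden-image"  # placeholder

failures = []
for host in HOSTS:
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", host, REMEDIATION],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        failures.append((host, result.stderr.strip() or "non-zero exit"))

print(f"{len(HOSTS) - len(failures)}/{len(HOSTS)} hosts remediated")
for host, reason in failures:
    print(f"  {host}: {reason}")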
The outro
A global outage thanks to a security vendor failure was not an accident.
It was an inevitability.
And we're going to see it happen again.