Crowdstrike outage was inevitable

The global IT outage was only a matter of time

Friday, 2024-07-19. Large parts of the world come to a screeching halt when Crowdstrike, a "security and endpoint protection solutions company", deploys a broken signature update. 8.5 million Windows machines crash and do not recover.

Crowdstrike took the bullet and the blame, but this could have been any one of the vendors in the same sector. They are all equally bad.

Rootkits, rootkits everywhere

An endpoint detection and response product (EDR, NDR, XDR, or whatever they are called this week) tries to prevent dodgy and/or malicious software from compromising the system. It does this by inspecting, intercepting, blocking, or modifying the low-level library and operating system calls made by all other software running on the system. To do that, it must itself run with elevated privileges and even inside the kernel.

A piece of third-party software, loaded into the kernel, intended to modify what other software on the system can see or do. Back in the 1990s we used to call them rootkits.

The endpoint security products may wear corporate suits and come with exorbitant marketing budgets, but from a technical perspective they are just the same: rootkits.

Poor quality code

The outage from this particular failure was caused by Crowdstrike's agent (running in the kernel) parsing a malformed signature data file ... and promptly crashing. Taking the entire system with it.

On reboot, the agent is one of the first pieces to start, so it gets early access to its data files before anything else comes up. Now it crashes early in the boot sequence, causing a "boot loop".

Reboot. CRASH! Reboot. CRASH!

This should be error handling 101, and it is crystal clear Crowdstrike's agent was not doing it properly. But as I said at the top, this could have been anyone. If this is how the industry giants write kernel drivers, their userspace code is unlikely to be any better.
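Parsing untrusted input defensively is not exotic. A minimal sketch in Python (the file format, magic number, and field layout here are invented for illustration, not Crowdstrike's actual channel file format): every read is bounds-checked, and a malformed file is rejected so the caller can fall back to the last known-good signatures instead of crashing.

```python
import struct

def load_channel_file(data: bytes):
    """Parse a (hypothetical) signature data file defensively.

    Returns the parsed entries, or None if the file is malformed --
    letting the caller keep the previous known-good version
    instead of crashing.
    """
    try:
        if len(data) < 8:
            raise ValueError("truncated header")
        magic, count = struct.unpack_from("<II", data, 0)
        if magic != 0xC0FFEE01:          # hypothetical magic number
            raise ValueError("bad magic")
        entries = []
        offset = 8
        for _ in range(count):
            if offset + 4 > len(data):   # bounds check before every read
                raise ValueError("truncated entry")
            (length,) = struct.unpack_from("<I", data, offset)
            offset += 4
            if offset + length > len(data):
                raise ValueError("entry overruns file")
            entries.append(data[offset:offset + length])
            offset += length
        return entries
    except ValueError:
        return None  # reject the update, keep the old signatures

# A file full of zeroes -- reportedly close to what was shipped --
# is rejected instead of taking the system down.
assert load_channel_file(b"\x00" * 1024) is None
```

The point is not the format; it is the failure mode. Reject-and-continue is table stakes for userspace code, and non-negotiable for anything running in the kernel.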

This is not a new sentiment. A few months back, Dmitri Alperovitch - a founding member of the CSRB, co-founder and former CTO of Crowdstrike - said in a podcast interview that some of the worst code he had ever seen was in security products. Industry professionals have known this for a long time, so it did not come as news... but it was admittedly nice to hear someone finally say the quiet part out loud.

Bad release management practices

Based on the reports so far, the signature data update ("channel file") had passed all of Crowdstrike's build-time checks. The malformed data was inserted into the build artifacts at a later stage, before the now-modified release artifact was finally made available to the installed client base.

How is that an acceptable release mechanism?
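Guarding against exactly this failure is cheap and well understood: record a content hash when an artifact passes validation, and refuse to publish any bytes that no longer match it. A sketch, with hypothetical function names:

```python
import hashlib

def digest(artifact: bytes) -> str:
    """Content hash recorded at the moment the artifact passes validation."""
    return hashlib.sha256(artifact).hexdigest()

def safe_to_publish(artifact: bytes, validated_digest: str) -> bool:
    """Refuse to ship bytes that differ from what was actually validated."""
    return hashlib.sha256(artifact).hexdigest() == validated_digest

validated = b"channel file contents that passed all checks"
stamp = digest(validated)

tampered = validated + b"\x00\x00\x00\x00"   # modified after validation
assert safe_to_publish(validated, stamp)
assert not safe_to_publish(tampered, stamp)
```

Any pipeline where the validated artifact and the shipped artifact can be different byte strings has this check missing by construction.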

As if that wasn't enough, the updated artifacts were made globally available without (apparently) being tested internally first. Even the most basic testing scheme would have caught an entire fleet of machines crashing after the update. So either Crowdstrike did not have a testing regime in place - or they did, and nobody paid attention to what the test systems were showing.

How is that an acceptable release strategy?
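Even a minimal staged rollout would have contained the blast radius. A sketch of the idea - the ring sizes and the `deploy`/`healthy` callables are illustrative assumptions, not any vendor's actual policy:

```python
import random

def staged_rollout(hosts, deploy, healthy, rings=(0.001, 0.01, 0.1, 1.0)):
    """Push an update ring by ring, halting at the first sign of trouble.

    `deploy(host)` installs the update; `healthy(host)` reports whether
    the machine is still up afterwards. Both are supplied by the caller.
    """
    queue = list(hosts)
    random.shuffle(queue)
    rolled_out = 0
    for fraction in rings:
        target = max(1, int(len(hosts) * fraction))
        for host in queue[rolled_out:target]:
            deploy(host)
        rolled_out = target
        if not all(healthy(h) for h in queue[:rolled_out]):
            return ("halted", rolled_out)   # crashing canaries stop the push
    return ("complete", rolled_out)

# With an update that crashes every machine it touches, the rollout
# stops after the first canary ring instead of taking out the fleet.
fleet = [f"host-{i}" for i in range(10_000)]
crashed = set()
status, reached = staged_rollout(fleet, deploy=crashed.add,
                                 healthy=lambda h: h not in crashed)
assert status == "halted" and reached == 10   # 0.1% of 10,000 hosts
```

In the actual incident, all 8.5 million machines received the update in one wave. A first ring of a tenth of a percent would have turned a global outage into an internal incident report.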

And if Crowdstrike, a supposedly industry-leading giant, is doing it this badly - how bad is it with the rest of the industry?

Perverse incentives

Based on the grapevine and the rumour mill (i.e. things I cannot attribute or verify), EDR vendors have contractual SLAs binding them to a guaranteed four-hour turnaround from "new threat detected" to "signature and prevention mechanism deployed to clients".

High pressure. Complex problems. Arbitrarily tight, unrealistic deadlines.

That is a recipe for cutting corners and bypassing checks. If running a pre-release check cycle takes 20 minutes and you are already running out of time, then sure, by all means, YOLO the release. If you breach the SLA, your clients get hefty discounts and service credits - and your boss's boss misses their annual bonus.

What's the worst that could happen?

Regulatory capture

A lesser-known misfeature of the security product vendor race is that once a vendor is big enough and well-known enough, their name gets "somehow" added to security questionnaire templates. These templates are used by regulators, auditors, clients' vendor diligence/assurance teams, and insurers. Get big enough as a vendor, and you get added to a list of pre-approved, known-good providers: a dropdown menu in a spreadsheet, or a radio button in a web form.

These questionnaires are everywhere. Every time you, as the party being questioned, pick the sane option ("other"), you get to explain your reasoning to non-technical, non-security people. It is no wonder that, for a company going through the same dance for the umpteenth time, someone high up the chain will eventually decide it is easier to buy a solution from one of the listed vendors, just to cut down on the time and the headache.

A tacit moat is still a moat.

Disaster recovery through clicky-clicky

As bad as the security product vendors may be, they are not the only ones to drop the ball. This disaster took out more than eight million systems in a couple of hours. The companies impacted will take days, if not weeks, to recover in full.

In this day and age, system provisioning and recovery should be a solved problem. Frequently updated, well-maintained golden images with well-exercised, automated (re)install cycles should be table stakes.

They're not.

Instead we have overburdened IT admin teams who have to go around from machine to machine, clicking buttons in the right order, to get the basic functionality back.

Our industry runs fleets of machines that can do the same thing over and over again, blazingly fast, without ever getting tired. We supposedly thrive on automation. And yet the actual maintenance of these same machines is done without taking advantage of those automation capabilities. Instead of routinely exercised scripts handling the mundane work, we depend on runbooks with screenshots explaining which button to click at each step in the sequence.
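The gap between a runbook and a script is smaller than it looks: the same ordered steps, made idempotent and pointed at a host list. A sketch - the per-step actions are left as hypothetical callables, since they depend entirely on the fleet tooling in use:

```python
def remediate(host, steps):
    """Run an ordered list of recovery steps on one machine, idempotently.

    Each step is (description, action, already_done); `action` and
    `already_done` are callables provided by the fleet tooling --
    hypothetical here, since the point is the shape, not a specific API.
    """
    for description, action, already_done in steps:
        if already_done(host):
            continue          # safe to re-run after a partial recovery
        action(host)

def remediate_fleet(hosts, steps):
    """Apply the runbook to every host; park failures for a human."""
    failures = []
    for host in hosts:
        try:
            remediate(host, steps)
        except Exception:
            failures.append(host)   # keep going, report at the end
    return failures
```

A script like this runs the same way on machine ten thousand as on machine one, at two in the morning, without anyone squinting at a screenshot to find the next button.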

The outro

A global outage caused by a security vendor failure was not an accident. It was an inevitability.

And we're going to see it happen again.