
Crowdstrike outage was inevitable

The global IT outage was only a matter of time

Friday, 2024-07-19. Large parts of the world come to a screeching halt when Crowdstrike, a "security and endpoint protection solutions company", deploys a broken signature update. 8.5 million Windows machines crash and do not recover.

Crowdstrike caught the bullet and the blame, but this could have been any one of the vendors in the same sector. They are all equally bad.

Rootkits, rootkits everywhere

An endpoint detection and response product (EDR, NDR, XDR, or whatever they are called this week) tries to prevent dodgy and/or malicious software compromising the system. It does this by inspecting, intercepting, blocking or modifying the low-level library and operating system calls made by all other software running on the system. To do that, it must itself run with elevated privileges and even inside the kernel.

A piece of third-party software, loaded into the kernel, intended to modify what other software on the system can see or do. Back in the 1990s we used to call them rootkits.

The endpoint security products may wear corporate suits and come with exorbitant marketing budgets, but from a technical perspective they are just the same: rootkits.

Poor quality code

The outage from this particular failure was caused by Crowdstrike's agent (running in the kernel) parsing a malformed signature data file ... and promptly crashing. Taking the entire system with it.

On reboot, the agent is one of the first pieces to start, so it gets early access to its data files before anything else comes up. So now it crashes early in the boot sequence, causing a "boot loop".

Reboot. CRASH! Reboot. CRASH!

This should be error handling 101, and it is crystal clear Crowdstrike's agent was not doing it properly. But as I said at the top, this could have been anyone. If this is how the industry giants write kernel drivers, their userspace code is unlikely to be any better.
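To make "error handling 101" concrete, here is a minimal sketch of a parser that refuses malformed input instead of falling over. The file layout (a magic header, a record count, fixed-size records) and all the names are made up for this example; they have nothing to do with Crowdstrike's actual channel file format. The only point is that every length is checked before anything is read, and a bad update means "keep the last known good data", not "crash".

```python
import struct

# Hypothetical file layout, invented for this sketch.
MAGIC = b"SIG1"
HEADER = struct.Struct("<4sI")    # magic, record count
RECORD = struct.Struct("<I16s")   # signature id, 16-byte pattern


class MalformedSignatureFile(ValueError):
    """Raised when a signature file fails validation."""


def parse_signatures(blob: bytes) -> list[tuple[int, bytes]]:
    """Parse a signature blob, validating sizes before reading anything."""
    if len(blob) < HEADER.size:
        raise MalformedSignatureFile("file shorter than header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise MalformedSignatureFile("bad magic")
    expected = HEADER.size + count * RECORD.size
    if len(blob) != expected:
        raise MalformedSignatureFile(
            f"expected {expected} bytes for {count} records, got {len(blob)}"
        )
    return [
        RECORD.unpack_from(blob, HEADER.size + i * RECORD.size)
        for i in range(count)
    ]


def load_update(blob: bytes, last_known_good: list[tuple[int, bytes]]):
    """Return the new signatures, or fall back to the previous set."""
    try:
        return parse_signatures(blob)
    except MalformedSignatureFile:
        # Reject the update and keep running on the old data.
        return last_known_good
```

A truncated or zero-filled file ends up in the except branch, and the machine keeps booting.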

This is not a new sentiment. A few months back, Dmitri Alperovitch - founding member of CSRB, co-founder and former CTO of Crowdstrike - said during a podcast interview that some of the worst code he had ever seen was in security products. Industry professionals have known this for a long time, so it didn't come as news... but it was admittedly nice to hear someone finally say the quiet part out loud.

Bad release management practices

Based on the reports heard so far, the signature data update ("channel file") had passed all their build time checks. The malformed data was inserted into the build artifacts at a later stage, before the now-modified release artifact was finally made available to their installed client base.

How is that an acceptable release mechanism?
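For contrast, a sketch of the bare minimum: record a digest of the artifact at the moment it passes its checks, and refuse to publish anything that does not match it byte for byte. The function names are hypothetical and this is not a description of anyone's real pipeline, just the general idea that "what was tested" and "what was shipped" must be the same bytes.

```python
import hashlib
import hmac


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of a release artifact."""
    return hashlib.sha256(artifact).hexdigest()


def safe_to_publish(artifact: bytes, tested_digest: str) -> bool:
    """True only if the artifact is byte-for-byte what the checks saw."""
    return hmac.compare_digest(digest(artifact), tested_digest)
```

Anything that touches the artifact after the checks changes the digest, and the publish step fails loudly instead of shipping a surprise.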

As if that wasn't enough, the updated artifacts were made globally available without (apparently) testing them internally first. Even the most basic testing scheme should have caught an entire fleet of machines crashing after update. So either Crowdstrike did not have a testing regime in place - or they did, and nobody paid attention to what the test systems were showing.

How is that an acceptable release strategy?
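A staged rollout does not need to be clever to be useful. The sketch below pushes an update ring by ring (internal canaries first, everyone else later) and stops the moment a ring exceeds its failure budget. The ring layout and the `deploy_and_check` callback are assumptions made for illustration; a real system would also need health checks that survive the host crashing.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy ring by ring; stop after the first ring over its failure budget.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            updated.append(host)
            if not deploy_and_check(host):
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt: later rings never see the broken update
    return updated
```

With a canary ring of two machines, a broken update takes out two machines. Not 8.5 million.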

And if Crowdstrike, a supposedly industry-leading giant, is doing it this badly - how bad is it with the rest of the industry?

Perverse incentives

Based on the grapevine and the rumour mill (i.e. things I cannot attribute or verify), EDR vendors have contractual SLAs that bind them to a guaranteed 4-hour turnaround from "new threat detected" to "signature and prevention mechanism deployed to clients".

High pressure. Complex problems. Arbitrarily tight, unrealistic deadlines.

That is a recipe for cutting corners and bypassing checks. If running a pre-release check cycle takes 20 minutes and you're already running out of time, then sure, by all means, YOLO the release. If you breach the SLA, your clients are going to get hefty discounts and service credits - and your boss's boss will miss their annual bonus ladder.

What's the worst that could happen?

Regulatory capture

A lesser known misfeature of the security product vendor race is that once a vendor is big enough and well-known enough, their name gets "somehow" added to security questionnaire templates. These templates are used by various regulators, auditors, clients' vendor diligence/assurance teams, and insurers. Get big enough as a vendor, and you get added to a list of pre-approved/known-good providers, into a dropdown menu in a spreadsheet, or a radio button menu in a web form.

These questionnaires are everywhere. Every time you, as a party getting questioned, pick the sane option ("other"), you get to explain the reason for doing so to non-technical, non-security people. It is no wonder that for a company going through the same dance for the umpteenth time, someone high up in the chain will eventually decide that it's going to be easier to buy a solution from one of the listed vendors just to cut down on the time and headache.

A tacit moat is still a moat.

Disaster recovery through clicky-clicky

As bad as the security product vendors may be, they are not the only ones to drop the ball. This disaster took out more than eight million systems in a couple of hours. The companies impacted will take days, if not weeks, to recover in full.

In this day and age, system provisioning and recovery should be a solved problem. Frequently updated, well maintained golden images with well exercised, automated (re)install cycles should be table stakes.

They're not.

Instead we have overburdened IT admin teams who have to go around from machine to machine, clicking buttons in the right order, to get the basic functionality back.

Our industries are running fleets of machines that are capable of doing the same thing over and over again, blazingly fast, never getting tired. We supposedly thrive on automation. And yet the actual maintenance of these same machines is done without taking advantage of the same automation capabilities. Instead of routinely used scripts taking care of the mundane activities, we depend on runbooks with screenshots to explain which button to click at any given step in the sequence.
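A runbook is just an ordered list of steps, which is to say, a script waiting to happen. As a rough sketch: the steps become functions, and the fleet gets worked through in parallel instead of one pair of hands at a time. The step names and functions here are placeholders I made up; what a real "reimage" or "verify" step does depends entirely on your environment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Sequence

Step = tuple[str, Callable[[str], None]]  # (step name, action taking a hostname)


def run_runbook(
    hosts: Iterable[str],
    steps: Sequence[Step],
    max_workers: int = 32,
) -> dict[str, str]:
    """Run every step, in order, on every host; report per-host outcome."""

    def recover(host: str) -> tuple[str, str]:
        for name, step in steps:
            try:
                step(host)
            except Exception as exc:
                # Stop at the first failing step for this host only.
                return host, f"failed at {name}: {exc}"
        return host, "ok"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(recover, hosts))
```

The output is a per-host status report, so the humans only need to walk over to the machines that actually failed.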

The outro

A global outage thanks to security vendor failure was not an accident. It was an inevitability.

And we're going to see it happen again.

In the wake of the xz project compromise...

OSS project governance demands, distilled

Hi, I'm from Entitled Inc.

You know that unpaid labour of yours, which we benefit from? You should do more of it. Oh, and you should add all this red tape so that we don't have to do anything ourselves.

We're still not going to pay you.

We also demand that you sign our Modern Slavery statement.

Life of success

Three steps to a life of success

  1. Be born to the right parents
  2. Nepo
  3. Coast

Collected scribblings

Things written in times past

I used to write things for a previous employer's tech blog, but the old URLs may succumb to bitrot. These should work as long as Medium works:

The challenges of running a betting exchange (2016)

Notes on interviewing engineers (2016) -- This one was also picked up by a recruiter's blog.

DevOps is culture, not a title prefix (2017)

Security and Devops - a natural fit (2017)

Wait, what is my fleet doing (2018)

Hey, guess what? Your passwords have been compromised

Shields up on user information (2019)

Audits explained

A pentest is like going to the GP for a check-up. An audit is like having a month-long colonoscopy.

from copilot import vulnerabilities

Not your grandfather's MVC

This was perfectly predictable. Copilot generates insecure code, as expected.

Machine Learning, the magic pixie dust of the past decade, is all about volume. And writing secure code is harder than writing insecure code. So by sheer volume there will be a lot more insecure code around.

Given that a lot of code in the wild is essentially a minimum viable copypaste from the highest scoring answer on StackOverflow, training the code generator model has obviously consumed a lot of insecure code. Since SO rewards speed, the answers that take the least time to write will receive the most points.

Writing secure code takes more time and more space - so by the time someone submits an answer that considers security aspects, the person asking the question has already accepted (and run with) the first and shortest working answer instead.

StackOverflow has redefined the MVC programming model. It now stands for Minimum Viable Copypaste.

Getting home for plague'mas

Be honest

Planning to travel to visit your family in these plague-ridden times, you're really saying:

"I miss my family so much they will see me if it's the last thing they do."

It's that time of the year again

Page from an undated journal

Daddy's crying. Mommy has a black eye. Little sister's hiding under her bed.

Yep, it's Christmas.

Perception is almost everything

Misunderstood Indexing

Holding the number-one spot on the Corruption Perceptions Index tells you nothing about how well a country is doing.

It merely highlights how bad the situation is even for the runner-up.

Dear online surveillance addicts

Ground rules for acceptable ads online

The following is an edit of a piece originally written in November 2015.

The pinnacle of non-intrusive online ads was the original Google search ads. They were out of the way, clearly marked as ads - and hence could be visually filtered out. They were pure text, so could be neatly included as elements on the rendered page. And they were always targeting an INTEREST. Not an individual.

I will take that as the minimum acceptable advertising behaviour. I'm not implying it's perfect, but at least we set a clear set of ground rules. With that in mind, my ideal, non-intrusive ads mechanism builds on the following rules:

  • Ads must never be inline to page content.
  • Even when clearly out of the way, ads must not be allowed to mimic page content; they must be clearly marked as ads.
  • Text only.
  • I might accept an image within the ad, provided it was always served from the content provider's system.
  • As an extension to the previous point: if the served image size would exceed a notable fraction of the page size, it must not be included in the output.
  • No user tracking of any kind.
  • No third-party javascript. Ever.
  • At most 15% of display real estate allowed to be used by ads. Including the padding in the UI. (It all counts as space denied from content.)
  • Not allowed to affect page content load times. Ad material must be included at the end of the page code. If your service pushes ads from an internal, separate system, hard timeouts must be imposed: if the internal system cannot serve an ad within the allotted time, the frontend must never be forced to wait. You just missed an ad impression. Tough.
  • If clicking an ad takes a user through a bounce page, all identifiable information from the user must be stripped. Bounce page or redirect must not impose any further page loading delay.
  • No beacons.
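The hard timeout rule is the easiest one to show in code. A minimal sketch, assuming an async frontend and a hypothetical `fetch_ad` coroutine that talks to the internal ad system; the 50 ms budget is an arbitrary number picked for illustration.

```python
import asyncio


async def fetch_ad_or_nothing(fetch_ad, timeout_s: float = 0.05):
    """Return the ad markup, or None if the ad system misses the deadline."""
    try:
        return await asyncio.wait_for(fetch_ad(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # missed impression; the page does not wait


async def render_page(content: str, fetch_ad) -> str:
    """Render content first; append the ad only if it arrived in time."""
    ad = await fetch_ad_or_nothing(fetch_ad)
    # Ad material goes at the end of the page, after the content.
    return content if ad is None else f"{content}\n[AD] {ad}"
```

If the ad system is slow, the user gets the page without the ad. Nobody has ever complained about that.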

Breaking even one of the rules automatically disqualifies you.

If you, as an advertiser, find these rules unacceptable - well, then we are in mutual disagreement. I find your ads equally unacceptable and will treat them as a form of cancer.

However, as a genuine service to the user... please allow the users to search for ads that have been displayed to them. Preferably by display context. I would be glad to return to a subject at a later date and search for something I remember seeing earlier.

The above set of rules is still not ideal, but everything that behaved according to them would at least be palatable.