Filed under random

Four stages of getting griefed

Scribbling at $dayjob, again

Some time ago (cough several months cough) I wrote a piece at work. As part of a bigger set of reader friendly updates, I knew I needed the option to link to a concise piece that explains what takes place in the event of an online breach. And that piece could not be too specific, or tied to any single vendor or system.

Among other things, breaches are reported and news of them read because humans are... well, human. Selfish. Shallow.

Consider this a PR-approved, yet sufficiently cynical take on the anatomy of an online breach: Four stages of getting griefed

Crowdstrike outage was inevitable

The global IT outage was only a matter of time

Friday, 2024-07-19. Large parts of the world come to a screeching halt, when Crowdstrike, a "security and endpoint protection solutions company", deploys a broken signature update. 8.5 million Windows machines crash and do not recover.

Crowdstrike caught the bullet and the blame, but this could have been any one of the vendors in the same sector. They are all equally bad.

Rootkits, rootkits everywhere

An endpoint detection and response product (EDR, NDR, XDR, or whatever they are called this week) tries to prevent dodgy and/or malicious software compromising the system. It does this by inspecting, intercepting, blocking or modifying the low-level library and operating system calls made by all other software running on the system. To do that, it must itself run with elevated privileges and even inside the kernel.

A piece of third-party software, loaded into the kernel, intended to modify what other software on the system can see or do. Back in the 1990s we used to call them rootkits.

The endpoint security products may wear corporate suits and come with exorbitant marketing budgets, but from a technical perspective they are just the same: rootkits.

Poor quality code

The outage from this particular failure was caused by Crowdstrike's agent (running in the kernel) parsing a malformed signature data file ... and promptly crashing. Taking the entire system with it.

On reboot, the agent is one of the first pieces to start, so it has an early access to read its data files before anything else comes up. So now it crashes early in the boot sequence, causing a "boot loop".

Reboot. CRASH! Reboot. CRASH!

This should be error handling 101, and it is crystal clear Crowdstrike's agent was not doing it properly. But as I said at the top, this could have been anyone. If this is how the industry giants write kernel drivers, their userspace code is unlikely to be any better.

This is not a new sentiment. Few months back, Dmitri Alperovitch - founding member of CSRB, co-founder and former CTO of Crowdstrike - said during a podcast interview that some of the worst code he had ever seen was in security products. Industry professionals have known this for a long time, so it didn't come as news... but it was admittedly nice to hear someone finally say the quiet part out loud.

Bad release management practices

Based on the reports heard so far, the signature data update ("channel file") had passed all their build time checks. The malformed data was inserted into the build artifacts at a later stage, before the now-modified release artifact was finally made available to their installed client base.

How is that an acceptable release mechanism?

As if that wasn't enough, the updated artifacts were made globally available without (apparently) testing them internally first. Even the most basic testing scheme should have caught an entire fleet of machines crashing after update. So either Crowdstrike did not have testing regime in place - or they did, and nobody paid attention to what the test systems were showing.

How is that an acceptable release strategy?

And if Crowdstrike, a supposedly industry-leading giant is doing it this badly - how bad is it with the rest of the industry?

Perverse incentives

Based on grapevine and rumour mill (ie. things I can not attribute or verify), EDR vendors have contractual SLAs that bind them to a guaranteed 4-hour turnaround from "new threat detected" to "signature and prevention mechanism deployed to clients".

High pressure. Complex problems. Arbitrarily tight, unrealistic deadlines.

That is a recipe for cutting corners and bypassing checks. If running a pre-release check cycle takes 20 minutes and you're already running out of time, then sure, by all means, YOLO the release. If you breach the SLA, your clients are going to get hefty discounts and service credits - and your boss's boss will miss their annual bonus ladder.

What's the worst that could happen?

Regulatory capture

A lesser known misfeature of the security product vendor race is that once a vendor is big enough and well-known enough, their name gets "somehow" added to security questionnaire templates. These templates are used by various regulators, auditors, clients' vendor diligence/assurance teams, and insurers. Get big enough as a vendor, and you get added to a list of pre-approved/known-good providers, into a dropdown menu in spreadsheet, or a radio button menu in a web form.

These questionnaires are everywhewre. Every time you, as a party getting questioned, pick the sane option ("other"), you get to explain the reason for doing so to non-technical, non-security people. It is no wonder that for a company going through the same dance for the umpteenth time, someone high up in the chain will eventually decide that it's going to be easier to buy a solution from one of the listed vendors just to cut down on the time and headache.

A tacit moat is still a moat.

Disaster recovery through clicky-clicky

As bad as the security product vendors may be, they are not the only ones to drop the ball. This disaster took out more than eight million systems in a couple of hours. The companies impacted will take days, if not weeks, to recover in full.

In this day and age, system provisioning and recovery should be a solved problem. Frequently updated, well maintained golden images with well exercised, automated (re)install cycles should be table stakes.

They're not.

Instead we have overburdened IT admin teams who have to go around from machine to machine, clicking buttons in the right order, to get the basic functionality back.

Our industries are running fleets of machines that are capable of doing the same thing over and over again, blazingly fast, never getting tired. We supposedly thrive on automation. And yet the actual maintenance of these same machines is done without taking advantage of the same automation capabilities. Instead of routinely used scripts taking care of the mundane activities, we depend on runbooks with screenshots to explain which button to click at any given step in the sequence.

The outro

A global outage thanks to security vendor failure was not an accident. It was an inevitability.

And we're going to see it happen again.

In wake of xz project compromise...

OSS project governance demands, distilled

Hi, I'm from Entitled Inc.

You know that unpaid labour of yours, which we benefit from? You should do more of it. Oh, and you should add all this red tape so that we don't have to do anything ourselves.

We're still not going to pay you.

We also demand that you sign our Modern Slavery statement.

Life of success

Three steps to a life of success

  1. Be born to the right parents
  2. Nepo
  3. Coast

Collected scribblings

Things written in times past

I used to write things for a previous employer's tech blog, but the old URLs may succumb to bitrot. These should work as long as Medium works:

The challenges of running a betting exchange (2016)

Notes on interviewing engineers (2016) -- This one was also picked up by a recruiter's blog.

DevOps is culture, not a title prefix (2017)

Security and Devops - a natural fit (2017)

Wait, what is my fleet doing (2018)

Hey, guess what? Your passwords have been compromised

Shields up on user information (2019)

Audits explained

Audits explained

A pentest is like going to the GP for a check-up. An audit is like having a month-long colonoscopy.

from copilot import vulnerabilities

Not your grandfather's MVC

This was perfectly predictable. CoPilot generates insecure code, as expected.

Machine Learning, the magic pixie dust of the past decade, is all about volume. And writing secure code is harder than writing insecure code. So by sheer volume there will be a lot more insecure code around.

Given that a lot of code in the wild is essentially a minimum viable copypaste from the highest scoring answer on StackOverflow, teaching the code generator model has obviously consumed a lot of insecure code. Since SO rewards speed, the answers that take the least time to write will receive most points.

Writing secure code takes more time and more space - so by the time someone submits an answer that considers security aspects, the person asking the question has already accepted (and ran with) the first and shortest working answer instead.

StackOverflow has redefined the MVC programming model. It now stands for Minimum Viable Copypaste.

Getting home for plague'mas

Be honest

Planning to travel to visit your family in these plague-ridden times, you're really saying:

"I miss my family so much they will see me if it's the last thing they do."

It's that time of the year again

Page from an undated journal

Daddy's crying. Mommy has a black eye. Little sister's hiding under her bed.

Yep, it's christmas.

Perception is almost everything

Misunderstood Indexing

Holding the number-one spot on Corruption Perceptions Index tells nothing about how well a country is doing.

It merely highlights how bad the situation is even for the runner-up.