Crowdstrike outage was inevitable

The global IT outage was only a matter of time

Friday, 2024-07-19. Large parts of the world come to a screeching halt when Crowdstrike, a "security and endpoint protection solutions company", deploys a broken signature update. 8.5 million Windows machines crash and do not recover.

Crowdstrike caught the bullet and the blame, but this could have been any one of the vendors in the same sector. They are all equally bad.

Rootkits, rootkits everywhere

An endpoint detection and response product (EDR, NDR, XDR, or whatever they are called this week) tries to prevent dodgy and/or malicious software from compromising the system. It does this by inspecting, intercepting, blocking or modifying the low-level library and operating system calls made by all other software running on the system. To do that, it must itself run with elevated privileges and even inside the kernel.

A piece of third-party software, loaded into the kernel, intended to modify what other software on the system can see or do. Back in the 1990s we used to call them rootkits.

The endpoint security products may wear corporate suits and come with exorbitant marketing budgets, but from a technical perspective they are just the same: rootkits.

Poor quality code

The outage from this particular failure was caused by Crowdstrike's agent (running in the kernel) parsing a malformed signature data file ... and promptly crashing. Taking the entire system with it.
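For illustration, here is roughly what the boring alternative to "parse and crash" looks like: validate the file before trusting it, and keep running on the last known good copy if validation fails. This is a sketch only. The file layout (magic bytes, a record count, fixed-size records) and every name in it are made up for the example; nothing here describes the real channel file format.

```python
import struct

MAGIC = b"SIG1"                  # hypothetical file signature
HEADER = struct.Struct("<4sI")   # magic, record count
RECORD = struct.Struct("<I16s")  # rule id, 16 pattern bytes


class MalformedFile(Exception):
    """Raised when a data file fails validation."""


def parse_signatures(blob: bytes) -> list:
    """Parse a signature blob, refusing anything that does not add up."""
    if len(blob) < HEADER.size:
        raise MalformedFile("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise MalformedFile("bad magic")
    expected = HEADER.size + count * RECORD.size
    if len(blob) != expected:
        raise MalformedFile(f"expected {expected} bytes, got {len(blob)}")
    # Only now is it safe to index into the blob record by record.
    return [RECORD.unpack_from(blob, HEADER.size + i * RECORD.size)
            for i in range(count)]


def load_with_fallback(new_blob: bytes, last_good: bytes):
    """Prefer the new file, but keep running on the old one if it is broken."""
    try:
        return parse_signatures(new_blob), "new"
    except MalformedFile:
        return parse_signatures(last_good), "last-known-good"
```

The point is not the format but the order of operations: every length and count is checked before anything is read, and a bad update degrades to "yesterday's signatures" instead of taking the host down.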

On reboot, the agent is one of the first pieces to start, so it gets to read its data files before anything else comes up. So now it crashes early in the boot sequence, causing a "boot loop".

Reboot. CRASH! Reboot. CRASH!

This should be error handling 101, and it is crystal clear Crowdstrike's agent was not doing it properly. But as I said at the top, this could have been anyone. If this is how the industry giants write kernel drivers, their userspace code is unlikely to be any better.

This is not a new sentiment. A few months back, Dmitri Alperovitch - founding member of the CSRB, co-founder and former CTO of Crowdstrike - said during a podcast interview that some of the worst code he had ever seen was in security products. Industry professionals have known this for a long time, so it didn't come as news... but it was admittedly nice to hear someone finally say the quiet part out loud.

Bad release management practices

Based on the reports heard so far, the signature data update ("channel file") had passed all their build time checks. The malformed data was inserted into the build artifacts at a later stage, before the now-modified release artifact was finally made available to their installed client base.

How is that an acceptable release mechanism?
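The usual guard against this is to pin what was tested: record a digest of the artifact when it passes its checks, and refuse to publish anything whose digest differs. A minimal sketch, assuming SHA-256 and a hypothetical publish callable; it says nothing about how any vendor's pipeline is actually built.

```python
import hashlib
from typing import Callable


def digest(artifact: bytes) -> str:
    """SHA-256 of the exact bytes that went through testing."""
    return hashlib.sha256(artifact).hexdigest()


def publish_if_unchanged(artifact: bytes, tested_digest: str,
                         publish: Callable[[bytes], None]) -> bool:
    """Publish only if these are the same bytes that passed the checks."""
    if digest(artifact) != tested_digest:
        return False  # something touched the artifact after testing
    publish(artifact)
    return True
```

If the artifact has to be modified after the build, the modified artifact is the thing that needs testing, and its digest is the one that gets pinned.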

As if that wasn't enough, the updated artifacts were made globally available without (apparently) testing them internally first. Even the most basic testing scheme should have caught an entire fleet of machines crashing after the update. So either Crowdstrike did not have a testing regime in place - or they did, and nobody paid attention to what the test systems were showing.

How is that an acceptable release strategy?
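For contrast, a staged rollout with a health gate fits in a dozen lines. The sketch below is an assumption-laden illustration: the ring names, the 1% crash threshold, and the deploy and crash_rate callables are all invented for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[str],
    deploy: Callable[[str], None],
    crash_rate: Callable[[str], float],
    max_crash_rate: float = 0.01,  # assumed threshold
) -> list:
    """Deploy ring by ring and stop at the first ring that looks unhealthy.

    Returns the rings that actually received the update.
    """
    updated = []
    for ring in rings:
        deploy(ring)
        updated.append(ring)
        if crash_rate(ring) > max_crash_rate:
            break  # halt here: later rings never see the update
    return updated
```

With rings such as internal test machines, a small canary group, and then everyone else, a file that crashes every machine it touches stops at the first ring.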

And if Crowdstrike, a supposedly industry-leading giant, is doing it this badly - how bad is it with the rest of the industry?

Perverse incentives

Based on the grapevine and rumour mill (i.e. things I cannot attribute or verify), EDR vendors have contractual SLAs that bind them to a guaranteed 4-hour turnaround from "new threat detected" to "signature and prevention mechanism deployed to clients".

High pressure. Complex problems. Arbitrarily tight, unrealistic deadlines.

That is a recipe for cutting corners and bypassing checks. If running a pre-release check cycle takes 20 minutes and you're already running out of time, then sure, by all means, YOLO the release. If you breach the SLA, your clients are going to get hefty discounts and service credits - and your boss's boss will miss their annual bonus ladder.

What's the worst that could happen?

Regulatory capture

A lesser known misfeature of the security product vendor race is that once a vendor is big enough and well-known enough, their name gets "somehow" added to security questionnaire templates. These templates are used by various regulators, auditors, clients' vendor diligence/assurance teams, and insurers. Get big enough as a vendor, and you get added to a list of pre-approved/known-good providers, into a dropdown menu in a spreadsheet, or a radio button menu in a web form.

These questionnaires are everywhere. Every time you, as a party getting questioned, pick the sane option ("other"), you get to explain the reason for doing so to non-technical, non-security people. It is no wonder that for a company going through the same dance for the umpteenth time, someone high up in the chain will eventually decide that it's going to be easier to buy a solution from one of the listed vendors just to cut down on the time and headache.

A tacit moat is still a moat.

Disaster recovery through clicky-clicky

As bad as the security product vendors may be, they are not the only ones to drop the ball. This disaster took out more than eight million systems in a couple of hours. The companies impacted will take days, if not weeks, to recover in full.

In this day and age, system provisioning and recovery should be a solved problem. Frequently updated, well maintained golden images with well exercised, automated (re)install cycles should be table stakes.

They're not.

Instead we have overburdened IT admin teams who have to go around from machine to machine, clicking buttons in the right order, to get the basic functionality back.

Our industries are running fleets of machines that are capable of doing the same thing over and over again, blazingly fast, never getting tired. We supposedly thrive on automation. And yet the actual maintenance of these same machines is done without taking advantage of the same automation capabilities. Instead of routinely used scripts taking care of the mundane activities, we depend on runbooks with screenshots to explain which button to click at any given step in the sequence.
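To make the contrast concrete, here is what a runbook looks like when it is code instead of screenshots: an ordered list of steps, each of which can tell whether it has already been done, applied to every host in the fleet. This is only a sketch; the Step type and the step names used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class Step:
    name: str
    done: Callable[[str], bool]    # is this step already satisfied on the host?
    apply: Callable[[str], None]   # make it so


def run_runbook(hosts: Iterable[str], steps: Sequence[Step]) -> dict:
    """Apply every step, in order, to every host; skip steps already done."""
    report = {}
    for host in hosts:
        applied = []
        for step in steps:
            if not step.done(host):
                step.apply(host)
                applied.append(step.name)
        report[host] = applied
    return report
```

Because each step checks before it acts, the whole thing can be rerun safely after a partial failure, which is exactly what a human clicking through screenshots cannot promise.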

The outro

A global outage thanks to security vendor failure was not an accident. It was an inevitability.

And we're going to see it happen again.

In the wake of the xz project compromise...

OSS project governance demands, distilled

Hi, I'm from Entitled Inc.

You know that unpaid labour of yours, which we benefit from? You should do more of it. Oh, and you should add all this red tape so that we don't have to do anything ourselves.

We're still not going to pay you.

We also demand that you sign our Modern Slavery statement.

Solution in search of permanence

Visions of future past

The year is 2041. A freshly elected Tory government, with nothing left in the country to sell off, finds a solution to its prison overcrowding problem. State-sanctioned organ harvesting becomes an overnight export success.

Medical facilities are caught by surprise, consistently outbid by the pet food industry.

Life of success

Three steps to a life of success

  1. Be born to the right parents
  2. Nepo
  3. Coast

Collected scribblings

Things written in times past

I used to write things for a previous employer's tech blog, but the old URLs may succumb to bitrot. These should work as long as Medium works:

The challenges of running a betting exchange (2016)

Notes on interviewing engineers (2016) -- This one was also picked up by a recruiter's blog.

DevOps is culture, not a title prefix (2017)

Security and Devops - a natural fit (2017)

Wait, what is my fleet doing (2018)

Hey, guess what? Your passwords have been compromised

Shields up on user information (2019)

Coronation Time

Dedicated to Cause

UK government's refusal to negotiate with the NHS staff can only be taken as dedication to monarchy. They want to make sure that King Charles's coronation will be a once-in-a-lifetime experience for as many people as possible.

Audits explained

A pentest is like going to the GP for a check-up. An audit is like having a month-long colonoscopy.

from copilot import vulnerabilities

Not your grandfather's MVC

This was perfectly predictable. Copilot generates insecure code, as expected.

Machine Learning, the magic pixie dust of the past decade, is all about volume. And writing secure code is harder than writing insecure code. So by sheer volume there will be a lot more insecure code around.

Given that a lot of code in the wild is essentially a minimum viable copypaste from the highest scoring answer on StackOverflow, training the code generator model has obviously consumed a lot of insecure code. Since SO rewards speed, the answers that take the least time to write will receive the most points.

Writing secure code takes more time and more space - so by the time someone submits an answer that considers security aspects, the person asking the question has already accepted (and run with) the first and shortest working answer instead.
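A made-up but typical example of the gap, using password storage. The first function is the kind of answer that gets accepted within minutes; the rest is what it takes to do the job with some care. Function names and the iteration count are my own choices for illustration, not a recommendation from any standard.

```python
import hashlib
import hmac
import secrets


def hash_quick(password: str) -> str:
    # The short answer: unsalted, fast hash. It "works", and it is cheap to crack.
    return hashlib.sha256(password.encode()).hexdigest()


def hash_careful(password: str, salt: bytes = b"") -> tuple:
    # The longer answer: a random per-password salt and a deliberately slow KDF.
    salt = salt or secrets.token_bytes(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, key


def verify_careful(password: str, salt: bytes, key: bytes) -> bool:
    # Recompute with the stored salt and compare in constant time.
    return hmac.compare_digest(hash_careful(password, salt)[1], key)
```

One line versus three functions and a stored salt. Guess which one collects the upvotes.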

StackOverflow has redefined the MVC programming model. It now stands for Minimum Viable Copypaste.

Chilli jam

Chilli by the spoonful

Estimated preparation time: 1.5 h

chillies

Adapted from a Guardian recipe

Ingredients

  • A generous 300 g of fresh chillies
  • 1 kg of fine granulated sugar
  • Pectin
  • 400 ml of cider vinegar

Other equipment

  • Jam jars
  • 5 l pot
  • Blender

Preparation

Measure out a generous 5 g of pectin and mix it into the sugar.

Boil the jars and lids in a separate pot. Set them aside and keep them under water until the jam is ready.

sterilising

Method

Cut the stem ends off the chillies, halve them, and remove the seeds and most of the white flesh.

without seeds

...

Once cleaned, there should be about 250 g of chillies. It does not need to be exact. (According to the original recipe 200 g is enough, but here we are adding a bit of kick and flavour.)

about 250 g

Chop the chillies into suitably small pieces and blend.

chilli pieces

blended

Put the ingredients in the pot:

  • sugar and pectin
  • blended chilli
  • vinegar

all mixed together

Boil and stir for about 15 minutes. The mixture can be left to bubble quite vigorously.

the boil

The jam is ready when it sets on a cold plate. If you keep boiling for a few minutes longer, the result is more gelatinous.

Pour the jam into the jars. First let them cool for a while at room temperature or outside, and once the jars are no longer hot, move them to the fridge.

jam jars

close-up

Storage

Keeps well in the fridge for as long as half a year. Unless it runs out first.

Swipe to SSH

SSH (Yubi)Key Authentication

SSH with private keys coming from secure hardware. What's not to like?

You've read the How-To. You've changed the PIN with:

yubico-piv-tool -a change-pin

You're generating a resident key with:

ssh-keygen -t ed25519-sk -O resident -f ~/.ssh/yubi_ed255_key

After entering your PIN, the above command breaks with "enrollment failed: invalid format". Surely you're doing something wrong? After a minute of head scratching, you try again, this time with:

ssh-keygen -vv -t ed25519-sk -O resident -f ~/.ssh/yubi_ed255_key

And are greeted with a confusing error. According to the error code (FIDO_ERR_PIN_NOT_SET), the resident key cannot be generated because your YubiKey is not protected with a PIN. But you've changed it already - what gives?

You've changed the PIN for the PIV application... which is different from the FIDO2 application.

Right idea, right PIN, wrong application?

Turns out you're missing the right tool. Get the correct one with:

${SUDO} apt install yubikey-manager

And then set the PIN for the FIDO2 application with:

ykman fido set-pin

(In yubikey-manager 4.x and later, the same operation is spelled ykman fido access change-pin.)

Now you can rerun the command from above and generate a private key directly with the YubiKey.

Things that work

The private key file generated by the ssh-keygen command can be nuked. It is, after all, a resident key, accessible directly from the YubiKey device. And you probably didn't add a passphrase for it either.

So you can now load the private key into SSH agent, with:

ssh-add -K

You'll need to type in the PIN you set earlier.

... and a few that don't

The main problem with the above setup is that every use of the private key, even when loaded into the agent, requires touching the magic button. To make things worse, the client gives no hint about what is needed: to a casual observer, establishing the connection just seems to hang.

This is okay for random logins, but breaks non-interactive workflows, and utterly messes up remote autocomplete. Touch the key one time too few, and the autocomplete never finishes. Touch it one time too many, and you've just vomited an OTP string to your terminal.

An ideal setup would allow the agent to authenticate without interaction for a configurable time, but so far this seems not to be supported.

Near-future experimentation

Documentation for ssh-keygen states that the resident key may be generated with the additional option -O no-touch-required to allow fully non-interactive use. At least at the time of writing, portable OpenSSH v8.4 does not appear to support the option, which may be for the best. Additionally, the public key's entry in authorized_keys needs a special annotation (the no-touch-required option) before the server will accept it, but even then it's not a good idea.

Because this option essentially would turn the YubiKey into a USB-attached SSH trust/identity dongle, it's far too dangerous to be used without other mitigations.

The missing hint

The bit about FIDO2 application for SSH client and the necessary command was found here.

Helpful two-liners

When changing the PIN/PUK codes, of course you want the new codes to be random. A really easy way to generate them is with python. Like this:

% python3

import secrets

f"{secrets.randbelow(10**6):06d}" # for PIN, zero-padded so leading zeros are kept

f"{secrets.randbelow(10**8):08d}" # for PUK, zero-padded so leading zeros are kept
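If you do this more than once, the same idea fits in a small helper. This is a sketch: the function name is mine, and the lengths (6 digits for the PIN, 8 for the PUK) simply follow the two calls above. It returns a string rather than a number so that a code starting with zeros keeps its full length.

```python
import secrets
import string


def random_digits(length: int) -> str:
    """Return `length` random decimal digits from the OS CSPRNG, as a string."""
    return "".join(secrets.choice(string.digits) for _ in range(length))


pin = random_digits(6)  # for PIN
puk = random_digits(8)  # for PUK
```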