CrowdStrike will add extra layers of testing for future content updates, stagger global deployments and give customers more control over them, after a faulty update crashed millions of Windows machines.
The endpoint detection and response (EDR) vendor published a preliminary post-incident report into last Friday’s sensor configuration update to Windows systems.
The vendor said the bad update wrongly passed a validation check that was trusted implicitly, because similar updates had cleared that check successfully on four previous occasions this year.
“Due to a bug in the content validator, one of two [file updates] passed validation despite containing problematic content data,” CrowdStrike said.
The company said it routinely ships two types of security content configuration files to its Falcon EDR customers: one type is delivered with new sensor releases, the other "at operational speed".
Sensor releases are exposed to “extensive QA” and testing, and “customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’),” the vendor said.
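To make the N, N-1 and N-2 choice concrete, here is a minimal Python sketch of how a fleet operator might map host groups to release offsets. The release names, group names and the `select_sensor_version` helper are hypothetical placeholders, not CrowdStrike's API; the sketch only illustrates the idea of pinning parts of a fleet one or two versions behind the latest.

```python
# Hypothetical sensor releases, newest first; the names are placeholders.
RELEASES = ["v4", "v3", "v2", "v1"]

# How many versions behind the latest release each policy sits.
POLICY_OFFSETS = {"N": 0, "N-1": 1, "N-2": 2}


def select_sensor_version(policy: str, releases: list[str] = RELEASES) -> str:
    """Return the release that a host group pinned to `policy` should run."""
    offset = POLICY_OFFSETS[policy]  # raises KeyError for unknown policies
    if offset >= len(releases):
        raise ValueError(f"not enough releases to satisfy policy {policy}")
    return releases[offset]


# Usage: a hypothetical fleet split into groups with different policies.
fleet_policies = {"test-lab": "N", "office-laptops": "N-1", "critical-servers": "N-2"}
for group, policy in fleet_policies.items():
    print(group, select_sensor_version(policy))
```

The idea behind holding critical hosts at N-2 is that a release has longer to show problems on earlier adopters before it reaches the machines that matter most.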
The configuration file update that caused Windows machines to crash was of the kind released at “operational speed” – a capability “used by threat detection engineers to gather telemetry, identify indicators of adversary behaviour and perform detections and preventions.”
This particular update type is “created and configured” through the company’s cloud-based Falcon platform. A “content validator that performs validation checks on the content before it is published” is part of that system.
However, beyond this validation step, the process lacks some of the more rigorous quality assurance and testing applied to content shipped through the sensor release process.
It also lacks the customer control offered for sensor releases, which explains why the file was applied immediately to Windows endpoints running CrowdStrike.
CrowdStrike said it will address this by testing these file types more thoroughly in future, including local developer testing, stress testing and stability testing.
It will also “implement a staggered deployment strategy … in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment”; and “provide customers with greater control over the delivery of … updates by allowing granular selection of when and where these updates are deployed.”
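A staggered rollout of the kind CrowdStrike describes can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the ring sizes (1%, 10%, 50%, 100%), the crash-rate threshold and the `staged_rollout` function are all hypothetical, and CrowdStrike has not published how its canary stages or health checks will work.

```python
# Hypothetical rollout rings: cumulative share of the sensor base per stage.
# The first, smallest ring acts as the canary.
RINGS = [0.01, 0.10, 0.50, 1.00]
MAX_CRASH_RATE = 0.001  # illustrative health threshold, not a CrowdStrike figure


def staged_rollout(hosts, apply_update, crash_rate_of):
    """Deploy to progressively larger slices of `hosts`, halting on bad health.

    `apply_update(host)` installs the update on one host; `crash_rate_of(updated)`
    returns the observed crash fraction among hosts updated so far. Both are
    caller-supplied stand-ins. Returns the hosts that received the update.
    """
    updated = []
    if not hosts:
        return updated
    for share in RINGS:
        # Cumulative target for this ring; always at least one host.
        target = max(1, int(len(hosts) * share))
        for host in hosts[len(updated):target]:
            apply_update(host)
            updated.append(host)
        # Check health before widening the rollout to the next ring.
        if crash_rate_of(updated) > MAX_CRASH_RATE:
            print(f"Halting at the {share:.0%} ring: crash rate too high")
            break
    return updated


# Usage: a hypothetical fleet of 1,000 hosts receiving an update that
# crashes every machine it reaches.
demo_fleet = [f"host-{i}" for i in range(1000)]
demo_crashed = set()


def faulty_update(host):
    demo_crashed.add(host)  # simulate the host crashing after the update


def observed_crash_rate(updated):
    return sum(host in demo_crashed for host in updated) / len(updated)


reached = staged_rollout(demo_fleet, faulty_update, observed_crash_rate)
print(len(reached))  # 10: only the canary ring was exposed
```

Because each ring is checked before the next one starts, a defective update in this model reaches only the canary slice rather than the whole fleet.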
CrowdStrike also committed to releasing a full root cause analysis once its investigations are complete.