CrowdStrike Explains Friday Incident Crashing Millions of Windows Devices


Jul 24, 2024 | Newsroom | Software Update / IT Outage

Cybersecurity firm CrowdStrike on Wednesday blamed an issue in its validation system for causing millions of Windows devices to crash as part of a widespread outage late last week.

“On Friday, July 19, 2024 at 04:09 UTC, as part of regular operations, CrowdStrike released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques,” the company said in its Preliminary Post Incident Review (PIR).

“These updates are a regular part of the dynamic protection mechanisms of the Falcon platform. The problematic Rapid Response Content configuration update resulted in a Windows system crash.”

The incident impacted Windows hosts running sensor version 7.11 and above that were online between July 19, 2024, 04:09 UTC and 05:27 UTC and received the update. Apple macOS and Linux systems were not affected.

CrowdStrike said it delivers security content configuration updates in two ways, one via Sensor Content that’s shipped with Falcon Sensor and another through Rapid Response Content that allows it to flag novel threats using various behavioral pattern-matching techniques.

The crash is said to have been the result of a Rapid Response Content update containing a previously undetected error. It’s worth noting that such updates are delivered in the form of Template Instances corresponding to specific behaviors – that are mapped to specific Template Types – for enabling new telemetry and detection.

The Template Instances, in turn, are created using a Content Configuration System, after which they are deployed to the sensor over the cloud through a mechanism dubbed Channel Files, which are ultimately written to disk on the Windows machine. The system also encompasses a Content Validator component that carries out validation checks on the content before it is published.
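To make the publish-time check concrete, here is a minimal sketch of a validation gate that sits between content creation and deployment. It is not CrowdStrike's Content Validator: the class names, the `field_count` rule, and the in-memory `channel` list are all hypothetical, chosen only to illustrate validating an instance against its declared type before it ships.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TemplateType:
    """Hypothetical schema: a name plus the number of fields it expects."""
    name: str
    field_count: int


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical content entry that claims to follow a TemplateType."""
    template: str
    fields: Tuple[str, ...]


def validate(instance: TemplateInstance, template_type: TemplateType) -> List[str]:
    """Return a list of problems; an empty list means the instance passed."""
    errors: List[str] = []
    if instance.template != template_type.name:
        errors.append("instance references a different template type")
    if len(instance.fields) != template_type.field_count:
        errors.append("field count does not match the template type")
    if any(not field for field in instance.fields):
        errors.append("instance contains an empty field")
    return errors


def publish(instance: TemplateInstance, template_type: TemplateType,
            channel: List[TemplateInstance]) -> bool:
    """Append the instance to the channel only if validation finds no errors."""
    if validate(instance, template_type):
        return False
    channel.append(instance)
    return True


ipc = TemplateType(name="ipc", field_count=3)
channel: List[TemplateInstance] = []  # stand-in for a channel file

good = TemplateInstance("ipc", ("pipe_name", "process_id", "access_mask"))
bad = TemplateInstance("ipc", ("pipe_name", ""))

published_good = publish(good, ipc, channel)
published_bad = publish(bad, ipc, channel)
```

A gate like this only catches the problems its rules anticipate, which is why content that passes validation can still be unsafe for the component that consumes it.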

“Rapid Response Content provides visibility and detections on the sensor without requiring sensor code changes,” it explained.

“This capability is used by threat detection engineers to gather telemetry, identify indicators of adversary behavior and perform detections and preventions. Rapid Response Content is behavioral heuristics, separate and distinct from CrowdStrike’s on-sensor AI prevention and detection capabilities.”

These updates are then parsed by the Falcon sensor’s Content Interpreter, which in turn enables the Sensor Detection Engine to detect or prevent malicious activity.

While each new Template Type is stress tested for different parameters like resource utilization and performance impact, the root cause of the problem, per CrowdStrike, could be traced back to the rollout of the Interprocess Communication (IPC) Template Type on February 28, 2024, which was introduced to flag attacks that abuse named pipes.

The timeline of events is as follows –

  • February 28, 2024 – CrowdStrike releases sensor 7.11 to customers with new IPC Template Type
  • March 5, 2024 – The IPC Template Type passes the stress test and is validated for use
  • March 5, 2024 – The IPC Template Instance is released to production via Channel File 291
  • April 8 – 24, 2024 – Three more IPC Template Instances are deployed in production
  • July 19, 2024 – Two additional IPC Template Instances are deployed, one of which passes validation despite having problematic content data

“Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production,” CrowdStrike said.

“When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSoD).”
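As a rough illustration of what graceful handling of this class of bug can look like, the sketch below checks an index taken from content data before using it to read an input. The `ContentEntry` type, the `field_index` attribute, and `read_field` are hypothetical and are not taken from the Falcon sensor. Python would raise an `IndexError` for an index that is too large (and would silently wrap a negative one), whereas native code running in the kernel has no such safety net, so the explicit check stands in for the validation an interpreter would need to do itself.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ContentEntry:
    """Hypothetical content record; not CrowdStrike's channel file format."""
    name: str
    field_index: int  # which input field this entry wants to inspect


def read_field(entry: ContentEntry, inputs: Sequence[str]) -> Optional[str]:
    """Return the requested input, or None if the index is out of bounds."""
    # Check the index against the actual input length before reading.
    if not 0 <= entry.field_index < len(inputs):
        return None  # reject the entry instead of reading past the data
    return inputs[entry.field_index]


inputs = ["pipe_name", "process_id", "access_mask"]
good = ContentEntry("in-range", field_index=1)
bad = ContentEntry("out-of-range", field_index=7)

print(read_field(good, inputs))  # process_id
print(read_field(bad, inputs))   # None
```

Returning `None` lets the caller skip or log a malformed entry rather than fail outright.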

In response to the sweeping disruptions caused by the crash, and to prevent them from happening again, the Texas-based company said it has improved its testing processes and enhanced its error handling mechanism in the Content Interpreter. It’s also planning to implement a staggered deployment strategy for Rapid Response Content.
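A staggered deployment can be sketched as a ring-by-ring rollout that halts when too many hosts in a ring fail. The example below is a generic illustration of that idea, not CrowdStrike's planned mechanism: the ring names, the `deploy` callback, and the 1% failure threshold are all assumptions.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> List[str]:
    """Deploy ring by ring and return every host that received the update.

    The rollout stops after any ring whose failure rate exceeds the
    threshold, so later (larger) rings never receive the update.
    """
    updated: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            healthy = deploy(host)  # True if the host stayed healthy
            updated.append(host)
            if not healthy:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt before touching the next ring
    return updated


rings = [
    ["canary-1", "canary-2"],
    ["early-1", "early-2", "early-3"],
    ["fleet-1", "fleet-2", "fleet-3", "fleet-4"],
]

# Simulate an update that breaks every host it reaches.
reached = staged_rollout(rings, deploy=lambda host: False)
print(reached)  # ['canary-1', 'canary-2']
```

In this simulation the faulty update stops at the two canary hosts instead of reaching all nine.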
