Cloudflare has released a comprehensive post-mortem analysis of a significant network outage that disrupted internet services globally on November 18, 2025.
The incident, which began at 11:20 UTC and lasted several hours, affected millions of websites and applications relying on Cloudflare’s content delivery network and security services.
Database Permission Change Triggers Cascade Failure
The outage was not caused by a cyberattack or malicious activity. Instead, it originated from a seemingly routine database permission update in Cloudflare’s ClickHouse database system.
This change altered how database queries returned metadata, filling the configuration file used by Cloudflare’s Bot Management system with duplicate entries and roughly doubling its size.
The feature file, which normally contains around 60 machine learning features for bot detection, swelled past 200 entries as duplicate rows from the underlying database tables were pulled in.
This exceeded hard-coded memory limits in Cloudflare’s proxy software, causing critical systems to crash when attempting to load the oversized file.
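To make the failure mode concrete, here is a minimal sketch of the mechanism as described above, not Cloudflare’s actual code: a metadata query that suddenly sees a second, underlying database returns duplicate rows per feature, and a loader with a hard-coded entry cap rejects the oversized file outright instead of degrading gracefully. All names, table counts, and row figures are assumptions for illustration.

```rust
// Illustrative sketch only; names, table counts, and limits are hypothetical.

const MAX_FEATURES: usize = 200; // hard cap described in the post-mortem

/// One row of column metadata, keyed by (database, table, column).
#[derive(Clone, Debug)]
struct MetadataRow {
    database: String,
    table: String,
    column: String,
}

/// Simulates a metadata query that does not filter by database name:
/// every database the caller can now see contributes its own rows.
fn query_feature_metadata(
    visible_databases: &[(&str, usize)], // (database name, number of underlying tables)
    features: &[String],
) -> Vec<MetadataRow> {
    let mut rows = Vec::new();
    for (db, table_count) in visible_databases {
        for t in 0..*table_count {
            for col in features {
                rows.push(MetadataRow {
                    database: db.to_string(),
                    table: format!("bot_features_{t}"), // hypothetical table name
                    column: col.clone(),
                });
            }
        }
    }
    rows
}

/// Loader with a hard-coded cap: anything over the limit is a hard error,
/// which in the proxy translated into a crash rather than a fallback.
fn load_feature_file(rows: &[MetadataRow]) -> Result<Vec<String>, String> {
    if rows.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, exceeding the hard limit of {MAX_FEATURES}",
            rows.len()
        ));
    }
    Ok(rows.iter().map(|r| r.column.clone()).collect())
}

fn main() {
    let features: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();

    // Healthy state: only the main database is visible, ~60 entries.
    let before = query_feature_metadata(&[("default", 1)], &features);
    println!("before: {} rows, load ok = {}", before.len(), load_feature_file(&before).is_ok());

    // After the permission change, a second database with several underlying
    // tables becomes visible and the row count multiplies past the cap.
    let after = query_feature_metadata(&[("default", 1), ("shard_db", 3)], &features);
    println!("after:  {} rows -> {:?}", after.len(), load_feature_file(&after));
}
```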
What made the diagnosis particularly challenging was the intermittent nature of the failures.
The problematic configuration file was regenerated every five minutes, but it contained the bad, duplicate-laden data only when the underlying queries happened to hit database nodes that had already received the permission update.
This created a pattern where services would fail, recover briefly, then fail again as new files propagated across the network.
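The flip-flopping can be pictured with a toy model, sketched below under my own assumptions (node names, the mix of updated nodes, and the entry counts are all illustrative): each five-minute regeneration cycle lands on whichever database node answers, and only the already-updated nodes produce the oversized file.

```rust
// Toy model of the intermittent failures: every regeneration cycle the file
// is rebuilt from whichever database node answers, and only updated nodes
// produce the oversized, duplicate-laden file. Node mix is hypothetical.

fn main() {
    // Hypothetical cluster: some nodes already have the permission change.
    let nodes = [
        ("node-a", false), // not yet updated -> good file
        ("node-b", true),  // updated -> oversized file
        ("node-c", false),
        ("node-d", true),
    ];

    // Simulate twelve five-minute cycles (one hour) of file regeneration.
    for cycle in 0..12 {
        // Stand-in for "whichever node the query happens to hit".
        let (name, updated) = nodes[(cycle * 7) % nodes.len()];
        let entries = if updated { 240 } else { 60 }; // illustrative counts
        let status = if entries > 200 { "proxies crash on load" } else { "healthy" };
        println!("cycle {cycle:2}: built from {name} ({entries} entries) -> {status}");
    }
}
```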
The erratic behavior initially led Cloudflare engineers to suspect a massive distributed denial-of-service attack, particularly after their external status page went offline at the same time.
Internal communications referenced recent high-volume Aisuru DDoS attacks, causing the team to investigate attack scenarios before identifying the actual configuration issue.
Cloudflare’s incident response began at 11:32 UTC when automated tests detected problems, though the full scope wasn’t immediately apparent.
Teams initially focused on the Workers KV service degradation and attempted various mitigations, including traffic manipulation and account limiting.
Wide-Ranging Service Impact
The outage affected numerous Cloudflare services. Core CDN and security services returned HTTP 5xx errors to end users. Turnstile authentication failed, preventing dashboard logins. Workers KV experienced elevated error rates.

Access authentication failed for most users, though existing sessions remained functional. Email security lost access to reputation sources, temporarily reducing spam detection accuracy.
Both Cloudflare’s legacy proxy system and its newer FL2 proxy engine were impacted, though in different ways.
FL2 customers encountered outright errors, while customers on the legacy system received an incorrect bot score of 0 for all traffic, potentially triggering false positives in rules that block suspected bots.
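The difference between the two proxy generations amounts to failing closed versus failing with a wrong answer. The sketch below is a hypothetical contrast, not Cloudflare’s code: one path propagates the load error (requests surface as 5xx), while the other silently substitutes a default score of 0 that downstream bot-blocking rules then act on.

```rust
// Hypothetical contrast of the two failure modes; not Cloudflare's code.

/// FL2-style path: if the feature file cannot be loaded, scoring fails
/// outright and the request surfaces as an HTTP 5xx error.
fn score_fl2(feature_file: Result<Vec<String>, String>) -> Result<u8, String> {
    let _features = feature_file?; // propagate the failure
    Ok(55) // placeholder "real" score when everything works
}

/// Legacy-style path: on failure it falls back to a default score of 0,
/// which looks to downstream rules like "definitely a bot".
fn score_legacy(feature_file: Result<Vec<String>, String>) -> u8 {
    match feature_file {
        Ok(_features) => 55, // placeholder "real" score
        Err(_) => 0,         // silent fallback
    }
}

/// A typical customer rule: block anything with a very low bot score.
fn rule_blocks(score: u8) -> bool {
    score < 30
}

fn main() {
    let broken_file: Result<Vec<String>, String> = Err("feature file too large".into());

    // FL2 path: the request errors out (seen by users as a 5xx).
    println!("FL2: {:?}", score_fl2(broken_file.clone()));

    // Legacy path: the request is scored 0, so a bot-blocking rule fires
    // against legitimate traffic -- a false positive.
    let score = score_legacy(broken_file);
    println!("legacy: score {score}, blocked by rule = {}", rule_blocks(score));
}
```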
Engineers identified the root cause at 13:37 UTC and stopped generation of new configuration files at 14:24 UTC.
They manually deployed a known-good version of the feature file and forced proxy restarts.
Core traffic resumed normal flow by 14:30 UTC, though complete service restoration took until 17:06 UTC as teams restarted affected systems and cleared backlogs.
Earlier, at 13:05 UTC, a temporary workaround had allowed Workers KV and Access to bypass the failing proxy layer, reducing the impact on dependent services before the complete fix was deployed.
Cloudflare acknowledged this as its worst outage since 2019 and committed to multiple remediation efforts.
The company plans to harden configuration file ingestion with validation checks, enable more global kill switches for features, prevent error reporting from overwhelming system resources, and review failure modes across all proxy modules.
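What hardened ingestion could look like is sketched below under my own assumptions rather than Cloudflare’s published design: validate each freshly generated feature file against sanity checks (entry cap, duplicates, non-empty) before activating it, and keep serving the last known-good version when a candidate fails. All names and thresholds are illustrative.

```rust
// Illustrative sketch of validate-before-activate ingestion; all names and
// thresholds are assumptions, not Cloudflare's actual implementation.

use std::collections::HashSet;

const MAX_FEATURES: usize = 200;

#[derive(Clone, Debug)]
struct FeatureFile {
    features: Vec<String>,
}

/// Sanity checks run on every newly generated file before it is activated.
fn validate(candidate: &FeatureFile) -> Result<(), String> {
    if candidate.features.is_empty() {
        return Err("empty feature file".into());
    }
    if candidate.features.len() > MAX_FEATURES {
        return Err(format!(
            "{} entries exceeds limit of {MAX_FEATURES}",
            candidate.features.len()
        ));
    }
    let unique: HashSet<&String> = candidate.features.iter().collect();
    if unique.len() != candidate.features.len() {
        return Err("duplicate feature entries detected".into());
    }
    Ok(())
}

/// Activate the candidate only if it passes validation; otherwise keep
/// serving the last known-good file and report the rejection.
fn ingest(current: FeatureFile, candidate: FeatureFile) -> FeatureFile {
    match validate(&candidate) {
        Ok(()) => candidate,
        Err(reason) => {
            eprintln!("rejected new feature file: {reason}; keeping previous version");
            current
        }
    }
}

fn main() {
    let known_good = FeatureFile {
        features: (0..60).map(|i| format!("feature_{i}")).collect(),
    };
    // A candidate bloated with duplicate entries, as in the incident.
    let bloated = FeatureFile {
        features: (0..60)
            .flat_map(|i| std::iter::repeat(format!("feature_{i}")).take(4))
            .collect(),
    };

    let active = ingest(known_good.clone(), bloated);
    println!("active file has {} features", active.features.len());
}
```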
The incident highlights how seemingly minor infrastructure changes can cascade into major failures when proper validation and size limits aren’t thoroughly tested across interconnected systems.
