Cloudflare Confirms API Outage Caused by React useEffect Overload Issue

Cloudflare experienced a significant outage on September 12, 2025, affecting its Tenant Service API, several other APIs, and the Cloudflare Dashboard.

The company has confirmed that the incident was primarily triggered by a React programming bug that caused excessive API calls, overwhelming critical infrastructure components.

Technical Root Cause Identified

The outage originated from a coding error in Cloudflare’s dashboard involving a React useEffect hook. Engineers mistakenly included a problematic object in the hook’s dependency array, causing React to treat the object as “always new” during state or prop changes.

This resulted in the useEffect hook executing repeatedly during a single dashboard render instead of running just once as intended.
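
The pattern described is roughly the following, shown here as a minimal, hypothetical React sketch in TypeScript; the component, endpoint, and field names are illustrative and not Cloudflare's actual dashboard code.

```tsx
import { useEffect, useState } from "react";

// Hypothetical component illustrating the bug class described above.
function TenantList({ accountId }: { accountId: string }) {
  const [tenants, setTenants] = useState<string[]>([]);

  // Problematic pattern: this object is re-created on every render, so React
  // sees a "new" dependency each time and re-runs the effect.
  const options = { accountId, includeInactive: false };

  useEffect(() => {
    // Each run issues another API call; the resulting state update triggers
    // another render, which creates a new `options` object, which re-runs the
    // effect again -- a loop of unnecessary requests.
    fetch(`/api/v4/tenants?account=${options.accountId}`)
      .then((res) => res.json())
      .then(setTenants);
  }, [options]); // object identity changes on every render

  // Fix: depend on the primitive value instead (or memoize the object with
  // useMemo), e.g. `}, [accountId]);`, so the effect runs only when the
  // account actually changes.

  return (
    <ul>
      {tenants.map((t) => (
        <li key={t}>{t}</li>
      ))}
    </ul>
  );
}
```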

The bug coincided with a service update to the Tenant Service API, creating a perfect storm that overwhelmed the service and prevented proper recovery.

The Cloudflare dashboard was severely impacted for the full duration of the incident.

Each dashboard interaction triggered multiple unnecessary API calls, multiplying the load on backend systems well beyond their capacity.

When the Tenant Service became overloaded, the effects rippled throughout Cloudflare’s infrastructure because the service forms a critical part of API request authorization logic.

Without functional Tenant Service operations, the system could not evaluate authorization requests properly, causing API calls to return 5xx status codes across multiple services.
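
As a hedged illustration of how an authorization dependency can turn into 5xx errors, the sketch below shows a gateway that fails closed when its tenant lookup is unavailable. The service URL, header, and status codes are assumptions for the example, not Cloudflare's implementation.

```ts
import http from "node:http";

// Hypothetical tenant lookup used for authorization decisions.
async function lookupTenant(accountId: string): Promise<{ allowed: boolean }> {
  const res = await fetch(`https://tenant-service.internal/v1/tenants/${accountId}`);
  if (!res.ok) throw new Error(`tenant service returned ${res.status}`);
  return (await res.json()) as { allowed: boolean };
}

const server = http.createServer(async (req, res) => {
  const accountId = req.headers["x-account-id"];
  if (typeof accountId !== "string") {
    res.writeHead(400).end("missing account id");
    return;
  }
  try {
    const tenant = await lookupTenant(accountId);
    if (!tenant.allowed) {
      res.writeHead(403).end("forbidden"); // a real authorization decision
      return;
    }
    res.writeHead(200).end("ok");
  } catch {
    // The tenant service is overloaded or unreachable: authorization cannot
    // be evaluated, so the request fails with a server-side error (5xx)
    // rather than a misleading 403.
    res.writeHead(503).end("authorization unavailable");
  }
});

server.listen(8080);
```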

The outage timeline shows the incident began at 17:57 UTC, when the Tenant Service API became overwhelmed during the deployment of a new version.

Dashboard availability dropped significantly, though API availability briefly recovered to 98% after additional resources were allocated at 18:17 UTC.

Cloudflare’s incident response team initially focused on reducing load and increasing available resources for the Tenant Service.

They implemented a global rate limit and increased the number of Kubernetes pods running the Go-based service. However, these measures proved insufficient for complete service restoration.
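
A global rate limit of this kind is often built as a token bucket. The sketch below is a minimal TypeScript version with purely illustrative numbers, not the configuration Cloudflare used.

```ts
// Minimal token-bucket limiter: requests are allowed only while tokens remain,
// and tokens refill at a fixed rate, capping sustained throughput.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  allow(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Illustrative cap: 500 requests per second across the whole service.
const globalLimit = new TokenBucket(500, 500);

function handleRequest(): number {
  // Shed excess load with 429s instead of letting it overwhelm the backend.
  return globalLimit.allow() ? 200 : 429;
}
```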

A critical mistake occurred at 18:58 UTC, when engineers attempted to remove the failing code paths and released a new Tenant Service version.

This change worsened the situation, causing increased API impact until the problematic changes were reverted at 19:12 UTC, finally restoring dashboard availability to 100%.

Cloudflare has identified several improvement areas to prevent similar incidents. The company is prioritizing migration to Argo Rollouts for automatic deployment monitoring and rollback capabilities, which would have limited the second outage’s duration.

Additional measures include implementing random delays in dashboard retries to prevent thundering herd scenarios when services recover, substantially increasing Tenant Service capacity allocation, and enhancing monitoring systems for proactive alerting before capacity limits are reached.
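
The retry-jitter idea can be sketched as exponential backoff with a random delay. The snippet below is an assumption-laden example (the base delay, attempt count, and helper name are invented), not Cloudflare's dashboard code.

```ts
// Retry with exponential backoff plus full jitter: each client waits a random
// amount of time, so a recovering service is not hit by every client at once.
async function fetchWithJitter(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.ok) return res;

    const baseMs = 250;
    const delayMs = Math.random() * baseMs * 2 ** attempt; // random delay in [0, base * 2^attempt)
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`request to ${url} failed after ${maxAttempts} attempts`);
}
```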

The company is also improving API call visibility by adding metadata to distinguish between retry requests and new requests, enabling faster identification of similar loop-based issues in the future.
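
Tagging retries so the backend can tell them apart from first attempts might look like the sketch below; the header name is hypothetical.

```ts
// Attach an attempt counter so server-side metrics can separate retried
// requests from genuinely new ones (the header name is illustrative).
async function fetchWithRetryMetadata(url: string, attempt = 0): Promise<Response> {
  return fetch(url, {
    headers: { "X-Request-Attempt": String(attempt) }, // 0 = first try, 1+ = retry
  });
}
```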
