Cloudflare API Outage Linked to React useEffect Bug That Caused Service Overload and Recovery Failure

Cloudflare has published a detailed post-mortem explaining the significant outage on September 12, 2025, that made its dashboard and APIs unavailable for over an hour.

The company traced the incident to a software bug in its dashboard that, combined with a service update, created a cascading failure in a critical internal system.

The incident began with the release of a new version of the Cloudflare Dashboard. According to the company’s report, this update contained a bug in its React code that caused it to make repeated, excessive calls to the internal Tenant Service API. This service is a core component responsible for handling API request authorization.


The bug was located in a useEffect hook, which was mistakenly configured to trigger the API call on every state change, leading to a loop of requests during a single dashboard render. This behavior coincided with the deployment of an update to the Tenant Service API itself.
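Cloudflare's report does not include the offending code, but a common way this pattern arises is a useEffect dependency whose identity changes on every render. A minimal sketch of the failure mode, with hypothetical component, endpoint, and field names, might look like this:

```tsx
import { useEffect, useState } from "react";

// Hypothetical illustration -- not Cloudflare's actual dashboard code.
// The dependency is an object recreated on every render, so React treats it
// as "changed" each time and re-runs the effect.
function TenantList({ accountId }: { accountId: string }) {
  const [tenants, setTenants] = useState<string[]>([]);
  const query = { account: accountId }; // new object identity on every render

  useEffect(() => {
    // Fires on every render, flooding the tenant API.
    fetch(`/api/tenants?account=${query.account}`)
      .then((res) => res.json())
      .then(setTenants); // the state update triggers another render -> another call
  }, [query]); // BUG: should depend on the primitive `accountId` instead

  return <ul>{tenants.map((t) => <li key={t}>{t}</li>)}</ul>;
}

export default TenantList;
```

Depending on the primitive `accountId` instead of the object (or memoizing `query` with `useMemo`) would stop the effect from re-running after its own state update.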

The resulting “thundering herd” of requests from the buggy dashboard overwhelmed the newly deployed service, causing it to fail and recover improperly.

Because the Tenant Service is required to authorize API requests, its failure led to a widespread outage of the Cloudflare Dashboard and many of its APIs, starting at 17:57 UTC.

Incident Response and Recovery

Cloudflare’s engineering teams first noticed the increased load on the Tenant Service and responded by trying to reduce the pressure and add resources.

They implemented a temporary global rate-limiting rule and increased the number of Kubernetes pods available to the service to improve throughput. While these actions helped restore partial API availability, the dashboard remained down.
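Cloudflare has not published the rule itself, but conceptually a global rate limit of this kind acts like a token bucket placed in front of the service, rejecting excess requests so the backend has room to recover. A rough TypeScript sketch, with illustrative capacity and refill values:

```ts
// Illustrative token-bucket rate limiter -- not Cloudflare's actual rule.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }

  // Returns true if the request may proceed, false if it should be rejected
  // (e.g. with HTTP 429) to shed load from an overwhelmed backend.
  allow(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: allow roughly 200 requests per second to the tenant service.
const tenantServiceLimit = new TokenBucket(200, 200);
```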

A subsequent attempt to patch the service to fix erroring codepaths at 18:58 UTC proved counterproductive, causing a second brief impact on API availability. This change was quickly reverted, and full service was restored by 19:12 UTC.

Importantly, Cloudflare noted that the outage was limited to its control plane, which handles configuration and management. The data plane, which processes customer traffic, was unaffected due to strict separation, meaning end-user services remained online.

Following the incident, Cloudflare has outlined several measures to prevent a recurrence. The company plans to prioritize migrating the Tenant Service to Argo Rollouts, a deployment tool that automatically rolls back a release if it detects errors.

To mitigate the “thundering herd” issue, the dashboard is being updated to include randomized delays in its API retry logic. The Tenant Service itself has been allocated substantially more resources, and its capacity monitoring will be improved to provide proactive alerts.
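The report does not include the updated retry code, but retry logic with randomized delay (exponential backoff with "full jitter") commonly looks like the sketch below; the helper name, base delay, and attempt count are illustrative:

```ts
// Illustrative retry-with-jitter helper -- not Cloudflare's actual dashboard code.
async function fetchWithJitter(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.ok) return res;

    // Exponential backoff with full jitter: a random delay in
    // [0, base * 2^attempt), so retrying clients spread out instead of
    // retrying in lockstep and re-creating a thundering herd.
    const baseMs = 250;
    const delayMs = Math.random() * baseMs * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Request to ${url} failed after ${maxAttempts} attempts`);
}
```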
