Google links massive cloud outage to API management issue

Google says an API management issue is behind Thursday’s massive Google Cloud outage, which disrupted or brought down its services and many other online platforms.

Google says the cloud outage started around 10:49 ET and ended at 3:49 ET, after causing issues for millions of users worldwide for over three hours.

Besides Google Cloud, the incident also impacted Gmail, Google Calendar, Google Chat, Google Cloud Search, Google Docs, Google Drive, Google Meet, Google Tasks, Google Voice, Google Lens, Discover, and Voice Search.

The outage also caused widespread issues for third-party platforms that rely on Google Cloud, including but not limited to Spotify, Discord, Snapchat, NPM, Firebase Studio, and a limited number of Cloudflare services relying on the Workers KV key-value store.

“We are deeply sorry for the impact to all of our users and their customers that this service disruption/outage caused. Businesses large and small trust Google Cloud with your workloads and we will do better,” Google said.

While it’s still working on a full incident report, Google revealed today the root cause of the increased number of 503 errors in external API requests during yesterday’s outage.

As the company explained today, its Google Cloud API management platform failed due to invalid data, an issue that wasn’t discovered and remediated promptly because it lacked effective testing and error-handling systems.

“From our initial analysis, the issue occurred due to an invalid automated quota update to our API management system which was distributed globally, causing external API requests to be rejected. To recover we bypassed the offending quota check, which allowed recovery in most regions within 2 hours,” the company added.

“However, the quota policy database in us-central1 became overloaded, resulting in much longer recovery in that region. Several products had moderate residual impact (e.g. backlogs) for up to an hour after the primary issue was mitigated and a small number recovering after that.”
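The failure mode Google describes, a quota check rejecting requests because of invalid policy data until the check is bypassed, can be illustrated with a minimal sketch. This is not Google's implementation: the `QuotaPolicy` record, its fields, the `bypass_quota` flag, and the status codes chosen for each branch are all hypothetical, used here only to show how one bad policy value can turn into 503 responses for every request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; field names are illustrative only.
    project: str
    limit_per_minute: Optional[int]  # None stands in for invalid policy data


def is_valid(policy: QuotaPolicy) -> bool:
    """Accept only policies whose limit is a positive integer."""
    return isinstance(policy.limit_per_minute, int) and policy.limit_per_minute > 0


def check_request(policy: QuotaPolicy, used_this_minute: int,
                  bypass_quota: bool = False) -> int:
    """Return an HTTP-style status code for one incoming API request."""
    if bypass_quota:
        # Operator override: skip the quota check entirely.
        return 200
    if not is_valid(policy):
        # Invalid policy data: the check cannot decide, so the request fails.
        return 503
    if used_this_minute >= policy.limit_per_minute:
        # Valid policy, but the caller has used up its quota.
        return 429
    return 200


bad = QuotaPolicy(project="demo", limit_per_minute=None)
print(check_request(bad, used_this_minute=0))                     # rejected
print(check_request(bad, used_this_minute=0, bypass_quota=True))  # served
```

In this sketch, every request evaluated against the invalid policy is rejected until the override is switched on, which loosely mirrors the recovery step Google describes.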

Cloudflare services taken down by Google’s outage

After restoring its own impacted services, Cloudflare also revealed in a post-mortem that yesterday’s incident was not the result of a security incident and that no data was lost.

Cloudflare Workers KV error rate during outage (Cloudflare)

“The cause of this outage was due to a failure in the underlying storage infrastructure used by our Workers KV service, which is a critical dependency for many Cloudflare products and relied upon for configuration, authentication, and asset delivery across the affected services,” Cloudflare said.

“Part of this infrastructure is backed by a third-party cloud provider, which experienced an outage today and directly impacted the availability of our KV service.”

Even though it didn’t share the name of the cloud provider behind the Thursday outage, a Cloudflare spokesperson told BleepingComputer yesterday that only Cloudflare services relying on Google Cloud were affected.

In response to this incident, Cloudflare says it will migrate KV’s central store to its own R2 object storage to reduce external dependency and prevent similar issues in the future.
