AWS Resolves Major Outage After Nearly 24 Hours of Service Disruption

Amazon Web Services experienced a significant service disruption in its US-EAST-1 region that lasted nearly 24 hours, affecting over 140 services and causing widespread issues for customers worldwide.

The outage began late on October 19, 2025, and was fully resolved by the afternoon of October 20.

Root Cause Identified as DNS Resolution Issue

The incident started at approximately 11:49 PM PDT on October 19, when AWS engineers detected increased error rates and latencies across multiple services in the critical US-EAST-1 region.

At 12:26 AM PDT on October 20, AWS identified the trigger as DNS resolution issues affecting the regional DynamoDB service endpoints. This initial problem created a cascading failure that impacted numerous other services.
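From a customer's perspective, that kind of failure shows up as the regional endpoint simply not resolving. The snippet below is a minimal diagnostic sketch, not anything AWS published: it attempts to resolve the standard regional DynamoDB hostname and retries a few times, with the retry count and delay chosen arbitrarily for illustration.

```python
import socket
import time

# Standard public regional DynamoDB endpoint for US-EAST-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def check_dns(host: str, retries: int = 3, delay: float = 5.0) -> bool:
    """Return True if the hostname resolves, retrying a few times on failure."""
    for attempt in range(1, retries + 1):
        try:
            # Collect the distinct IP addresses the name resolves to.
            addresses = {info[4][0] for info in socket.getaddrinfo(host, 443)}
            print(f"{host} resolved to {sorted(addresses)}")
            return True
        except socket.gaierror as exc:
            print(f"Attempt {attempt}: DNS resolution failed for {host}: {exc}")
            time.sleep(delay)
    return False

if __name__ == "__main__":
    check_dns(ENDPOINT)
```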

After resolving the DynamoDB DNS issue at 2:24 AM PDT, AWS faced a subsequent impairment in the internal EC2 subsystem responsible for launching new instances, which depends on DynamoDB.

The situation escalated further when Network Load Balancer health checks became impaired, resulting in network connectivity problems across services including Lambda, DynamoDB, and CloudWatch.

To manage the recovery process, AWS temporarily throttled several operations, including EC2 instance launches, SQS queue processing via Lambda Event Source Mappings, and asynchronous Lambda invocations.

Engineers worked through the morning to restore Network Load Balancer health checks, achieving this milestone at 9:38 AM PDT.

Throughout the day, AWS gradually reduced operation throttling while addressing network connectivity issues.

By 3:01 PM PDT on October 20, all AWS services had returned to normal operations. However, some services, including AWS Config, Redshift, and Connect, continued processing message backlogs for several hours after the primary resolution.

The outage particularly impacted global services and features that rely on US-EAST-1 endpoints, including IAM authentication and DynamoDB Global Tables.

Customers experienced EC2 instance launch failures, Lambda function invocation errors, and difficulties accessing storage and database services.

The disruption also prevented customers from creating or updating support cases during the peak of the incident.

AWS has committed to sharing a detailed post-event summary to provide customers with a comprehensive understanding of what occurred and the measures being implemented to prevent similar incidents.

The company recommends that customers configure Auto Scaling Groups across multiple Availability Zones and avoid targeting specific zones during instance launches to improve resilience against regional issues.
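A minimal sketch of that recommendation using boto3 is shown below. The group name, launch template, and subnet IDs are hypothetical placeholders; the relevant detail is that VPCZoneIdentifier lists subnets spread across several Availability Zones and no single zone is pinned.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Hypothetical subnet IDs, one per Availability Zone; spreading the group
# across several AZs lets it keep launching instances if one zone is impaired.
SUBNETS = ["subnet-0aaa1111", "subnet-0bbb2222", "subnet-0ccc3333"]

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier",                # placeholder name
    LaunchTemplate={
        "LaunchTemplateName": "web-tier-template",  # placeholder template
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    # Comma-separated subnet list spanning multiple AZs; no specific zone is targeted.
    VPCZoneIdentifier=",".join(SUBNETS),
)
```

Because the group can place capacity in any of the listed zones, launch failures in one impaired zone do not prevent it from replacing capacity elsewhere.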

