Amazon Reveals Technical Fault Behind Wide-Scale AWS Service Outage

Amazon Web Services experienced a major outage that affected millions of customers and Amazon’s own operations on October 19 and 20, 2025.

The company has now confirmed that a DNS resolution issue with regional DynamoDB service endpoints was the root cause of the disruption, which lasted approximately two hours and thirty-five minutes.

What Went Wrong with DNS

The outage began at 11:49 PM PDT on October 19 and continued until 2:24 AM PDT on October 20.

During this window, AWS services in the US-EAST-1 region experienced significantly increased error rates.

The problem wasn’t a widespread infrastructure failure but rather a specific issue with how the system was resolving addresses for DynamoDB endpoints.

DynamoDB is Amazon’s fully managed, high-performance NoSQL database service, and it powers countless applications. When the DNS system couldn’t properly direct requests to its endpoints, it created a cascade of problems throughout the AWS ecosystem.
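
To illustrate the failure mode, here is a minimal, hypothetical sketch in Python of how a client might check DNS resolution for the regional DynamoDB endpoint in US-EAST-1. During the incident, lookups like this would have failed, which is why SDK calls to DynamoDB could not reach the service. The hostname shown is the standard regional endpoint; the script itself is for illustration only and is not part of Amazon’s tooling.

```python
import socket

# Regional DynamoDB endpoint for US-EAST-1, the affected region.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def check_dns(hostname: str) -> list[str]:
    """Resolve a hostname and return its IP addresses, or raise on failure."""
    try:
        results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        return sorted({result[4][0] for result in results})
    except socket.gaierror as err:
        # During the outage, lookups against this endpoint failed like this,
        # so clients could not even open a connection to DynamoDB.
        raise RuntimeError(f"DNS resolution failed for {hostname}: {err}") from err

if __name__ == "__main__":
    print(check_dns(ENDPOINT))
```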

Amazon.com itself went down during the incident, along with numerous Amazon subsidiary services and AWS customer support operations.

AWS engineers identified the DNS resolution problem at 12:26 AM PDT and immediately began mitigation efforts.

They successfully resolved the core DynamoDB DNS issue by 2:24 AM PDT, marking the first major milestone in recovery.

However, solving the primary problem didn’t instantly restore everything to normal. A small subset of internal subsystems remained impaired even after the DNS issue was fixed.

These lingering problems forced AWS to take a temporary but strategic step: throttling certain operations, particularly new EC2 instance launches.

This means the system intentionally slowed down or delayed some requests rather than letting them fail completely.

While this sounds counterintuitive, it actually helped the system recover more smoothly by preventing it from becoming overwhelmed.
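
From the customer side, the practical effect of such throttling is that some API calls, such as EC2 instance launches, return throttling errors and need to be retried. As an illustration only (not a description of AWS’s internal mechanism), the sketch below shows how a caller using the boto3 SDK might enable adaptive retries so that throttled launch requests back off and retry instead of failing outright; the AMI ID and instance type are placeholders.

```python
import boto3
from botocore.config import Config

# Ask the SDK to retry throttled requests (e.g. RequestLimitExceeded)
# with adaptive backoff instead of failing on the first error.
retry_config = Config(
    region_name="us-east-1",
    retries={"max_attempts": 10, "mode": "adaptive"},
)

ec2 = boto3.client("ec2", config=retry_config)

# Placeholder AMI ID and instance type, for illustration only.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```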

By 12:28 PM PDT, significant recovery progress was visible across AWS services and customer systems.

AWS continued gradually reducing the throttling on EC2 instance launch operations throughout the afternoon.

The company’s technical teams worked methodically to address remaining impact areas while monitoring system health continuously.

By 3:01 PM PDT on October 20, AWS announced that all services had returned to normal operations.

The entire recovery process, from initial detection to complete restoration, took approximately 15 hours.

While the core DNS outage itself lasted only about two and a half hours, its aftereffects and the recovery operations extended much longer.

AWS has published a detailed post-event summary explaining exactly what happened, how their teams responded, and what changes they’re implementing to prevent similar incidents.

Amazon advises customers experiencing any lingering issues to check the AWS Health Dashboard for real-time status updates and additional information about any services that may still be experiencing difficulties.
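
For programmatic access to the same status information, accounts on a support plan that includes the AWS Health API can query it directly. The following sketch, assuming the boto3 SDK, lists open events affecting the US-EAST-1 region; it is an illustration rather than an official remediation step from Amazon.

```python
import boto3

# The AWS Health API is served from the us-east-1 endpoint and requires a
# Business, Enterprise On-Ramp, or Enterprise support plan.
health = boto3.client("health", region_name="us-east-1")

# List open or upcoming events affecting the US-EAST-1 region.
events = health.describe_events(
    filter={
        "regions": ["us-east-1"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)

for event in events.get("events", []):
    print(event["arn"], event["eventTypeCode"], event["statusCode"])
```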
