Amazon Web Services encountered significant operational challenges in its US-EAST-1 region on October 28, 2025, with elevated latencies affecting EC2 instance launches and cascading issues across container orchestration services.
The disruption, which began earlier in the day, impacted multiple AWS offerings reliant on Elastic Container Service (ECS), highlighting ongoing vulnerabilities in the cloud giant’s densely interconnected infrastructure.
Customers reported delays and failures in launching virtual machines and tasks, underscoring the region’s critical role in global operations.
The incident originated in the use1-az2 Availability Zone around midday PDT, where EC2 instance launches faced prolonged delays due to internal networking and resource provisioning hiccups.
AWS quickly notified affected users via the Personal Health Dashboard, but the problem soon extended to ECS, causing elevated failure rates for task launches on both EC2-backed container instances and Fargate serverless compute.
A subset of customers in US-EAST-1 experienced container instances disconnecting unexpectedly, leading to halted tasks and disrupted workflows.
Beyond core compute, the outage rippled into analytics and data processing tools like EMR Serverless, which relies on ECS warm pools for rapid job execution.
EMR Serverless jobs faced execution delays or outright failures as unhealthy clusters persisted in the impacted cells. Other affected services included Elastic Kubernetes Service (EKS) for Fargate pod launches, AWS Glue for ETL operations, and Managed Workflows for Apache Airflow (MWAA), where environments stalled in unhealthy states.
App Runner, DataSync, CodeBuild, and AWS Batch also saw increased error rates, though existing EC2 instances remained operational.
ECS’s cellular architecture, which distributes clusters across regional cells, amplified the scope; clusters assigned to affected cells saw impacts across all availability zones.
According to the status page, AWS traced the problem to a small number of these cells but withheld specifics on the underlying cause, a pattern reminiscent of prior dependency failures in the same region.
Recovery Timeline
AWS initiated throttles on mutating API calls in use1-az2 to stabilize the system, advising customers to retry requests rejected with “request limit exceeded” errors. By 3:36 PM PDT, EC2 launches had normalized, but ECS recovery lagged, with no immediate customer-visible improvement.
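For callers hitting those throttles, the usual client-side mitigation is to retry with exponential backoff whenever the API returns a RequestLimitExceeded error code. The boto3 sketch below is a minimal illustration of that pattern, not guidance taken from the AWS advisory; the function name, retry limits, and instance parameters are hypothetical.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def run_instances_with_backoff(max_attempts=6, **run_args):
    """Call RunInstances, retrying with exponential backoff and jitter
    whenever the API rejects the request with RequestLimitExceeded."""
    for attempt in range(max_attempts):
        try:
            return ec2.run_instances(**run_args)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code != "RequestLimitExceeded" or attempt == max_attempts - 1:
                raise  # a different failure, or retries exhausted
            # Sleep 1s, 2s, 4s, ... plus jitter before the next attempt.
            time.sleep((2 ** attempt) + random.uniform(0, 1))

# Example usage (AMI and instance type are placeholders):
# run_instances_with_backoff(ImageId="ami-0123456789abcdef0",
#                            InstanceType="t3.micro", MinCount=1, MaxCount=1)
```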
Progress accelerated by 5:31 PM, as AWS refreshed EMR warm pools and observed Glue error rate reductions, estimating full resolution in 2-3 hours.
At 6:50 PM, ECS task launches showed positive signs, prompting AWS to recommend that customers recreate impacted clusters under new identifiers or perform an update on affected MWAA environments without any configuration changes.
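Because cell assignment is tied to the cluster, recreating a cluster under a new name lands it in a different cell, after which services can be redeployed onto it. The boto3 sketch below is a minimal illustration under that assumption; the cluster, service, and task-definition names are placeholders, not values from the incident report.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Create a replacement cluster under a new name so it is assigned to a
# different (healthy) cell; "orders-cluster-v2" is a placeholder.
ecs.create_cluster(
    clusterName="orders-cluster-v2",
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
)

# Redeploy an existing workload by recreating its service on the new
# cluster (all parameters below are illustrative).
ecs.create_service(
    cluster="orders-cluster-v2",
    serviceName="orders-api",
    taskDefinition="orders-api:42",
    desiredCount=2,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)

# For MWAA, the analogous step is an environment update with no
# configuration changes, e.g.
# boto3.client("mwaa").update_environment(Name="my-airflow-env")
# where "my-airflow-env" is a hypothetical environment name.
```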
Throttles remained in place in three ECS cells, but the EMR Serverless warm pools were nearly replenished. By 8:08 PM, EMR was fully refreshed and ECS launch successes were rising, with full recovery estimated within one to two hours.
Recovery advanced significantly at 8:54 PM, and by 9:52 PM two of the three cells had fully recovered and had their throttles lifted, while the third lagged behind.
The issue was fully resolved at 10:43 PM PDT, restoring normal operations across all services. AWS confirmed no lingering impacts, though it noted that working through residual backlogs could cause minor delays.
This episode, following a major US-EAST-1 outage on October 20, exposes persistent fragility from internal service interdependencies. While not as widespread as the earlier DynamoDB-triggered event, it disrupted workflows for developers and enterprises in the busiest AWS region.
Experts note that such incidents, though contained, erode confidence in architectures that lack robust multi-region failover. AWS urged diversified cluster placements and proactive monitoring to mitigate future risks.
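One lightweight way to get that kind of visibility is an EventBridge rule that forwards ECS task-stop events to an alerting target, so unexpected terminations like the container-instance disconnects seen here surface immediately. The sketch below is an illustrative example rather than AWS's recommended configuration; the rule name and SNS topic ARN are hypothetical.

```python
import json

import boto3

events = boto3.client("events", region_name="us-east-1")

# Match ECS task state changes that end in STOPPED and route them to an
# alerting target so operators see unexpected terminations quickly.
events.put_rule(
    Name="ecs-task-stopped-alert",
    EventPattern=json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"lastStatus": ["STOPPED"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="ecs-task-stopped-alert",
    Targets=[{
        "Id": "notify-ops",
        # Hypothetical SNS topic ARN; replace with a real alerting target.
        "Arn": "arn:aws:sns:us-east-1:123456789012:ops-alerts",
    }],
)
```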