AWS apologises for 14-hour outage and sets out causes of US datacentre region downtime

Amazon Web Services (AWS) has apologised to customers affected by the 14-hour outage that hit its largest US datacentre region on 20 October, in a blog post detailing the precise nature of the technical difficulties its services suffered.

As previously reported by Computer Weekly, the outage originated in the public cloud giant’s US-East-1 datacentre region in Northern Virginia, and caused large-scale disruption to a host of companies across the world, including in the UK.

Social media and communications services such as Snapchat and Signal suffered disruption, as did Amazon-owned properties including its retail site, the Ring doorbell service and Alexa.

Financial services providers including Lloyds Banking Group, with its Halifax subsidiary, and Royal Bank of Scotland, as well as the government tax collection agency HM Revenue and Customs, were also affected by the outage in the UK.

As a result, HM Treasury is now facing calls to explain why AWS, given its role as a major supplier of cloud services to the UK financial services sector, has not already been brought into scope of the Treasury’s Critical Third Parties (CTP) regime.

The regime gives HM Treasury the power to designate suppliers to the financial services sector as CTPs, meaning their activities can be brought into the supervisory scope of the UK’s financial regulators.

The intention is that doing so will help to better manage any potential risks to the stability and resilience of the UK financial system that might arise when a third-party supplier suffers service disruption, as happened with AWS this week.

The company has now published an extensive post-event summary document, which confirms the outage unfolded in three distinct phases, caused by issues within several parts of its infrastructure.

According to the document, just before 8am UK time on 20 October, its fully managed, serverless NoSQL database offering, Amazon DynamoDB, began to experience increased application programming interface (API) error rates, which lasted for just under three hours.
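As an illustration of what elevated error rates look like from the customer side, the minimal sketch below is not taken from AWS's summary; the table name and key are hypothetical. It shows how an application might lean on botocore's built-in retry modes and degrade gracefully once retries are exhausted.

```python
# Minimal sketch: riding out a period of elevated DynamoDB API error rates
# with botocore's built-in retries plus an application-level fallback.
# The table name and key below are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# "adaptive" mode adds client-side rate limiting on top of exponential backoff.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

def get_order(order_id: str):
    try:
        resp = dynamodb.get_item(
            TableName="orders",                 # hypothetical table
            Key={"order_id": {"S": order_id}},
        )
        return resp.get("Item")
    except ClientError as err:
        # Once retries are exhausted, degrade gracefully rather than fail hard.
        print(f"DynamoDB unavailable: {err.response['Error']['Code']}")
        return None
```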

Then, from around 1pm UK time on 20 October, some of the network load balancers (NLB) within its US-East-1 region started to experience increased connection errors, which persisted until around 10pm the same day. “This was caused by health check failures in the NLB fleet, which resulted in increased connection errors,” the summary document stated.
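For context, NLB health checks continuously probe registered targets and stop routing traffic to any that fail. The hedged sketch below, which uses a placeholder load balancer ARN, shows how an operator might list which targets a load balancer currently considers unhealthy using the standard elbv2 API.

```python
# Minimal sketch: inspecting which targets an NLB's health checks currently
# mark unhealthy, the mechanism the summary says drove the connection errors.
# The load balancer ARN is a placeholder.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")
lb_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/example/abc123"

for tg in elbv2.describe_target_groups(LoadBalancerArn=lb_arn)["TargetGroups"]:
    health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
    for desc in health["TargetHealthDescriptions"]:
        state = desc["TargetHealth"]["State"]     # e.g. healthy / unhealthy
        if state != "healthy":
            print(tg["TargetGroupName"], desc["Target"]["Id"], state,
                  desc["TargetHealth"].get("Reason", ""))
```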

In addition, AWS said issues occurred when attempts were made to launch instances of its Elastic Compute Cloud (EC2) virtual servers, a problem that persisted from around 10.30am UK time on 20 October until 6.30pm.

“New EC2 instance launches failed and, while instance launches began to succeed from 10:37 AM PDT [6.37pm UK time], some newly launched instances experienced connectivity issues which were resolved by 1:50 PM [9.50pm UK time],” the summary document continued.
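To illustrate the two failure modes described, a launch that fails outright and a launch that succeeds but lacks connectivity, the following sketch retries the launch call and then waits for EC2 status checks before treating the instance as reachable. The AMI ID and instance type are placeholders, not details from the AWS write-up.

```python
# Minimal sketch: retry an EC2 launch, then wait for status checks to pass
# before assuming the new instance is reachable. AMI and type are placeholders.
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_retry(attempts: int = 5) -> str:
    for attempt in range(1, attempts + 1):
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",   # placeholder AMI
                InstanceType="t3.micro",
                MinCount=1, MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            print(f"launch attempt {attempt} failed: {err.response['Error']['Code']}")
            time.sleep(min(2 ** attempt, 60))      # simple exponential backoff
    raise RuntimeError("EC2 launches still failing after retries")

instance_id = launch_with_retry()
# A successful launch call does not guarantee connectivity:
# wait for both system and instance status checks to pass.
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
print(f"{instance_id} launched and passing status checks")
```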

It also confirmed that other AWS services hosted in US-East-1 suffered knock-on effects as a result of the issues experienced by DynamoDB, EC2 and its network load balancing setup.

“We are making several changes as a result of this operational event,” the company said. “As we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery.”

The company then concluded the summary document with an apology to any customers affected by the outage.

“While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses,” said the summary document. “We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”


