Amazon apologizes to customers affected by the massive AWS outage


Amazon Web Services (AWS) has apologized to customers affected by Monday’s massive outage, after it knocked some of the world’s largest platforms offline.

Snapchat, Reddit and Lloyds Bank were among the more than 1,000 websites and services reported to be down as a result of problems at the heart of the cloud computing giant’s operations in Northern Virginia in the United States on October 20.

In a detailed summary of the reason for the outage, Amazon said it occurred as a result of errors that meant its internal systems were unable to link websites to the IP addresses that computers use to find them.

“We apologize for the impact this event has caused to our customers,” the company said.

“We know how important our services are to our customers, their applications, their end users and their businesses.

“We know this event impacted many customers in significant ways.”

While many platforms, such as online games Roblox and Fortnite, were back up and running within a few hours of the outage, some services experienced extended downtime.

This included Lloyds Bank, where some customers were still experiencing issues in mid-afternoon, as well as US payments app Venmo and social media site Reddit.

The outage had a far-reaching impact, even reportedly disrupting the sleep of some smart bed owners.

Eight Sleep, which makes sleep “pods” with temperature and elevation options that require an internet connection, said it would “prevent interruptions” to its mattresses after some overheated and even got stuck in a tilted position.

Many experts said the outage showed how heavily the technology sector depends on Amazon, which dominates a cloud computing market largely confined to AWS and Microsoft Azure.

The company said it would also “do everything we can” to learn from the event and improve availability.

In its lengthy summary of Monday’s outage, Amazon said the issue was with US-EAST-1, its largest group of data centers, which powers much of the Internet.

Critical processes in the region’s database, which stores and manages the Domain Name System (DNS) records that allow computers to understand website URLs, effectively fell out of sync.

According to Amazon, this triggered a “latent race condition”. In other words, it exposed a dormant bug that only appears when events happen in an unexpected order.

A delay in one process, which Amazon said occurred in the early hours of Monday morning, had a knock-on effect that caused its systems to stop working properly.

Most of this process is automated, meaning it takes place without human intervention.
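
For readers unfamiliar with the term, a race condition can be sketched in a few lines of Python. The example below is purely illustrative and is not Amazon’s code: the `AddressBook` class, the two “workers” and the IP addresses (taken from a range reserved for documentation) are all assumptions made for the sake of the demonstration. It shows how a delay between checking a record and updating it can let an older update overwrite a newer one.

```python
# Hypothetical illustration of a check-then-act race. Not Amazon's code.

class AddressBook:
    """Toy record: the version of the last update applied and the IP it set."""

    def __init__(self):
        self.version = 0
        self.address = None


def is_newer(book, version):
    # The "check" step: is this update newer than what is stored?
    return version > book.version


def apply_update(book, version, address):
    # The "act" step: overwrite the record with this update.
    book.version = version
    book.address = address


def run(delay_first_worker):
    book = AddressBook()

    # Worker A checks its update (version 1) against the empty record.
    a_may_apply = is_newer(book, 1)

    if delay_first_worker:
        # A stalls here. Worker B checks and applies version 2 in the gap.
        if is_newer(book, 2):
            apply_update(book, 2, "192.0.2.20")

    # A resumes and trusts its earlier check, which may now be stale.
    if a_may_apply:
        apply_update(book, 1, "192.0.2.10")

    if not delay_first_worker:
        # Normal order: B runs only after A has finished.
        if is_newer(book, 2):
            apply_update(book, 2, "192.0.2.20")

    return book.version, book.address


print(run(delay_first_worker=False))  # (2, '192.0.2.20')
print(run(delay_first_worker=True))   # (1, '192.0.2.10')
```

In the normal ordering the record ends up on the newest version. When the first worker is delayed, its outdated update lands last and the record is left pointing at the old address, even though every individual step looked valid at the time it was checked.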

Dr Junade Ali, a software engineer and fellow at the Institution of Engineering and Technology, told the BBC that “faulty automation” was at the core of Amazon’s problems.

“The specific technical reason was an automation glitch that disabled the internal ‘address book’ systems that that region relies on,” he said.

“So they couldn’t find one of the other major systems.”

Like others, Dr Ali believes this highlights the need for businesses to be more resilient and diversify their cloud providers “so they can turn to data centers and other providers when one is not available”.

“In this case, those who had a single point of failure in this Amazon region were vulnerable to being disconnected from the Internet,” he said.
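
The idea of turning to another provider when one is unavailable can be sketched as follows. This is a minimal illustration under stated assumptions, not a production design: the function `fetch_with_failover`, the exception `AllProvidersDown` and the two stand-in providers are hypothetical names invented for this example, and a real system would also need to keep data in sync between providers.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersDown(Exception):
    """Raised when every configured provider has failed (hypothetical name)."""


def fetch_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each provider in order and return the first successful response."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except ConnectionError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersDown("; ".join(errors))


# Stand-ins for real services; both are assumptions for the demonstration.
def primary_region() -> str:
    raise ConnectionError("endpoint could not be resolved")


def secondary_provider() -> str:
    return "ok"


print(fetch_with_failover([
    ("primary", primary_region),
    ("secondary", secondary_provider),
]))  # ('secondary', 'ok')
```

With only the primary in the list, the same call raises `AllProvidersDown`, which is the single point of failure Dr Ali describes.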
