The long tail of AWS outages
The vast Amazon Web Services cloud outage that began early Monday morning highlighted the fragile interconnectedness of the internet, as major telecommunications, financial, healthcare, education, and government platforms around the world suffered disruptions. As the day went on, AWS diagnosed the issue, which originated in the company's crucial US-EAST-1 region in Northern Virginia, and began working to correct it. But the chain of downstream effects took time to fully resolve.
Researchers examining the incident particularly highlighted the length of the outage, which began around 3 am ET on Monday, October 20. AWS said in status updates that as of 6:01 pm ET on Monday, “all AWS services returned to normal operations.” The outage stemmed directly from Amazon’s DynamoDB APIs and, according to the company, “affected” 141 other AWS services. Several network engineers and infrastructure specialists told WIRED that errors are understandable and inevitable for so-called “hyperscalers” like AWS, Microsoft Azure, and Google Cloud Platform, given their complexity and sheer scale. But they also point out that this reality should not simply excuse cloud providers for being out of commission for an extended period.
“Too late” may be the operative phrase. “It’s easy to know what went wrong after an incident, but AWS’s overall reliability shows how difficult it is to prevent every failure,” says Ira Winkler, chief information security officer at reliability and cybersecurity firm CYE. “Ideally, this will be a lesson learned, and Amazon will implement more redundancies that will prevent a disaster like this from happening in the future, or at least keep it from remaining in a failed state for as long as this one did.”
AWS did not respond to WIRED’s questions about how long customer refunds will take. An AWS spokesperson says the company plans to publish one of its “post-event summaries” about the incident.
“I don’t think this was a simple outage, but I would have expected a full fix much faster,” says Jake Williams, vice president of research and development at Hunter Strategy. “Giving them their due, cascading failures are not something they get a lot of experience working with, because they don’t have outages very often. So that’s to their credit. But it’s really easy to get into the mindset of giving these companies a pass, and we shouldn’t forget that they are creating this situation by actively trying to attract more customers to their infrastructure. Customers have no control over whether they overextend themselves or what might happen financially.”
The root of the incident was a common culprit in web outages: Domain Name System (DNS) resolution issues. DNS is essentially the internet’s phone book, the mechanism that points web browsers and other clients to the correct servers. Because of that role, DNS problems are a frequent source of outages: when resolution fails, requests have nowhere to go and content never loads.
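To make that failure mode concrete, here is a minimal sketch in Python of what DNS resolution looks like from a client’s perspective. The hostname shown is only illustrative of a regional service endpoint; the point is that when a lookup fails, the client never learns which server to contact, so the request dies before a connection is even attempted.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Ask the system resolver (and, behind it, DNS) for a hostname's IP addresses."""
    try:
        # getaddrinfo performs the same lookup a browser or SDK does before connecting.
        results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({entry[4][0] for entry in results})
    except socket.gaierror as err:
        # A resolution failure leaves no IP address to connect to, so the request
        # cannot proceed at all. This is how a DNS problem turns into an outage.
        print(f"DNS resolution failed for {hostname}: {err}")
        return []


if __name__ == "__main__":
    # Illustrative endpoint only; any hostname works the same way.
    print(resolve("dynamodb.us-east-1.amazonaws.com"))
```

When resolution succeeds, the function returns the addresses a client would then connect to; when it raises `gaierror`, every request that depends on that name fails in the same instant, which is why a single resolution problem can ripple across services that otherwise have nothing wrong with them.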