Post Mortem: When Amazon's Cloud Turned On Itself - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Cloud // Infrastructure as a Service
05:29 PM
Charles Babcock
Charles Babcock
Connect Directly

Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

The snafus in the cloud, it turns out, aren't so different from those occurring in the overworked, under-automated and undocumented processes of the average data center. According to Amazon's post mortem explanation of its recent hours-long outage, the failure was apparently triggered by a human error.

If so, processes susceptible to human error are not going to be good enough in the future, if the cloud is going to be a permanent platform for enterprise computing.

The cause of Amazon's recent outage, which would have been more of a disaster than it was but for the low Easter holiday traffic, was the result of a configuration error in a scheduled network update. The change was attempted in the middle of the night--at 3:47 a.m. in Northern Virginia-"as part of our normal scaling activity," according to the official explanation. That sounds like the EC2 data center was anticipating the start of early morning activity, where big customers such as Bizo or Reddit start refreshing hundreds of websites in preparation to meet the day's earliest readers.

The primary network serving one of the four availability zones in EC2's U.S. East-1 data center needed more network capacity. The attempt to provide it mistakenly shifted the traffic off a primary network onto a secondary and lower bandwidth network used for backup purposes. This is a change that has been probably correctly implemented thousands of times. It's the kind of error an operator could makes as a wrong choice on a menu or the entry of the name of the last network worked on instead of the one needed. In short, it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare up with a spouse.

However, I thought the Amazon Web Services cloud used more automated procedures than that. I thought clearly obvious errors had been anticipated and worked through, with defenses in place. Two lines of logic, checking the operator's decision, would have halted him in his tracks. A simple network configuration error should not be the source of a monumental hit to confidence in cloud computing. But apparently it is.

What happened next is not so different from what we speculated in Cloud Takes A Hit; Amazon Must Fix EC2 a week ago, based on the cryptic postings on the Services Health Dashboard. Eight minutes after the change marked the start of what the Amazon Service Health Watch dashboard described as "a networking event." The misconfiguration choked the backup network, which caused "a large number of EBS nodes in a single EBS cluster lost connection to their replicas."

An EBS cluster is servers and disk serving as short-term storage for running workloads in a given availability zone. The preceding description doesn't sound like much of an event, but in the cloud, it triggers a massive response. Suddenly large sets of data no longer knew whether their backup copy still existed on the cluster, and a central tenet of the cluster's operation is that a backup copy is always available--in case of a hardware failure.

The networking error in itself was relatively minor and easily rectified. But the error set up a massive "re-mirroring storm," a new and valuable addition to computing lexicon's already long list of disaster terms. So many Elastic Block Store volumes were trying to find disk space on which to recreate themselves that when they failed to find it, they aggressively tried again, tying up disk operations in a zone. You get the picture.

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
1 of 2
Comment  | 
Print  | 
More Insights
2021 Outlook: Tackling Cloud Transformation Choices
Joao-Pierre S. Ruth, Senior Writer,  1/4/2021
Enterprise IT Leaders Face Two Paths to AI
Jessica Davis, Senior Editor, Enterprise Apps,  12/23/2020
10 IT Trends to Watch for in 2021
Cynthia Harvey, Freelance Journalist, InformationWeek,  12/22/2020
White Papers
Register for InformationWeek Newsletters
Current Issue
2021 Top Enterprise IT Trends
We've identified the key trends that are poised to impact the IT landscape in 2021. Find out why they're important and how they will affect you.
Flash Poll