Post Mortem: When Amazon's Cloud Turned On Itself - InformationWeek

4/29/2011
05:29 PM
Charles Babcock
Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

In building high availability into cloud software, we've escaped the confines of hardware failures that brought running systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we've discovered that we've entered a higher atmosphere of operations and a larger plane on which potential failures may occur.

The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn't work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That's an unanticipated event in cloud architecture because it isn't supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn't access anything, more servers couldn't be initiated, and for all practical purposes the cloud ceased functioning in one of its availability zones for over 12 hours.

The accounts that I have paid the most attention to in the aftermath have been those whose operations didn't fail, despite the Amazon architecture's breakdown. Accounts like the one from Donnie Flood, VP of engineering at Bizo, or Oren Michels, CEO of the Mashery. Jesse Lipson, CEO of ShareFile, an original EC2 beta customer in 2008 and still a customer, told me, "We're pretty paranoid about betting on any company, even if it's Amazon." His firm invoked the option of redirecting its traffic to Amazon's West Coast data center when it found its servers failing. ShareFile, which supplies a file sharing and storage service to business, maintains its own "heartbeat" monitoring system for its servers, and the system detected ShareFile servers disappearing after the "network event" in EC2. The system automatically shifted ShareFile traffic toward the servers that were in the West Coast data center.
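To make the heartbeat idea concrete, here is a minimal sketch of how such a monitor might decide where to send traffic. It is not ShareFile's actual system, whose details weren't described to me. The names (Server, healthy_servers, choose_region), the "east" and "west" labels, and the 30-second timeout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    region: str            # hypothetical labels: "east" or "west"
    last_heartbeat: float  # time of the last heartbeat, in seconds


def healthy_servers(servers, now, timeout=30.0):
    """Return the servers whose last heartbeat is no older than `timeout`."""
    return [s for s in servers if now - s.last_heartbeat <= timeout]


def choose_region(servers, now, primary="east", fallback="west",
                  min_healthy=1, timeout=30.0):
    """Pick the region that should receive traffic.

    Stay on the primary region while it has at least `min_healthy` live
    servers; otherwise shift to the fallback region if it is healthy.
    Returns None when neither region has enough live servers.
    """
    alive = healthy_servers(servers, now, timeout)
    if sum(1 for s in alive if s.region == primary) >= min_healthy:
        return primary
    if sum(1 for s in alive if s.region == fallback) >= min_healthy:
        return fallback
    return None


if __name__ == "__main__":
    fleet = [
        Server("web-1", "east", last_heartbeat=100.0),
        Server("web-2", "east", last_heartbeat=95.0),
        Server("web-3", "west", last_heartbeat=118.0),
    ]
    print(choose_region(fleet, now=120.0))  # East servers still reporting
    print(choose_region(fleet, now=140.0))  # East heartbeats have gone stale
```

The design choice worth noting is that the monitor belongs to the customer, not the cloud provider, so it keeps working when the provider's own control software is the thing that has gone wrong.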

I think Amazon itself should have a traffic shifting system that reroutes the bulk of customer traffic when an availability zone or whole data center is no longer available. It should shift it, as individual customers did, from East to West, degrading service no doubt, but keeping customers online. Lipson points out, however, that linking data centers might allow the harm to spread. Inside the Northern Virginia data center, the trouble spread like a contagion across availability zones, which are subdivisions of the data center operating independently. Backup measures that worked in individual cases or across a small set cascaded out of control when invoked on a scale that had previously been unanticipated.

Despite that risk, I still think Amazon must link data centers, but it must also include a circuit breaker that queues up traffic or shunts it away if it turns into a threat to the functioning facility. Within a data center, availability zones need to be, well, available, even if there is trouble in one of them. I think that means architecting services so that they operate in some isolation in one zone from troubles in another. As the outage showed, the EBS and RDS services operated across availability zones, and freezing them in one froze them in all.
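The circuit breaker I have in mind can be sketched in a few lines. This is an illustration of the idea only, not a description of anything Amazon runs. The class name FailoverBreaker and its parameters spare_capacity and queue_limit are hypothetical. The healthy facility accepts redirected requests up to its spare capacity, queues a bounded number beyond that, and shunts away the rest so that failover traffic cannot overwhelm it.

```python
from collections import deque


class FailoverBreaker:
    """Admission control for redirected traffic at the healthy facility."""

    def __init__(self, spare_capacity, queue_limit):
        self.spare_capacity = spare_capacity  # redirected requests served at once
        self.queue_limit = queue_limit        # redirected requests allowed to wait
        self.in_flight = 0
        self.queue = deque()

    def admit(self, request):
        """Return 'accepted', 'queued', or 'shunted' for one redirected request."""
        if self.in_flight < self.spare_capacity:
            self.in_flight += 1
            return "accepted"
        if len(self.queue) < self.queue_limit:
            self.queue.append(request)
            return "queued"
        return "shunted"

    def complete(self):
        """Record that one accepted request finished.

        If a request is waiting, it takes over the freed slot and is
        returned; otherwise the slot is released and None is returned.
        """
        if self.in_flight == 0:
            return None
        if self.queue:
            return self.queue.popleft()
        self.in_flight -= 1
        return None


if __name__ == "__main__":
    breaker = FailoverBreaker(spare_capacity=2, queue_limit=1)
    for name in ["a", "b", "c", "d"]:
        print(name, breaker.admit(name))
```

With a spare capacity of two and a queue limit of one, the first two requests are accepted, the third waits, and the fourth is turned away. Service is degraded for some customers, but the functioning facility stays up.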

All of this is much easier said than done when operating on the scale and complexity of Amazon's EC2. Amazon has done such a good job of pioneering the cloud that there is an immense reservoir of faith among its customers that it will eventually get it right. No one I've talked to says they're willing to switch. Cloud computing may have had a setback, but it will make a quick comeback. There is a widespread belief that when it does, it will be better. Still, it remains to be said: Amazon has got to do better than this. It has got to get it right.

Charles Babcock is an editor-at-large for InformationWeek.

