Cloud Takes A Hit: Amazon Must Fix EC2 - InformationWeek

03:32 PM
Charles Babcock

Cloud Takes A Hit: Amazon Must Fix EC2

Amazon's "availability zones" were a key protective concept for the cloud, but they failed to protect access to data when EC2 went down.

It seems to me the outage of Amazon’s cloud computing service yesterday was a signal event. IT advocates of cloud computing face severe internal skepticism that the cloud is a reliable, distributed environment. In the past, they’ve responded that skilled service providers, such as Amazon, architect against failure with availability zones, independently running sections in one data center. If you run your application in one and keep a mirror image in another, you’re protected. Some enterprises found out yesterday the architecture doesn’t work. Their critics had a field day.

Amazon’s outage in Northern Virginia yesterday impeded customer access to data beyond one availability zone in that center. Amazon has a West Coast data center as well as one in Northern Virginia, but something that wasn’t clear before became clear yesterday. Amazon zones don’t extend to different data centers in different geographic locations. This fact is reverberating today among users of cloud computing. The different availability zones are supposed to keep services running, even if part of the data center fails. They didn’t function as advertised.

Amazon Web Services has been posting its usual terse explanations to its Service Health Dashboard, but for the anxious IT manager they don't say much. They don't say, for example, when the trouble can be expected to be resolved. Service troubles started at 5 minutes before 1 a.m. Pacific time on Thursday. At 11:09 a.m., the dashboard acknowledged many customers were asking when service would be back: "We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate." Their best guess: "in a few hours."

Let's be clear on what did and did not happen. Amazon's EC2 infrastructure as a service, the compute servers, stayed up and running in Northern Virginia, but some of them lost the ability to access data, launch a customer's stored instances, and save results of running instances. That means those customer servers or “instances” that were running time-sensitive applications or customer-facing apps were rendered useless.

On the other hand, some customers may not have been affected at all. CloudSleuth, an EC2 monitoring service from Compuware that's meant to illustrate the capabilities of its Gomez monitoring service, had two test applications running in Northern Virginia Thursday, and they responded to pings, indicating that they had stayed up and running through the outage. Neither of the test apps was making use of Relational Database Service or Elastic Block Store, the key affected services. If they had needed them, they would have stalled.

A disruption to the RDS appears to have led to interruptions of the EBS storage service that Amazon offers customers to capture data and record the application instance. The failure of these services in a zone of what's known as US-East 1, an Amazon data center in Northern Virginia, was bad enough, but their failure in turn triggered RDS and EBS service disruptions in additional availability zones.

Most enterprise applications in EC2 would be making use of EBS and some would use RDS as well. Their inability to access data would render them useless in many cases for the length of the service disruption. Until Amazon can demonstrate that it knows what caused the problem and how to fix it, this disruption puts a stake in the heart of the argument that Amazon zones are adequate protection against failure.

That's because Amazon itself presents the zones as the chief protection against your application failing. "By launching instances in separate Availability Zones, you can protect your applications from failure of a single location," states the guidance for users of Amazon Machine Images.
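
To make that promise concrete, here is a minimal sketch in Python of the reasoning behind it. It is not Amazon's API or implementation; the zone labels and the helper names `place_replicas` and `survives` are hypothetical, chosen only for illustration. It assumes an application stays available as long as one copy sits in a zone that is still healthy.

```python
from itertools import cycle


def place_replicas(zones, replica_count):
    """Assign replicas to zones round-robin so no single zone holds them all."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def survives(placement, failed_zones):
    """The app stays up if at least one replica sits in a healthy zone."""
    return any(zone not in failed_zones for zone in placement)


# Hypothetical zone labels, for illustration only.
ZONES = ["zone-a", "zone-b"]
placement = place_replicas(ZONES, 2)

print(survives(placement, {"zone-a"}))            # one zone fails
print(survives(placement, {"zone-a", "zone-b"}))  # failure spreads to both zones
```

The first check passes and the second does not. The protection holds only if zones fail independently of one another; once a failure in one zone can spread to the next, the mirror image offers no cover.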

What is a zone? Only Amazon knows for sure. I know the new New York Stock Exchange data center in Mahwah, N.J., designed for high availability, was built on the border between two utility companies' territories, giving it two sources of power. To me, a cloud data center has at least two zones with distinct electricity sources. One can fail, and the rest of the facility keeps running. Likewise, with telecommunication carriers, two or more are necessary. Zones within the data center tap into different services; they're architected against both failing at the same time. Yesterday's outage, on the contrary, says zones are not insulated from one another and a service failure in one can spill over into another. This is a body blow to cloud computing.
