Commentary
5/9/2011 07:43 PM
Charles Babcock

Amazon SLAs Didn't Cover Major Outage

Customers affected by the recent EC2 outage were compensated by Amazon, but not because the terms of the service level agreement required it.

Service level agreements in the cloud don't necessarily guarantee very much. Amazon.com's SLA says it's obligated to provide 99.95% uptime, but that SLA didn't apply to many customers caught in the recent outage.
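
To put that number in perspective: 99.95% availability, if measured over a full year, permits only about 4.4 hours of downtime (0.05% of 8,760 hours is roughly 4.4 hours); measured over a month, the allowance is closer to 22 minutes. An outage lasting 12 hours to several days blows through either budget many times over.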

The SLA says customer instances--their application workloads running in Amazon virtual machines--need to be up and running 99.95% of the time. During EC2's Easter outage, most instances that were running before the trouble started continued to run. It might have been impossible to get a sleeping instance started, but sleepers aren't covered by the SLA. More important, the major failure wasn't in the core EC2 instances the SLA covers but in the services on which those instances depend. The SLA doesn't promise those services will be available 99.95% of the time, even if your site depends on them. It doesn't mention them at all.

In forthrightly describing the problem and owning up to it, Amazon went beyond the terms of the SLA and offered affected customers compensation: 10 days of free use of EC2. But make no mistake, it didn't have to, and there's no guarantee it would do so in the future.

Here are examples of companies that kept running and those that didn't. CloudSleuth, an Amazon EC2 monitoring service, had two test applications running in Amazon's U.S. East-1 region as the incident began, and it confirmed that those apps kept running all through the Easter weekend outage. All they could do was send back a ping confirming they were up, but that's all they're designed to do.
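
That's worth spelling out, because it shows how little a test app like that has to do. A rough sketch of such an "I'm alive" endpoint--not CloudSleuth's code, just an illustration in Python--could be as simple as this:

    # Minimal "I'm alive" endpoint, in the spirit of a monitoring test app.
    # Illustrative only; any request that gets a 200 back counts as proof of life.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class PingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # A successful response tells the monitor the instance is still up.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK\n")

    if __name__ == "__main__":
        # Listen on all interfaces so an external monitor can reach the instance.
        HTTPServer(("0.0.0.0", 8080), PingHandler).serve_forever()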

The many websites depending on that zone, however, from Blue Sombrero to Zencoder to the better-known HootSuite and Reddit, were dead in the water for the better part of 12 to 24 hours, and some for three days. What's the difference between them and a CloudSleuth app? While the core Reddit applications kept running, they need data delivered by Amazon's Elastic Block Store, which reads customer data off disks, and Relational Database Service, which pulls data out of MySQL databases; they use that data to maintain and update their sites. Those services were not available.
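
The distinction can be made concrete with a small sketch (the database endpoint below is a placeholder, not a real hostname): even when its own instance is healthy, a site that can't reach its data store has nothing fresh to serve.

    # The web instance may be "up," yet useless if its database service is
    # unreachable. The endpoint below is a placeholder, not a real RDS hostname.
    import socket

    RDS_ENDPOINT = "mydb.example.us-east-1.rds.amazonaws.com"  # placeholder
    MYSQL_PORT = 3306

    def database_reachable(host, port, timeout=3.0):
        # True if a TCP connection to the database endpoint succeeds in time.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        if database_reachable(RDS_ENDPOINT, MYSQL_PORT):
            print("Serving fresh content")
        else:
            # The instance itself is still running; only its data source is gone.
            print("Instance up, but its dependent services are unavailable")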

Amazon mistakenly shifted primary network traffic onto a network that wasn't designed to carry it. That network choked, prompting Elastic Block Store to conclude that backup copies of data it expected to find were no longer available. That set off a furious "remirroring storm," which froze operations in one section of Amazon's U.S. East data center, then spread to other availability zones.

Again, Amazon did what was right. But SLAs exist so companies don't have to depend on goodwill. I am reminded of one irate website maintainer's post in the midst of the crisis: "Amazon's updates [to its Service Watch dashboard] read as if they were written by their attorneys and accountant, who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy."

Bryson Koehler, senior VP at InterContinental Hotel Group, once made this comment to me in an interview: "EC2 is a best effort" service, not a sure thing. That assessment is reinforced by the narrow definition of protection in Amazon's SLA.

At San Francisco's Engine Yard, a service that hosts Ruby applications on EC2, the trouble that began brewing in the middle of the night on April 21 was spotted right away. Technical support staff members were called in, and they started using a beta service Engine Yard had in place to move customer EC2 instances out of U.S. East to other Amazon data centers, primarily U.S. West in Northern California, but also to centers in Dublin, Ireland, and in Asia, including Japan.
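
Engine Yard hasn't published that migration tool, but the basic maneuver--copying a machine image out of the troubled region and launching a replacement elsewhere--can be sketched with the AWS SDK for Python, boto3. The image ID and instance type below are placeholders:

    # Sketch of a cross-region evacuation, assuming an AMI already exists for
    # the affected workload. Not Engine Yard's tool; just the general idea.
    import boto3

    SOURCE_REGION = "us-east-1"
    TARGET_REGION = "us-west-1"
    SOURCE_AMI_ID = "ami-0123456789abcdef0"  # placeholder image ID

    # Copy the machine image from the troubled region into a healthy one.
    target_ec2 = boto3.client("ec2", region_name=TARGET_REGION)
    copy = target_ec2.copy_image(
        SourceRegion=SOURCE_REGION,
        SourceImageId=SOURCE_AMI_ID,
        Name="evacuated-app-image",
    )

    # Wait for the copied image, then launch a replacement instance from it.
    target_ec2.get_waiter("image_available").wait(ImageIds=[copy["ImageId"]])
    target_ec2.run_instances(
        ImageId=copy["ImageId"],
        InstanceType="m1.small",  # placeholder instance type
        MinCount=1,
        MaxCount=1,
    )

Moving the data those instances depend on--the Elastic Block Store volumes and databases discussed above--is the harder part of any such evacuation, which is why Engine Yard's ability to do it under pressure stood out.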

Engine Yard's own management dashboard stopped working several times for a few minutes at a stretch, but it always came back, and the process continued until all customers had been transferred or had transferred themselves. Engine Yard posted instructions on how to use the service and made it available to everyone.

When it comes to cloud computing, this example may illustrate where the real guarantee of service continuity lies--with your own contingency planning. Engine Yard depends on Amazon's infrastructure as a service, but Mike Piech, its VP of product management, said "Amazon has a strictly defined SLA," and it wouldn't have covered most of the cases affected by the recent outage.

That's why cloud users need to figure out up front what they're going to do in the event of a cloud data center failure--besides go back and finally read the fine print in their SLAs.
