Cloud SLAs: Improvements Still Needed - InformationWeek


02:45 PM
Charles Babcock

Cloud SLAs: Improvements Still Needed

There's been a slight improvement in cloud SLAs as leading service providers like HP, Amazon, and Google guarantee 99.95% uptime by the month, not year. But performance issues remain.


Cloud service level agreements (SLAs) remain a sore spot among customers of cloud computing services. Not only are SLAs rife with ill-defined terms and vendors' self-serving phrases such as "at our discretion," but they also tend to use only one metric: service uptime.

And it is a somewhat lenient metric at that. The major service providers such as Google, Amazon, and HP guarantee 99.95% uptime, meaning the service can be unavailable for 4 hours and 23 minutes a year without incurring any penalty to the supplier.

"If I were an enterprise cloud user, I'm not sure that's the only thing I would be concerned about," said Sharon Wagner, CEO of Cloudyn, a third-party monitoring service. Wagner has invented a language for describing SLAs in a way that allows them to be automatically monitored and enforced by software systems -- and he's even seeking a patent on it. He said in an interview that customers should look at additional metrics such as predictable performance levels, consistent response times, and expandable service when you need it, which is called cloud elasticity. Most SLAs are silent on those points, and performance levels could vary widely in the course of a month or year without customers having any recourse.

In one respect, there's been a slight improvement in the otherwise weak, common cloud SLA. The HP Helion Cloud uses its SLA as a point of distinction from better established providers. HP points out that most cloud service providers average their SLA uptime percentage over a year. "At Helion Public Cloud, we consistently offer protection with an SLA of 99.95% monthly availability," says the company's SLA.


It's a small difference, but 99.95% a month allows just 22 minutes of downtime in any given month, because the metric is applied to each month separately rather than averaged over the course of a year. Amazon Web Services, which used to state the annual percentage, also adopted the monthly metric without fanfare sometime in 2014. Microsoft said its 99.95% uptime guarantee now applies on a monthly basis, but it had to supply a download link to a recent SLA document before a reference to its “monthly” application could be found.

Table 1: What Your Cloud SLA Really Means
Stated uptime    Downtime/year
99.99%           53 min.
99.95%           4 hrs., 23 min.
99.90%           8 hrs., 46 min.
Source: InformationWeek, January 2015.
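The arithmetic behind these figures can be checked with a short sketch. It assumes a 365-day year and a 30-day month, and the function names are illustrative only; no provider publishes its SLA in this form.

```python
MINUTES_PER_DAY = 24 * 60

def allowed_downtime_minutes(uptime_percent, period_days):
    """Minutes a service may be down in the period without breaching the stated uptime."""
    return (1 - uptime_percent / 100) * period_days * MINUTES_PER_DAY

def format_minutes(minutes):
    """Render a duration in the same style as the table, rounded to the nearest minute."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours} hrs., {mins} min." if hours else f"{mins} min."

for pct in (99.99, 99.95, 99.90):
    yearly = allowed_downtime_minutes(pct, 365)   # assumption: 365-day year
    monthly = allowed_downtime_minutes(pct, 30)   # assumption: 30-day month
    print(f"{pct:.2f}%  per year: {format_minutes(yearly)}  "
          f"per 30-day month: {format_minutes(monthly)}")
```

For a 99.95% guarantee, the sketch gives roughly 4 hours and 23 minutes when the period is a year and roughly 22 minutes when the period is a single month, which is why the period named in the SLA matters as much as the percentage.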

Amazon's EC2, Microsoft Azure, and Google Compute Engine all use the 99.95% guarantee, and all three providers have recently switched to applying it monthly. For a major retailer using a cloud service to host its ecommerce systems, a 22-minute outage approaching the holidays in December has a lesser business impact than does a four-hour-and-23-minute one -- which would be allowable for 99.95% uptime over the course of a year. That shows how the precise wording of an SLA can make a big difference.

The Rackspace SLA for cloud server hosts doesn't mention a percentage of uptime. Rather, it says that if a host goes down, Rackspace will repair it within an hour. If the host remains down for more than an hour, a penalty of 5% of the customer's monthly bill for that server is applied for each additional hour of outage. In other words, after 21 hours Rackspace owes the customer 100% of the month's bill for that server as repayment for workloads that were down. That is a small sum compared with the business impact such an outage might have. Furthermore, even a one-hour outage, which incurs no penalty, makes for a weaker guarantee than Amazon's: one hour down in a 30-day month works out to about 99.86% uptime before any penalty kicks in.
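The credit schedule described above can be sketched in a few lines. This is only an illustration of the terms as the article summarizes them, not Rackspace's billing logic: it assumes outages are counted in whole hours, a 30-day billing month, and hypothetical function names.

```python
HOURS_IN_MONTH = 30 * 24  # assumption: a 30-day billing month

def credit_percent(outage_hours, grace_hours=1, rate_per_hour=5, cap=100):
    """Credit, as a percentage of the month's bill for one server.

    Assumes the schedule described in the article: no credit for the first
    hour, then 5% per further hour of outage, capped at 100% of the bill.
    """
    penalized_hours = max(0, outage_hours - grace_hours)
    return min(cap, penalized_hours * rate_per_hour)

def uptime_percent(outage_hours, period_hours=HOURS_IN_MONTH):
    """Uptime over the period, given total hours of outage."""
    return 100 * (1 - outage_hours / period_hours)

for hours in (1, 5, 21, 48):
    print(f"{hours:>2} h down -> credit {credit_percent(hours)}% of bill, "
          f"uptime {uptime_percent(hours):.2f}%")
```

Under these assumptions the credit reaches its 100% cap at 21 hours, and any outage beyond that point costs the provider nothing further.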

Amazon's SLA is more straightforward than it used to be, but still retains phrasing such as: "Your sole and exclusive remedy for any unavailability, non-performance, or other failure by us to provide Amazon EC2 or Amazon EBS is the receipt of a Service Credit." In other words, don't expect any money to change hands, even if there has been damage to the business. Penalties are paid out in grants of free time on EC2 servers.

Providers couch their limited guarantees in self-protective language, and much of what should be covered by the agreement doesn't warrant a mention, according to knowledgeable observers. When a reader asked on the public forum Quora in 2010 whether Amazon had a cloud SLA, Jason Read, co-founder of the third-party monitoring service CloudHarmony, responded: "SLAs don't really mean much. The typical financial compensation offered for not meeting SLAs is close to nothing."

Henrik Schinzel, CTO and co-founder of Avail Intelligence in Malmo, Sweden, responded: "I have read a bunch of SLAs and dissected them, and come to the conclusion that most of them serve two purposes: 1) Create a false sense of security for the customer; 2) Provide the companies with a bunch of legal loopholes."

Don't look at the SLA; look at the company's track record, he advised.

Some SLA language has become less self-excusing for vendors over the past four years, but the penalties remain the same in all cases: time credits instead of cash payments.

And performance is an area that still doesn't even get addressed. A survey last year by the application performance management company Compuware showed that 79% of cloud users considered their SLAs "too simplistic," and 73% believed cloud providers were hiding infrastructure problems that affect workload performance.

That's also a clear worry of Cloudyn CEO Wagner: "Workload performance metrics are not represented at all."

"My concern is, if you do experience a failure, how fast will you recover? And will you be able to restore my data?" he added. A metric in the SLA that would govern frequency of failures would be a stated mean time between failures, which could be accompanied by stated mean time to recovery.
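To see how such metrics relate to the familiar uptime percentage, the sketch below uses the standard steady-state relation, availability = MTBF / (MTBF + MTTR). The outage durations are made-up sample data and the function names are hypothetical; nothing here comes from a provider's SLA.

```python
def observed_mtbf_mttr(outage_hours, period_hours):
    """Estimate MTBF and MTTR from a list of outage durations (in hours).

    MTTR is the average outage length; MTBF is the average stretch of
    uptime per failure over the observation period.
    """
    if not outage_hours:
        raise ValueError("at least one outage is needed to estimate MTBF and MTTR")
    downtime = sum(outage_hours)
    failures = len(outage_hours)
    mttr = downtime / failures
    mtbf = (period_hours - downtime) / failures
    return mtbf, mttr

def availability_percent(mtbf_hours, mttr_hours):
    """Steady-state availability implied by a given MTBF and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical month: two outages, 30 and 90 minutes long, in 720 hours.
mtbf, mttr = observed_mtbf_mttr([0.5, 1.5], period_hours=720)
print(f"MTBF {mtbf:.1f} h, MTTR {mttr:.1f} h, "
      f"availability {availability_percent(mtbf, mttr):.2f}%")
```

Two services with the same availability figure can have very different MTBF and MTTR values, which is why stating both says more about failure frequency and recovery speed than uptime alone.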

Another metric he recommends is a stated number of technical support trouble tickets that the customer can escalate, since providers frequently apply limits that are never written down.

Additional metrics will start appearing in cloud SLAs as competition continues to heat up, he predicted. "More metrics will be added so enterprises will trust service providers." He favors more measures and more visibility into cloud operations over more penalties, until the industry matures further.

Even with these metrics, customers will need a way to monitor and manage their providers. Those who have rushed into cloud computing as a way to expand their data center resources quickly must depend on stats from the suppliers themselves.

Larger customers are already inserting some of the metrics he advocates in negotiations over their support contracts, Wagner said. But improved SLA measures should not be reserved for the largest customers. Standard SLA metrics need to become the order of the day for all users, along with ways to ensure they are enforced, he said.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...
Charlie Babcock, User Rank: Author
1/22/2015 | 3:04:53 PM
SLAs may vary by service
SLAs will vary for different services. For example, Microsoft Active Directory and API Management Service are not offered at 99.95% availability but 99.9%. Over a year, that's 8 hours and 46 minutes of downtime versus 4 hours and 23 minutes.