Microsoft Azure Outage Explanation Doesn't Soothe - InformationWeek


By Charles Babcock

Microsoft Azure Outage Explanation Doesn't Soothe

Microsoft leader's post mortem on Azure cloud outage cites a human error factor, but leaves other questions unanswered. Does this remind you of how Amazon played its earlier lightning strike incident?

Microsoft's Azure cloud outage Wednesday was apparently caused by a glitch related to leap day, according to a post mortem offered by the computer giant. Late Wednesday, the Microsoft Azure team blogged that it had moved quickly once it discovered the leap year bug to protect customers' running systems. But it could not prevent access to services in several Azure data centers from being blocked.

There was good news and bad news in the disclosure. Bill Laing, corporate VP for server and cloud, wrote in a blog Wednesday afternoon that his engineers had realized there was a leap day bug affecting the compute service at 1:45 a.m. Greenwich Mean Time Wednesday, which was 5:45 p.m. Tuesday in the Pacific Northwest. They discovered it early, while many of the affected slept.

The bug is likely to have been first detected through the Microsoft Azure data center in Dublin. "While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year," wrote Laing. The computer clocks of its Dublin facility would have been well into their leap day at 1:45 a.m. GMT.

"Once we discovered the issue, we immediately took steps to protect customer services that were already up and running and began creating a fix for the issue," Laing wrote. In other words, Microsoft appears to have given priority to protecting running systems and did so at the expense of granting access to incoming requests for service. Few would quarrel with the decision.

[ Want to learn more about a possible route out of a cloud that's experiencing a service failure? See Amazon Cloud Outage Proves Importance Of Failover Planning. ]

But for some reason, the United Kingdom's recently launched government CloudStore, which is hosted in the North Europe region, went offline, according to a Computer Business Review report.

"The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57 a.m. PST," or a little over nine hours later, Microsoft's Laing wrote.

But that wasn't the end of the story; Laing continued: "However, some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are actively working to address these remaining issues."

Which customers are affected, how are they affected, and what is the nature of the ongoing outage? Instead of touching upon any of these points in a transparent way, Laing's sharp focus has faded to fuzzy gray, with the thrice-cited "issues" serving as a substitute for saying anything concrete about the remaining problems.

The sub-regions most directly affected by the original loss of access were named in the Azure Service Dashboard Wednesday as North Europe, which best estimates suggest is served by the Microsoft data center in Dublin, Ireland, and the North Central and South Central United States. Microsoft operates Azure data centers in Chicago and San Antonio, Texas, in the Central time zone.

Microsoft also stated that its Azure Storage service was never down or inaccessible.

Prior to Laing's disclosures, Microsoft had stated that "incoming traffic may not go through for a subset of hosted services … Deployed applications will continue to run …" The subset of services affected included the SQL Azure Database and SQL Azure Data Sync services, SQL Azure Reporting, and Windows Azure Service Management.

While some services were not available in particular regions, Azure Service Management was out worldwide, an event that happened early--and was probably the first sure sign of trouble. On the other hand, the Azure Compute service continued as normal until 10:55 a.m. GMT, when the dashboard signaled that new service couldn't be granted to incoming requests in three sub-regions.

This incident is a reminder that the best practices of cloud computing operations are still a work in progress, not an established science. And while prevention is better than cure, infrastructure-as-a-service operators may not know everything they need to about these large-scale environments. The Azure Chicago facility is built to hold 300,000 servers, with a handful of people running it.

It might seem foreseeable that security certificates or system clocks could experience problems on the 29th day of February. Many were probably attended to or engineered correctly, but there's always one sleeper able to wake up and cause trouble. Thus, Microsoft's "cert issue triggered by 2/29/2012" announcement early Wednesday can join Amazon's "remirroring storm" of April 22-24, 2011. Microsoft's cryptic message suggests a security certificate was unprepared for the leap year.
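Microsoft has not published the code involved, but the classic leap-day mistake in certificate handling is a naive year increment: computing a one-year validity period by simply bumping the year field, which on Feb. 29 produces a date that doesn't exist in the following year. The sketch below is purely illustrative of that failure mode, not Microsoft's actual code; the function names are invented for the example.

```python
from datetime import date

def naive_expiry(issued: date) -> date:
    # Naive one-year validity: bump the year field directly.
    # Works 365 days a year -- but raises ValueError when issued
    # on Feb. 29, because Feb. 29 of the next year doesn't exist.
    return issued.replace(year=issued.year + 1)

def safe_expiry(issued: date) -> date:
    # One defensive fix: when the same month/day doesn't exist in
    # the target year, fall back to Feb. 28.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)

print(safe_expiry(date(2012, 2, 29)))  # 2013-02-28
```

A bug of this shape lies dormant for four years at a stretch, which is what makes it such an effective "sleeper": it passes every test that doesn't explicitly exercise Feb. 29.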

And don't forget the Dublin lightning strike last Aug. 7. It was said to have hit a utility transformer near the Amazon and Microsoft facilities, robbing them of power for an hour. In the aftermath, repeating what they had been told by the utility, Amazon operators said the force of the charge had been so great that it disrupted the phase coordination of backup generators coming online, causing them to fail.

The only problem was that the utility concluded three days later there had been no lightning strike. It said instead that there had been an unexplained equipment failure.

The lightning strike, in its way, had been a more acceptable explanation. What does it say about the cloud if random equipment failures disrupt it as well as acts of God? You can begin to see the boxes cloud providers end up in after quick explanations for reliability failures. It might be wise in the event of the next outage to remember that there are still things we don't understand about operating at the scale of today's cloud.


3/4/2012 | 8:16:02 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
There is a very simple lesson to be learned from the Azure outage (and last year's Amazon outage): You must perform detailed Business Continuity (DR) planning. Identify all potential single points of failure. The lesson here is that an "entire vendor" can be a single point of failure. Amazon advertised "availability zones" to protect their clients from outages. Oooops, Amazon suffers a multi-availability-zone outage. This Azure outage was multi-data center. Murphy is alive and thriving in the cloud community, just like he(she) is in corporate data centers. So, plan for it.

If you are using cloud services, you must have contingency plans in place for a complete vendor failure, whether that is bringing critical apps back in-house, or switching to another provider. The cloud providers are victims of their own hype, in that the growth is too rapid for them to cover all their bases. We all know that change is public enemy #1 to reliability. The growth within our cloud providers requires constant change as they expand their environments, especially when it pushes the limits of their architectures.

There is another aspect of these outages that I find disconcerting, namely, how the providers handle the problems, especially as it relates to client communications. This was not a major issue with outsourcing, because clients had dedicated account management teams. With the commodity pricing of cloud, we don't have the luxury of Customer Relationship Management with the providers. The cloud community must address this, either by providing far better on-line communications vehicles, or biting the bullet and having account managers that can serve as communications conduits in either direction.

Cloud is here to stay, and I am sure we will witness an evolutionary process. With what we are witnessing and experiencing, it is time for the cloud providers to acknowledge their lack of enterprise class maturity and develop the plans to bridge the gaps.
3/3/2012 | 3:39:54 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
No wonder! How should Microsoft have anticipated this brand new concept of a leap day? After all, when Azure was designed there was no Feb 29 on the calendar.
3/2/2012 | 6:27:20 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
I had the misfortune of having data on the first cloud that Microsoft bought. Due to active misconduct by executives, they lost 1/3 of my data. Publicly they claimed to have recovered or compensated everyone, but somehow that did not include anyone I knew whose data they trashed. They still do not have the mindset and understanding of what it takes to run a cloud that could be trusted. I have long since written them off the list of providers I would ever trust again.
3/2/2012 | 8:19:08 AM
re: Microsoft Azure Outage Explanation Doesn't Soothe
This is not a "cloud" issue. This is a Microsoft issue.
3/1/2012 | 9:03:13 PM
re: Microsoft Azure Outage Explanation Doesn't Soothe
How can you possibly state this is a "cloud" operations issue and not just a bad operations issue? Bad operations is bad operations, regardless of whether it is for a cloud service or not.

"This incident is a reminder that the best practices of cloud computing operations are still a work in progress"