Microsoft's post-mortem on the recent Azure cloud outage says a deep Azure security mechanism contained a bug that didn't recognize Leap Day. This set off a cascade of false hardware failure notices.

Charles Babcock, Editor at Large, Cloud

March 11, 2012

8 Min Read

A process meant to detect failed hardware in Microsoft's Azure cloud was inadvertently triggered by a Leap Day software bug that set invalid expiration dates for security certificates. The bad certificates caused virtual machine startups to stall, which in turn generated more and more readings of hardware failure until Microsoft had a full-blown service outage on its hands.

There was no real epidemic of hardware failure. It was simply one cloud mechanism misinterpreting what was going on with another and overreacting. The software bug itself was isolated within 2.5 hours and corrected within 7.5 hours of the first incident.

Microsoft's Bill Laing, corporate VP for the server and cloud division, posted a blog late Friday that laid out the root-cause analysis of the Feb. 29 Azure outage, following up on a March 1 post about the incident.

Here's a blow-by-blow description of what caused it.

When a virtual machine workload in Azure fails to start for the third time, what's known as the Host Agent on the server concludes it has detected a likely hardware problem and reports it to the governor of the server cluster, the Fabric Controller. The controller moves virtual machines off the "failing hardware" and restarts them elsewhere. At the start of Feb. 29, this Azure protection process inadvertently spread the Leap Day bug to more servers and eventually to other 1,000-server clusters supervised by other controllers.

The ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland, Chicago, and San Antonio.

[ Want to learn more about previous cloud outages? See Post Mortem: When Amazon's Cloud Turned On Itself. ]

Only some services were affected; most running workloads continued operating. Azure Compute's ability to spin up new workloads was affected, although it continued to run existing workloads in many cases. Also out were Data Sync Services, Azure Service Bus, and the SQL Azure Portal (the SQL Azure cloud database system itself continued operating). In some cases, these services were out for up to 16 hours.

As Feb. 28 rolled over to Leap Day at midnight in Dublin (4 p.m. Pacific time on Feb. 28), a security certificate date-setting mechanism in the Guest Agent, the agent Azure assigns to each customer workload, tried to set an expiration date one year later: Feb. 29, 2013, a date that does not exist. Without a valid expiration date, the security certificate was not recognized by the server's Host Agent, and that in turn caused the virtual machine it was meant to enable to stall in its initialization phase.
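
Microsoft has not published the Guest Agent's code, but this failure mode is easy to reproduce. Below is a minimal Python sketch, with hypothetical function names, of a "valid for one year" calculation that breaks on Leap Day, along with one defensive alternative:

```python
from datetime import date

def naive_cert_expiry(issued: date) -> date:
    # Naive "valid for one year" logic: keep the month and day and add 1
    # to the year. On Feb. 29 this asks for Feb. 29 of a non-leap year.
    return issued.replace(year=issued.year + 1)

def safe_cert_expiry(issued: date) -> date:
    # One defensive alternative: fall back to Feb. 28 when the
    # rolled-over date does not exist.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)

leap_day = date(2012, 2, 29)
print(safe_cert_expiry(leap_day))    # 2013-02-28
try:
    naive_cert_expiry(leap_day)
except ValueError as err:
    print("certificate creation fails:", err)  # day is out of range for month
```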

As virtual machines failed repeatedly, the Host Agent monitoring the server used an established workload standard--three failures in a row--as a sign of a likely or pending failure of some component of the server host's hardware. It reported the repeat failures to an automated supervisor, the cluster Fabric Controller.

Per the standard, after three repeated stalls, the Azure Fabric Controller places the server in a non-use, "human investigate" (HI) state. In the Azure cloud, HI marks repeated failure of a virtual machine (VM) in which no customer code has run. That's suspicious because, theoretically, the VM has attempted to start up with known, clean code. In such circumstances, a hardware fault is the prime suspect.
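
Microsoft has not released the agents' source, but the policy Laing describes boils down to a simple counter. A hypothetical Python sketch (the class and method names are invented for illustration):

```python
MAX_CLEAN_VM_FAILURES = 3  # the three-strikes threshold Laing describes

class HostAgent:
    """Hypothetical sketch of the policy, not Azure's actual code."""

    def __init__(self, fabric_controller):
        self.fabric_controller = fabric_controller
        self.consecutive_failures = 0

    def record_vm_startup(self, succeeded: bool, customer_code_ran: bool) -> None:
        if succeeded:
            self.consecutive_failures = 0
            return
        # A VM that dies before any customer code runs implicates the
        # platform or the hardware rather than the tenant's software.
        if not customer_code_ran:
            self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CLEAN_VM_FAILURES:
            # Suspected hardware fault: ask the Fabric Controller to put
            # this server into the human investigate (HI) state.
            self.fabric_controller.mark_human_investigate(server=self)
```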

The Fabric Controller, with 1,000 Azure servers under its domain, can respond by moving the VMs off a failing server and onto other servers. The problem with that maneuver was that the software bug in the Guest Agent assigned to each customer's workload was triggered again when it generated a new security certificate in the new location. The SSL certificate is the first link established between the guest workload and the Host Agent, so that encrypted information can flow between the two and the workload can be initialized. More failed startups led to more hardware failure reports.

Once a server is placed in an HI state, the Fabric Controller performs an automated "service healing" by moving VMs off it to healthy servers. "In a case like this, when the VMs are moved to available servers, the Leap Day bug will reproduce during [Guest Agent] initialization, resulting in a cascade of servers that move to HI," wrote Laing in the blog post.

As Leap Day got underway in Microsoft's Dublin data center, this automated mechanism meant that the problem was going to be propagated beyond the original server cluster onto others, which would start generating certificates that would also fail. Eventually, the frequent attempts to move at-risk virtual machines on "failing" hardware would spread the Leap Day bug around a 1,000-server cluster. Microsoft had anticipated such a problem within a cluster, and had established a circuit-breaker mechanism against too many HI signals generating too many server shutdowns and VM transfers. Once a certain threshold of HI warnings was reached, the 1,000-unit cluster was placed in HI status, which halts the automatic service healing and gives human operators a chance to preserve what's still running while halting a cascading bug.
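
Laing's post doesn't give the exact cutoff, so the figure below is invented, but the circuit breaker amounts to a cluster-wide check layered on top of service healing. A hypothetical sketch:

```python
HI_SERVER_FRACTION = 0.30  # hypothetical; Microsoft has not published the real threshold

class FabricController:
    """Hypothetical sketch of service healing plus the cluster-wide circuit breaker."""

    def __init__(self, servers):
        self.servers = servers
        self.healing_enabled = True  # False once the whole cluster is marked HI

    def mark_human_investigate(self, server) -> None:
        server.state = "HI"
        hi_share = sum(s.state == "HI" for s in self.servers) / len(self.servers)
        if hi_share >= HI_SERVER_FRACTION:
            # Too many suspected hardware faults at once: stop automated
            # service healing and hand the cluster to human operators.
            self.healing_enabled = False
        elif self.healing_enabled:
            self.relocate_vms(server)

    def relocate_vms(self, server) -> None:
        # "Service healing": restart the failed server's VMs on healthy hosts,
        # which on Leap Day re-ran the buggy certificate code on each of them.
        for vm in server.vms:
            healthy = next(s for s in self.servers if s.state != "HI")
            healthy.start(vm)
```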

Laing explained how the bug spread anyway:

"At the same time, many clusters were also in the midst of the rollout of a new version of the Fabric Controller, Host Agent, and Guest Agent. That ensured that the bug would be hit immediately in those clusters and the server HI threshold hit precisely 75 minutes (3 times 25-minute timeout) later at 5:15 p.m. Pacific" Feb. 28, wrote Laing in his blog. The Pacific time translates to 1:15 a.m. Greenwich Mean Time in Dublin, or 75 minutes into Leap Day.

Microsoft engineers spotted the correlation between the HI notices and the three 25-minute timeouts that occurred after the start of Leap Day. They were able to identify the bug in the Guest Agent's certificate creator two hours and 38 minutes after it first triggered in Dublin. They quickly developed a rollout plan for the repair and had code ready seven hours and 20 minutes after the first incident. But in its eagerness to counteract the bug, Microsoft rolled out a fix that caused an additional problem.

It tested the repaired Guest Agent code on a test cluster, used it with its own applications on another cluster, then rolled it out to its first Azure production cluster at 2:11 a.m. Pacific standard time (10:11 a.m. in Dublin) on Feb. 29, or a little more than 10 hours after the initial triggering of the bug (at midnight Dublin time, or 4 p.m. PST Feb. 28).

Seven 1,000-server clusters in Azure were still in the midst of a partial update of the Fabric Controller, Host Agent, and Guest Agent. Because of that, the fix did not work with them. Microsoft's attempt to push a "blast" fix to all units at the same time, instead of a more measured, gradual deployment, brought down not only the newly starting VMs but also some existing healthy ones in the seven clusters.
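
The lesson Laing draws is about deployment discipline as much as the fix itself. In rough terms (hypothetical helper names, not Azure tooling), the difference looks like this:

```python
def staged_rollout(clusters, fix, is_healthy):
    # Deploy cluster by cluster, verifying health before moving on, so an
    # incompatibility (such as clusters mid-way through an agent upgrade)
    # is caught after one cluster instead of all of them.
    for cluster in clusters:
        cluster.deploy(fix)
        if not is_healthy(cluster):
            raise RuntimeError(f"halting rollout: {cluster} unhealthy after fix")

def blast_rollout(clusters, fix):
    # The shortcut taken under time pressure: push everywhere at once, so
    # any incompatibility hits every cluster simultaneously.
    for cluster in clusters:
        cluster.deploy(fix)
```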

Running workloads rely on Azure's Access Control Service, Azure Service Bus, and other services. "... Any application using them was now impacted because of the loss of services on which they depended," Laing conceded.

Microsoft fixed this problem as well for the seven affected clusters, and got Windows Azure back to a "largely operational" status by 8 a.m. PST, or a total of 16 hours after the first incident for the last seven clusters. Mopping up problems continued for several hours through human intervention.

From the start of his blog, Laing struck an apologetic note: "We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues ... Again, we sincerely apologize for the disruption, downtime, and inconvenience this incident has caused. We will be proactively issuing a service credit to our impacted customers ... Rest assured that we are already hard at work using our learnings to improve Windows Azure," Laing wrote.

The outage demonstrates a hitherto unknown security feature built into Microsoft's Azure cloud. In order for an application to be started, the Guest Agent on the application's operating system creates a transfer certificate. The certificate includes a public key that the Guest Agent hands off to the Host Agent so that transfers of encrypted information between the hypervisor and the application can begin.

This applies a secure method of operation both deep inside the cloud and inside the virtualized host. Once the Host Agent has been handed the key by the Guest Agent, the host server can send security certificates and other encrypted information to the application. If these transmissions were intercepted or for some reason ended up in the wrong hands, only the intended recipient would have the means to decrypt them.

This is an application of basic public key infrastructure (PKI), widely used between remote parties exchanging information securely over the Internet. But a virtual machine on a server sits very close to the hypervisor managing its affairs; in fact, they occupy bounded parts of the same physical RAM on the server. Using PKI at this fine a grain has not been common in the past, and it's an example of an extra measure of security built into a public cloud infrastructure.
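
Azure's transfer-certificate format and internal APIs aren't public, but the handoff described above is ordinary public-key encryption. A minimal sketch using Python's third-party cryptography package, standing in for the Guest Agent and Host Agent roles:

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Guest Agent side: generate a keypair; the public half would ride inside
# the transfer certificate handed to the Host Agent.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key_pem = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

# Host Agent side: encrypt a secret (say, a customer certificate) to the
# guest's public key. Anyone intercepting the ciphertext cannot read it.
host_copy_of_public_key = serialization.load_pem_public_key(public_key_pem)
ciphertext = host_copy_of_public_key.encrypt(
    b"customer SSL certificate payload",
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# Only the Guest Agent, which holds the private key, can decrypt it.
plaintext = private_key.decrypt(
    ciphertext,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)
assert plaintext == b"customer SSL certificate payload"
```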

It also introduced greater complexity and several steps in which a hidden software bug, like a Leap Day problem, could cause something to go wrong. But in the long run, this type of architecture is likely to build greater confidence in public infrastructure as a service.


About the Author(s)

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
