In BC/DR, It's The Small Stuff That Can Get You - InformationWeek


Kris Domich


Companies that prep for hurricanes, floods, and earthquakes need to make sure they're not sunk by a power strip.

I see many organizations that believe they're well-insulated from major disasters--a sentiment that often grows into complacency, which eventually breeds a mindset that business continuity and disaster recovery planning and testing are basically unnecessary expenses. In the 2012 InformationWeek State of Storage Survey, only 38% of the more than 300 respondents asked about their disaster recovery and business continuity strategies said they have BC/DR processes in place and test them regularly.

After all, how often do thunderstorms knock out big swathes of the Northeast, right?

Given the extreme weather lately, calamities that can affect data centers happen more often than one might expect. But the bigger-picture answer is that disasters come in many forms. If your BC/DR planning centers on catastrophic events like a Sept. 11-style terrorist attack or the 2011 earthquake in the Washington, D.C., metro area that cut power for days, you've only addressed part of the risk. Common, everyday events like component failure, data corruption, a telecom outage, or plain old human error can cause the same level of service disruption.

While it's impossible to plan for every potential contingency, at least not without an unlimited budget, a few simple best practices can ensure appropriate levels of BC/DR for most businesses.

First, you need a mission statement: BC/DR planning must define appropriate measures to protect your organization against conceivable threats that may harm employees, customers, your ability to maintain service-level agreements (SLAs), your brand, your reputation, or any of your corporate values.

When thus encapsulated, it should become apparent that every organization--for-profit or otherwise--must take measures to protect itself from downtime, no matter how mundane or complex the cause. In this column, let's focus on the mundane side of the equation.

Common component failure--we're talking things like physical interface cards, fans, and power supplies--is one of the most frequent causes of service and application downtime in smaller data centers. Case in point: One of my clients, a moderately sized assisted living center, de-energized several critical servers during a planned outage scheduled to last two hours. The servers had been running nonstop for approximately two years. After cooling down to ambient data center temperatures for almost two hours, several of the servers' power supplies failed to initialize after being re-energized. Even though this client had an on-site support agreement covering the power supplies, without spares on hand, it took nearly five hours to receive the replacement units. That more than tripled the planned duration of the outage.

No critical system within a data center should rely solely on any single instance of these components; they should always be redundant. Most data-center-grade equipment is designed to accept redundant instances of these components; however, not all organizations take advantage of them. For example, a redundant power supply that is not plugged in, or is plugged into the same power strip or power distribution unit (PDU) as the primary one, won't do you much good.

The key concept here is "separate": To maximize the capability of dual-power-supply systems, the power supplies must be plugged into separate PDUs fed from separate breakers in separate power panels routed from separate UPS units. The UPS units should be fed by commercial power and backed by emergency generator power. While commercial power and generators can fail, they are typically the least likely to have frequent or long-term outages when compared with the downstream components; thus you have eliminated single points of failure in the places where failure is most likely to occur.
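The "separate at every level" rule can be checked mechanically against an inventory. Here's a minimal sketch, assuming a hypothetical inventory format in which each server lists the upstream power path (PDU, breaker, panel, UPS) of each supply; it flags any component shared between the two paths:

```python
# Hypothetical inventory: for each server, the upstream power path of each
# supply, listed from the device outward (PDU -> breaker -> panel -> UPS).
servers = {
    "db01":  [("pdu-a", "brk-1", "panel-1", "ups-a"),
              ("pdu-b", "brk-7", "panel-2", "ups-b")],   # fully separate
    "app03": [("pdu-a", "brk-1", "panel-1", "ups-a"),
              ("pdu-a", "brk-1", "panel-1", "ups-a")],   # shared everything
}

def shared_points(paths):
    """Return upstream components that appear in more than one supply's path."""
    seen, shared = set(), set()
    for path in paths:
        for component in path:
            if component in seen:
                shared.add(component)
            seen.add(component)
    return shared

for name, paths in servers.items():
    overlap = shared_points(paths)
    if overlap:
        print(f"{name}: single points of failure -> {sorted(overlap)}")
    else:
        print(f"{name}: power paths fully separate")
```

Any nonempty overlap means both supplies depend on the same upstream component, so the redundancy you paid for evaporates the moment that one component fails.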

In addition to ensuring that critical equipment is under an on-site support agreement, IT can combat the problem of common component failure by keeping a reasonable inventory of common spares on site. Most components today, even power supplies and equipment fans, can be field-replaced by a Level 1 or 2 data center technician. Carrying common spares is a cost-effective way to mitigate the risk of outages; think of the expense as an insurance policy to cover the time elapsed between the failure of a component and the installation of a replacement part under the on-site support agreement.

Decisions on which spares to keep on hand should be made based on the downtime tolerance for any given system as compared with the SLA of the on-site repair contract. If a server can only be down for an hour, but the vendor's contracted response time is two or four hours, it makes sense to have spares on hand for that system.

At one time, BC/DR generally meant a multisite, active-active or active-passive data center configuration involving redundant hardware that frequently sat idle. There are still cases where this is reality, such as when regimented approaches to assessing risk warrant the investment. For most of us, however, virtualization and other factors have reduced the need for that idle redundancy. That's a good thing for budgets, but it can be dangerous as well. Don't let a fried NIC worth a couple hundred dollars cost your company thousands or more in downtime.

Reader comment, 7/24/2012:
Kris, you make some excellent points (with examples) that "common, everyday events like component failure, data corruption, a telecom outage, or plain old human error can cause the same level of service disruption" as a natural disaster, and that organizations need to plan and test for both. Forrester analyst Rachel Dines wrote a report last year that broached this topic, calling it IT service continuity management.

Regarding your summation, I wanted to point out that virtualization and other factors, such as new storage technologies and the cloud, also allow for new types of recovery services that can help organizations address IT service continuity challenges. These cloud recovery services not only protect against natural disasters but can also deliver proactive failover support, providing a zero-downtime alternative for planned maintenance, site outages, and upgrades, as well as the examples you outlined above.