Three Surprising Tips To Predict Hard Disk Failures
What makes the difference between a good hard disk and a dud? Google is better able to answer this question than most companies -- and it's willing to share at least some of what it knows.
ServerWatch.com blogger Paul Rubens recently discussed a research paper that three Google employees presented at a 2007 storage technology conference. The paper, "Failure Trends in a Large Disk Drive Population," draws upon the lessons Google has learned while managing the disk drives used in its servers.
Rubens approaches the paper's findings from an enterprise perspective. There are several reasons, however, why many of Google's findings are relevant even to small businesses. For one thing, Google relies heavily upon consumer-grade SATA and PATA drives, rather than more reliable -- but also more expensive -- RAID-class disks.
Also, Google is working from a staggering sample size. The paper's authors gathered data from more than 100,000 disk drives, representing both SATA and PATA drives with capacities between 80GB and 400GB. The result is a body of statistical knowledge that is both unique and extremely valuable.
Here are a few highlights from the paper that should be of interest to any business that manages even a modest set of storage hardware:
Temperature Matters -- But Not The Way You Think!
Most of us assume that high operating temperatures are the kiss of death for consumer-grade storage devices. According to Google's findings, this simply isn't the case.
In fact, Google's data suggests that disk failure rates are much higher at low temperatures. We're not talking frigid conditions here, either: While its data associates average temperatures as high as 50C (122F) with a two percent annual failure rate, it associates an average temperature of 25C (77F) with a three percent failure rate. Drop the average operating temperature to 15C (59F), and the annual failure rate skyrockets to nearly 10 percent!
This probably comes as a shock to data-center admins who obsess over keeping their servers running within a carefully controlled temperature range. And it is probably something that manufacturers of specialized disk-cooling hardware -- a rapidly growing market -- don't want to hear.
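If you want to see where your own drives sit on this curve, drive temperature is itself exposed through SMART. Here is a minimal sketch in Python, assuming the open-source smartmontools package (smartctl) is installed and that the drive reports the common Temperature_Celsius attribute; the device path and column layout are illustrative assumptions, since output varies by drive and vendor:

```python
#!/usr/bin/env python3
# Minimal sketch: read a drive's current temperature via smartctl.
# Assumes smartmontools is installed and that the drive reports the
# common Temperature_Celsius attribute -- not every drive does, and
# the raw-value layout can vary slightly by vendor.
import subprocess

def drive_temperature_c(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        cols = line.split()
        # Typical attribute rows have ten columns; RAW_VALUE is last.
        if len(cols) >= 10 and cols[1] == "Temperature_Celsius":
            return int(cols[9])
    return None

if __name__ == "__main__":
    temp = drive_temperature_c()
    print(f"Current drive temperature: {temp}C" if temp is not None
          else "This drive does not report a temperature attribute.")
```

Logged over time, readings like this give you the average operating temperature that Google's failure-rate figures are keyed to.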
When SMART Works -- And When It Doesn't
SMART (Self-Monitoring, Analysis, and Reporting Technology) is a standard used to monitor the health and reliability of a hard disk. It can track dozens of different performance attributes (depending on the drive model and manufacturer); the idea is to alert the user before a disk failure actually occurs.
Here's the good news: Google found that four key SMART parameters -- scan errors, reallocation count, offline reallocation, and probational count -- were clearly associated with subsequent hard disk failures. If you use software capable of reading and reporting SMART data (open-source tools such as smartmontools can do this), then these are the indicators to track. A disk that reports even a single scan error, for example, is nearly 40 times more likely to fail within two months than one with no scan errors.
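To make that tracking concrete, here is a minimal sketch, again assuming smartmontools is installed. Note that the paper's "scan errors" has no single standard attribute name across vendors, so this sketch watches a common (but vendor-dependent) mapping of the other three indicators: Reallocated_Sector_Ct, Current_Pending_Sector (probational count), and Offline_Uncorrectable (offline reallocation):

```python
#!/usr/bin/env python3
# Minimal sketch: flag a drive that reports nonzero counts on SMART
# attributes resembling the predictive indicators from Google's paper.
# Assumes smartmontools is installed; the attribute-name mapping below
# is a common convention, not something the paper itself specifies.
import subprocess

WATCHED = {
    "Reallocated_Sector_Ct",   # ~ reallocation count
    "Current_Pending_Sector",  # ~ probational count
    "Offline_Uncorrectable",   # ~ offline reallocation
}

def warning_signs(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    flagged = {}
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 10 and cols[1] in WATCHED:
            raw = int(cols[9])  # RAW_VALUE is the tenth column
            if raw > 0:
                flagged[cols[1]] = raw
    return flagged

if __name__ == "__main__":
    bad = warning_signs()
    if bad:
        print("Nonzero failure indicators -- consider backing up:", bad)
    else:
        print("No warning signs among the watched attributes.")
```

Keep the caveat that follows in mind, though: a clean report from a script like this is no guarantee of a healthy drive.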
Now, here's the bad news: According to Google, more than half of the failed drives in its sample showed no problems in these four categories before they went belly-up. Even when Google factored in other, weaker SMART indicators, 36 percent of the failed drives had still shown "zero counts on all variables."
Google had hoped to use its immense SMART data set to build a useful statistical model for identifying and replacing problem disks before they fail. Unfortunately, it turns out that around half the time, SMART simply can't deliver the goods.
One Question Google Won't Answer
Does age matter when it comes to hard disk failures? It's an obvious, and very important, question. It is also the one question that Google isn't willing to address.
Google's research does show one age-related disk reliability trend. There seems to be a slightly higher failure rate during the first three months; disks that survive this burn-in period are then uniformly reliable for around a year.
Does that mean you should run new disks in a test environment for three months before putting them into production use? For most companies, I think that suggestion is just as ridiculous as it sounds: The relatively slight risk of a burn-in failure simply can't justify the costs associated with such a plan.
After one year, Google admits that its published age-related failure statistics are not very useful. According to the report's authors, the failure rates "for 3 and 4 year old drives is more strongly influenced by the underlying reliability of the particular models in that vintage than by disk drive aging effects." In other words, the manufacturer of a three-year-old drive has a bigger impact on reliability than the fact that the drive is three years old.
Here's the catch: Google considers reliability data associated with particular manufacturers or drive models to be proprietary. The company knows which manufacturers produce the best consumer-grade storage hardware, but it's not telling any of us.