HDD metrics and why mean time to failure is not terribly useful

In this podcast, we talk to Rainer Kaese, senior manager of business development for hard disk drives at Toshiba Electronics Europe, about hard disk drive (HDD) metrics.

In particular, he takes apart mean time to failure (MTTF) and shows why it’s not a very useful measure. In its place, he suggests annualised failure rate (AFR) as more useful, and shows why with reference to human lifespans.

He also talks about why mean time between failure is not applicable to hard drives, and why enterprise storage systems need enterprise drives.

What is mean time to failure (MTTF)?

MTTF is a statistical measure of how long it is likely to take for a hard disk drive to fail.

But it’s not a very useful metric. Let’s say a typical enterprise hard disk drive MTTF is 2.5 million hours, which means it may take 2.5 million hours until a drive fails. But 2.5 million hours, if you do the math, is 285 years.

That’s not the correct interpretation of that value, and there’s a lot of misunderstanding, so I would like to clarify it here and go back to a more useful value.



This MTTF of 2.5 million hours can be converted into an annualised failure rate, and this annualised failure rate for enterprise drives is 0.35%. This is a more useful value because it means that 0.35% of the HDDs you are running may fail within a year.

Let’s say you have a datacentre with 1,000 drives. [That means] 0.35% or 3.5 drives per year may fail. That would be within the reliability specification. So, you would have to budget for three to four failure replacements, and you can expect three to four failures per year.
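As a rough sketch of that arithmetic in Python (the 2.5-million-hour MTTF and the 1,000-drive fleet are the figures used above; the variable names are just illustrative):

    # Back-of-envelope sketch: derive the annualised failure rate (AFR)
    # from the data-sheet MTTF and estimate yearly failures for a fleet.
    HOURS_PER_YEAR = 8760

    mttf_hours = 2_500_000                 # enterprise HDD MTTF quoted above
    afr = HOURS_PER_YEAR / mttf_hours      # ~0.0035, i.e. ~0.35% per year

    fleet_size = 1000                      # example datacentre fleet
    expected_failures = fleet_size * afr   # ~3.5 drives per year

    print(f"AFR: {afr:.2%}")
    print(f"Expected failures per year: {expected_failures:.1f}")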

That means hard disk drives are pretty reliable, with only three to four failures per year. Of course, this is all if you operate the hard disk drives within the agreed specification. That means 24×7 operation per year with a temperature less than 42°C on average and a workload less than 550TB [terabytes] per year, and also only within a warranty period of five years.

From this 0.35%, if you divide the number of hours per year, which is 8,760, by this AFR, you come to the mean time to failure.

So, 8,760 hours divided by 0.35%, or 0.0035 – this equation gives you 2.5 million hours. If you have only one hard disk drive, it will take 285 years for this one to fail on average, but only under the agreed condition, and the agreed condition is within five years of warranty.
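Written out as a quick calculation (same 0.35% figure; this simply repeats the division described above):

    # Reverse direction: from an annualised failure rate back to MTTF.
    afr = 0.0035                     # 0.35% per year
    mttf_hours = 8760 / afr          # ~2.5 million hours
    mttf_years = mttf_hours / 8760   # ~285 years for a single drive

    print(f"MTTF: {mttf_hours:,.0f} hours (~{mttf_years:,.1f} years)")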

This 2.5 million hours, or 285 years, would mean that if you replace your hard disk drive every five years, then after 285 years you will encounter a random failure. But again, 285 years is far too long a horizon to be meaningful. A better way to phrase it is that if you have 2.5 million drives, you would have one failure per hour.

Or if you have 2,500 drives, you would have one failure every thousand hours. That would be a kind of realistic interpretation of this 2.5 million hours.
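A short sketch of that fleet-level reading (same assumed 2.5-million-hour MTTF; the fleet sizes are the examples given above):

    # Fleet-level reading of MTTF: with N drives running, the expected
    # time between failures across the whole fleet is roughly MTTF / N.
    mttf_hours = 2_500_000

    for fleet_size in (1, 2_500, 2_500_000):
        hours_between_failures = mttf_hours / fleet_size
        print(f"{fleet_size:>9,} drive(s): one failure roughly every "
              f"{hours_between_failures:,.0f} hours")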

But if you only have the mean time to failure value, and you take 8,760 hours per year divided by this MTTF, you will have the annualised failure rate, which is a more useful value.

But MTTF itself is not a very useful value, and for low-failure-rate products like hard disk drives it often leads to misunderstanding.

A better way to explain it is with another low-failure-rate type of product: the human being. My failure rate within the next year is quite low. Most people of my age operating under the specification “office worker” will survive the next year.

I asked my health insurance company, ‘What is the probability that I will fail within the next year?’ They know this value because if I fail, if I die next year, they would have to pay. They know this value very well, and they told me it’s 0.16%. Out of 1,000 life insurance contracts for people like me that they have on their books, they are calculating for 1.6 deaths in the next year.

If I do the math and calculate from 0.16%, this gives an MTTF of about 5.5 million hours, which means I’m roughly twice as reliable as an HDD; 5.5 million hours is 625 years and, of course, I will not live 625 years. The life insurance company told me they’re calculating for 82 years.
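The same back-of-envelope conversion applied to that 0.16% figure (a sketch only, repeating the arithmetic described above):

    # Applying the AFR -> MTTF conversion to the quoted human
    # "failure rate" of 0.16% per year.
    afr_human = 0.0016
    mttf_hours = 8760 / afr_human    # ~5.5 million hours
    mttf_years = 1 / afr_human       # 625 years

    print(f"MTTF: {mttf_hours:,.0f} hours (~{mttf_years:.0f} years)")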

That’s the reliability. It tells us how many failures there will be within the next year – and that’s all. It is not a statement about lifespan.

Can you explain the difference between MTTF and MTBF (mean time between failures)?

We talked about MTTF, mean time to failure. Sometimes in data sheets, it is written as MTBF, mean time between failures.

Strictly speaking, mean time between failures is meant for technical products that can be repaired. With a car, you can have a mean time to the first failure; after the car is repaired, you then have a mean time to the next failure.

As hard disk drives cannot be repaired, the correct term for the hard disk drive is MTTF, mean time to failure.

What causes drives to fail?

Anything. Drives are mechanical components with a lot of electronics.

There could be an electronic failure, such as electromigration, where some of the wires in the chip may break. There can be mechanical failures, such as the glue of the head failing or a head crash. There are many different failure modes. Fortunately, drives are very reliable.

That 0.35% means drives are highly reliable. Failure happens rarely, and on average it takes a long time to happen.

Most drives in their five-year warranty period, or even seven, eight, nine years of operation, won’t fail. The vast majority won’t fail, but it can happen.

This is why we have these statistical reliability values. Although failure may happen rarely, may happen late, or may not happen at all, there is still a remaining probability that a failure will hit your particular drive at any time.

It may happen even on the first or second day. The probability is lower, but it still may happen. This is why a backup is always important.

What’s the difference in terms of failure between a 10-drive setup and a 60-drive setup?

Failures may happen to any drive with very low probability. But there is a difference if you have only one drive or 10 drives, or if you have 60 or 120 drives. The probability for each individual drive stays the same, but the more drives you have, the higher the probability that you encounter a failure somewhere in the system.

If you have one or 10 drives, you may be able to run the system with drives of lower reliability, such as desktop drives. They have an annual failure rate of 1.5%, but if you have only one, two or four drives at that failure rate, you won’t have many failures.

Most of those systems will be stable. But if you take this 1.5% annual failure rate into a 60-bay system, each system can be expected to see roughly one drive failure every year. You may be fine with that, but most drive failures cause interruptions in service and require manual intervention to replace drives.
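A rough comparison under those assumed failure rates (1.5% AFR for desktop-class drives and 0.35% for enterprise drives, with the 10-bay and 60-bay counts used as examples in this discussion):

    # Expected drive failures per year for different system sizes,
    # comparing desktop-class and enterprise-class AFR figures.
    AFR_DESKTOP = 0.015        # ~1.5% per year
    AFR_ENTERPRISE = 0.0035    # ~0.35% per year

    for bays in (10, 60):
        print(f"{bays}-bay system: "
              f"desktop ~{bays * AFR_DESKTOP:.2f} failures/year, "
              f"enterprise ~{bays * AFR_ENTERPRISE:.2f} failures/year")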

And when you operate a 60-bay system, you cannot afford that many failures or that much manual intervention. You need to rely on low-failure-probability enterprise drives. That’s basically the difference.

Smaller systems can be run with lower-reliability drives because of the lower number of drives. With many drives in enterprise environments, you should use proper enterprise drives.

How should storage systems be set up to minimise the risk of hard disk drive failure?

Again, operate the hard disk drives within the reliability conditions given in the data sheet. A hard disk drive that is not rated for 24×7 operation should not be operated 24×7.

Hard disk drives should be operated within the specified temperature range, and they should not exceed the workload set in the data sheets. The workload is just an indication.

It is not a hard endurance limit. For enterprise drives, we say 550TB a year. If you read or write a little bit more, it doesn’t matter, but if you read or write double or triple that, which you could do if you load the hard disk drive as much as you can, reliability will be lower.

As long as you keep to these operating conditions and stay within the temperature range – 42°C or less on average gives the highest reliability – you can enjoy a long lifetime from your hard disk drives.


