Hard drive longevity and you (or your business)

Last Updated on April 14, 2017 by Dave Farquhar

I can’t believe I’ve never written about the Backblaze hard drive longevity study, but apparently I haven’t. At work we’re running up against the limitations of hard drives, so it’s good to know this kind of stuff.

Here’s what you need to know: hard drives tend to fail very early or very late. Infant mortality is why burning in new equipment is important; once a drive survives its first few days, only about 5% fail per year over the first 18 months. For the next 18 months, the annual failure rate drops to about 1.5%. Golden years! At age 3, though, it jumps to 11.8% and stays there. So keeping hard drives much longer than 4 years is generally asking for trouble. About 78% of drives live to see age 4, but from that point on, the annual failure rate is very high.
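
If you want to see how those rates compound, here’s a quick Python sketch that turns them into a survival curve. The phase lengths and rates are the rounded numbers above treated as annualized failure rates, so the output is ballpark math, not Backblaze’s exact data.

```python
# Quick sketch: compound the annualized failure rates quoted above
# (5% for months 0-18, 1.5% for months 18-36, 11.8% per year after)
# into a rough survival curve. Rounded, illustrative numbers only.

PHASES = [(1.5, 0.05), (1.5, 0.015), (float("inf"), 0.118)]

def survival(years):
    """Fraction of drives still alive after `years`."""
    alive, remaining = 1.0, years
    for duration, annual_rate in PHASES:
        span = min(duration, remaining)
        if span <= 0:
            break
        alive *= (1.0 - annual_rate) ** span  # compound the annual rate over the span
        remaining -= span
    return alive

for age in range(1, 6):
    print(f"Age {age}: about {survival(age):.0%} still running")
```

Running it puts survival at age 4 right around 80 percent, which lines up with the 78 percent figure above, and drops it to about 70 percent by age 5.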

This is Backblaze’s experience under specific and somewhat peculiar conditions, but it’s not far off at all from what I’ve observed.

Keep in mind this is the average. Every maker has made some incredibly bad drives, and every maker that’s still around has made drives better than the average too. I don’t know why they go on hot and cold streaks, but they do, and they always have. That’s why some people think a given brand of drive is the best and others think the same brand is junk. With a little luck, you can buy one brand of drive and have every one be spectacular, or with a little bad luck, buy different models of the same brand and have them all fail a day after the warranty expires.

We’ve run into issues with some number of systems falling out of Microsoft SCCM. We’re still above what Microsoft says most people can expect, but the best are always getting better, and the way you get better is by finding what’s broken and figuring out why. So I asked everyone for what I always want–MOAR DATAS!–and used what they gave me to dig up even more data for my inner data monster to obsess over and analyze. What I found is that we have too many systems that are four years old or older. We’re doing better than Microsoft says we should under good conditions, and we’re not exactly giving SCCM good conditions.
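
For what it’s worth, the age analysis itself doesn’t take much. Here’s a hypothetical sketch in Python that buckets an inventory export by age; the inventory.csv file and its name and install_date columns are placeholders I made up, not an actual SCCM export format.

```python
# Hypothetical sketch: bucket an inventory export by age. Assumes a CSV
# named "inventory.csv" with columns "name" and "install_date"
# (YYYY-MM-DD). The file and column names are placeholders, not a real
# SCCM export format.
import csv
from collections import Counter
from datetime import date

def age_in_years(install_date, today=date.today()):
    y, m, d = map(int, install_date.split("-"))
    return (today - date(y, m, d)).days / 365.25

buckets = Counter()
with open("inventory.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        years = age_in_years(row["install_date"])
        label = "4+ years" if years >= 4 else f"{int(years)}-{int(years) + 1} years"
        buckets[label] += 1

for bucket, count in sorted(buckets.items()):
    print(f"{bucket}: {count} systems")
```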

Microsoft warns that disk errors cause WMI and/or SCCM to break, and older drives are more prone to disk errors. Hard drives don’t necessarily drop dead without warning. Often they work badly, and get worse, before they finally die. Age alone doesn’t account for everything I’m seeing, but it’s a healthy chunk of it. And since I can only probe the age of the Windows build remotely, not the age of the hardware itself, it’s likely some of these systems are much older than I’m detecting.
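
If you can get a shell on the box, SMART data gets you closer to the hardware’s real age than the OS install date does. Here’s a rough Python sketch that shells out to smartctl (from smartmontools) and converts the Power_On_Hours raw value into years; it assumes smartctl is installed, that /dev/sda is the right device, and that the vendor reports the raw value as a plain hour count, none of which is guaranteed.

```python
# Rough sketch: estimate drive age from SMART power-on hours using
# smartctl (smartmontools). Assumes smartctl is installed, /dev/sda is
# the right device, and the Power_On_Hours raw value is a plain hour
# count -- some vendors encode it differently.
import re
import subprocess

def power_on_years(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Power_On_Hours" in line:
            match = re.search(r"(\d+)\s*$", line)  # raw value is the last column
            if match:
                return int(match.group(1)) / (24 * 365.25)
    return None

years = power_on_years()
print(f"Estimated power-on age: {years:.1f} years" if years is not None
      else "Couldn't read SMART data")
```

Keep in mind the device syntax differs on Windows, and power-on hours undercount calendar age on machines that get shut off, so treat the number as a floor rather than an exact figure.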

More and more companies are keeping their computers longer, but when you decide to push your computers out to age 5 or beyond, you’re gambling. Older systems are harder to manage and harder to keep secure. And that’s to say nothing of the user experience they provide, which probably isn’t optimal.

If your budget doesn’t allow system replacements as often as you’d like, at least change the hard drives every 3 years or so. Remember, the time you’re paying someone not to work while their computer is down costs money, and so does the work that isn’t getting done during that time, so weigh both of those against the price of that $50 hard drive. Or, better yet, the $80 SSD that lasts twice as long as a hard drive and runs faster.
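
Here’s the back-of-the-envelope version of that math as a tiny Python sketch; every number in it is a made-up placeholder, so plug in your own downtime, wage, and hardware figures.

```python
# Back-of-the-envelope sketch: what one drive failure costs in downtime
# versus the price of replacing the drive proactively. All numbers are
# made-up placeholders -- substitute your own.
hours_down = 4        # time the user sits idle waiting on a rebuild
loaded_wage = 45.00   # what an hour of idle labor costs you
lost_output = 60.00   # rough value of the work not getting done, per hour
drive_cost = 50.00    # replacement hard drive
ssd_cost = 80.00      # replacement SSD

downtime_cost = hours_down * (loaded_wage + lost_output)
print(f"One failure costs roughly ${downtime_cost:.2f} in downtime,")
print(f"versus ${drive_cost:.2f} for a proactive hard drive swap "
      f"or ${ssd_cost:.2f} for an SSD.")
```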

Yes, I’ll mention that in my report.

Now, here’s the important thing: did the data I observed match up with Backblaze’s? Not exactly. So while I think age is contributing to what we’re observing, it’s not the only factor. Data’s job is to make you ask questions, and in this case, the more questions I ask, the more new questions I find I have to ask. But that’s OK as long as everyone is willing to keep digging.

Backblaze covers long-term reliability. There’s a resource for DOA-type reliability too.
