Reduce your RAID failure rate

Last Updated on April 18, 2017 by Dave Farquhar

It’s not often that you end up talking about computer hardware at church. It’s especially not often that you end up talking about a RAID failure at church. But one such conversation got me thinking again about ways to reduce RAID failure rate.

This past Sunday, I talked with the executive director, who told me five of the drives in their 8-drive RAID array had failed all at once. “That’s not supposed to happen,” he said.

It isn’t. But I know why it did.

The problem with RAID

The problem is that if you buy 8 drives that were made in the same factory on the same assembly line on the same day, chances are they’re going to be as identical as drives can be. They’re going to operate the same way. That’s nice, but if there’s anything wrong with one of them, chances are all of them will have the same problem. What happens if one of those identical drives fails prematurely? Odds are the rest of the drives have the same disease and will play monkey-see-monkey-do on you.
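To put rough numbers on that intuition, here is a minimal sketch that compares an array built from one batch against an array where every drive comes from a different batch. Every figure in it is an assumption chosen for illustration, not a measured failure rate: the chance that a batch is defective, the per-drive failure chances over some fixed window, and the idea that the array survives up to two failed drives. The names `array_loss_probability`, `same_batch_loss`, and `mixed_batch_loss` are made up for this example.

```python
from math import comb


def array_loss_probability(n_drives, tolerated_failures, drive_fail_prob):
    """Chance that more drives fail than the array tolerates, assuming each
    drive fails independently with probability drive_fail_prob."""
    return sum(
        comb(n_drives, k)
        * drive_fail_prob ** k
        * (1 - drive_fail_prob) ** (n_drives - k)
        for k in range(tolerated_failures + 1, n_drives + 1)
    )


# Every number below is an illustrative assumption, not a measured rate.
N_DRIVES = 8          # array size from the story above
TOLERATED = 2         # assumed: the array survives up to two failed drives
P_BAD_BATCH = 0.05    # assumed chance that any given batch is defective
FAIL_IF_BAD = 0.50    # assumed per-drive failure chance in a bad batch
FAIL_IF_GOOD = 0.02   # assumed per-drive failure chance in a good batch

# Same batch: one roll of the dice decides whether all eight drives are bad.
same_batch_loss = (
    P_BAD_BATCH * array_loss_probability(N_DRIVES, TOLERATED, FAIL_IF_BAD)
    + (1 - P_BAD_BATCH) * array_loss_probability(N_DRIVES, TOLERATED, FAIL_IF_GOOD)
)

# Every drive from a different batch: each drive gets its own roll.
blended_fail = P_BAD_BATCH * FAIL_IF_BAD + (1 - P_BAD_BATCH) * FAIL_IF_GOOD
mixed_batch_loss = array_loss_probability(N_DRIVES, TOLERATED, blended_fail)

print(f"same batch:    {same_batch_loss:.4f}")
print(f"mixed batches: {mixed_batch_loss:.4f}")
```

With these made-up inputs, the same-batch array is noticeably more likely to be lost, even though the average drive is no worse. The bad luck simply arrives all at once instead of being spread out. Different assumptions will move the numbers, so treat this as a way to reason about the effect rather than a prediction.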

Preventing catastrophic RAID failure

I’ve always mixed my drives up as a matter of course, to the extent that any time I deployed a new server, I’d take a look around at other servers’ hot spares. If they were different from what I had, I’d swap them around. I wanted a mixture of Seagate and Fujitsu drives in my servers. Both types of drives were going to fail eventually, which is just a given, so rather than trying to prevent failure, you work to ensure they fail at different times.

A good vendor will mix your drives up for you when you order a RAID array. They may give you the same make and model, but they’ll give you drives from different batches, on the theory that if there was something wrong at a given factory one week in August, by September they’d probably have fixed it.

If your vendor won’t do that for you, order from multiple vendors. Order 1/3 of the drives from CDW. Then order 1/3 from Insight, and 1/3 from Zones.

Now that Seagate and Western Digital make most of the hard drives on the market, I would want four of each in an 8-drive array. I would want each drive to be from a different batch. There’s not much more you can do than that to randomize the life expectancy. Perhaps you could run half of the drives in an array alone for a month or so while the others sit on a shelf, then build a new array with all of them. But that’s a lot of unnecessary work.
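If you have a shelf of drives to choose from, the selection can be sketched in a few lines. This is only an illustration under stated assumptions: the `Drive` record, the `pick_mixed_array` function, and the inventory are all made up, and the `batch` field stands in for whatever date or lot code you can read off the drive label. The picker is a simple greedy one that prefers the batch, then the manufacturer, used least so far.

```python
from collections import Counter
from typing import NamedTuple


class Drive(NamedTuple):
    serial: str
    maker: str
    batch: str  # hypothetical: a date or lot code copied from the label


def pick_mixed_array(inventory, size):
    """Greedily pick drives, preferring batches, then makers, used least so far."""
    if size > len(inventory):
        raise ValueError("not enough drives in inventory")
    remaining = list(inventory)
    chosen = []
    batch_use = Counter()
    maker_use = Counter()
    while len(chosen) < size:
        # Lowest (batch count, maker count) wins; ties go to inventory order.
        best = min(
            remaining,
            key=lambda d: (batch_use[(d.maker, d.batch)], maker_use[d.maker]),
        )
        remaining.remove(best)
        chosen.append(best)
        batch_use[(best.maker, best.batch)] += 1
        maker_use[best.maker] += 1
    return chosen


# Made-up inventory for illustration only.
inventory = [
    Drive("S1", "Seagate", "A"),
    Drive("S2", "Seagate", "A"),
    Drive("S3", "Seagate", "B"),
    Drive("S4", "Seagate", "C"),
    Drive("S5", "Seagate", "D"),
    Drive("S6", "Seagate", "E"),
    Drive("W1", "Western Digital", "P"),
    Drive("W2", "Western Digital", "P"),
    Drive("W3", "Western Digital", "Q"),
    Drive("W4", "Western Digital", "R"),
    Drive("W5", "Western Digital", "S"),
]

array_drives = pick_mixed_array(inventory, 8)
for drive in array_drives:
    print(drive.serial, drive.maker, drive.batch)
```

For this inventory the picker ends up with four drives from each manufacturer and no batch used twice. With a less varied shelf it will reuse batches when it has to, which is a hint to order from somewhere else.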

I’ve talked about hard drive longevity before. Multiple times, actually. But with RAID, staggering your failures is the most critical thing. You’ll never completely prevent them.

My track record

In eight years of administering large systems–I once administered a system with 20,000 users on it–I never lost a RAID array, except for one incident that wasn’t my fault. I lost individual drives, of course, but whenever I replaced the failed drive, that was always the end of it.

The server vendors know this

Sometimes the drive was under warranty. Regardless of the make of machine–HP/Compaq or Dell–they never asked anything about the drive other than capacity. They’d send whatever they had via courier, and I’d put it in. Sometimes it was much faster than the old ones. Sometimes the capacity was larger. But it worked. So if you start out with an identical array, it won’t stay identical anyway–as long as you catch the failed drives before you lose enough of them to lose the array.

So remember: Hard drives aren’t like tires. You want to mismatch them as much as you can, unless you want to lose a server.
