RAID is not enough

In an earlier article I wrote that:
RAID is not backup. It’s just increasing the reliability of the storage device. It’s still just one copy of the data. If you delete or corrupt a file, that deletion or corruption is stored reliably.
This is very true, but there are more risks than just the above. There are a few misconceptions about RAID around, so I’m going to expand on them a bit…

Flamingo Fly-by
Ngorongoro Crater, Tanzania
EOS 30D, 100-400mm (F1_53A1)
First of all, let’s recap some of the underlying technology:
RAID (Redundant Array of Inexpensive Disks) has been around for many years now. It usually refers to a group of disk drives that the computer treats as a single device for storing files, while behind the scenes the data is stored redundantly across multiple drives so that if one fails the computer can keep on running. The simplest form is RAID-1 (or “mirroring”), with drives in pairs holding identical copies. Fancier forms such as RAID-5 distribute the data across more drives (typically 3-5) and cope with the failure of any one of them.
“Software RAID” is where the disks are connected as normal devices to the computer, and the RAID function is provided by software. For instance OS X has this support as standard.
“Hardware RAID” is a device that connects to the computer and usually appears as a single device, but internally connects to multiple disks.
More-advanced models of hardware RAID use multiple connections to the computer (e.g. Firewire or FibreChannel connections to separate controllers in the computer), dual redundant power supplies, etc. These try to improve reliability further by coping with more risks than just the possible failure of a disk drive, but the extras add significant cost and are usually reserved for “enterprise-level” systems needing 99.99% uptime.
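To make the “cope with the failure of any one drive” idea concrete, here is a minimal sketch of the XOR parity trick that RAID-5 style systems are built on. It is an illustration only: the block contents and the helper name `xor_blocks` are made up for the example, and a real RAID-5 controller works on much larger stripes and rotates the parity across all the drives rather than keeping it on one.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per data drive.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is what the extra drive's worth of space holds.
parity = xor_blocks(data)

# Simulate losing the second drive: XOR the survivors with the parity
# block and the missing block comes back.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Lose two blocks at once and there is no longer enough information to rebuild either of them, which is why this kind of scheme only protects against a single failure.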
RAID is usually single-disk protection
When a disk in a RAID volume fails, it needs to be replaced by another, and the RAID controller will then rebuild the contents of that disk from the contents of the others. Once the new disk is synchronised, protection is restored.
If you’ve got a shiny new Drobo unit with multiple drives humming away and you decide to demonstrate to a friend or your boss how robust it is by pulling out a drive, be aware that if any of the remaining disks chooses that moment to have a problem, the whole shebang will come to a crashing halt. Once you replace the drive the system will begin to resynchronise, and only when resynchronisation is complete are you safe again. Smart RAID controllers may be able to optimise this with journals and avoid re-writing the entire drive, but some (e.g. OS X’s software RAID) can take hours to rebuild the disk. With software RAID the reason for the rebuild can be as annoying as not having all the USB drives powered up when you rebooted…
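For a feel of how long that window of vulnerability can be, here is a back-of-envelope sketch for a full rebuild. The 50 MB/s sustained rate is purely an assumption for illustration, not a measured figure for any of the devices mentioned here, and `rebuild_hours` is a hypothetical helper.

```python
def rebuild_hours(disk_bytes, bytes_per_second):
    """Hours needed to rewrite a whole disk at a sustained rate."""
    return disk_bytes / bytes_per_second / 3600

# Assumed figures for illustration only: a 1 TB disk rewritten at 50 MB/s.
hours = rebuild_hours(1_000_000_000_000, 50_000_000)
print(f"{hours:.1f} hours")  # prints "5.6 hours"
```

Plug in your own disk size and whatever rate your unit actually achieves. A system that is busy serving files during the rebuild will probably do worse than the raw number suggests.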
Upgrading disks in a RAID volume usually involves “failing” a disk (by removing it) and replacing it with a bigger one. The fancy RAID systems such as the ReadyNAS X-RAID and Drobo’s BeyondRAID use this as the key to increasing the capacity of the device: once the device has been rebuilt, the extra space is added to the pool.

Few among Many
Serengeti Plain, Tanzania
EOS 30D, 100-400mm (F1_4E8A)
But until the disk is rebuilt your data is at risk: if any of the remaining disks hiccups then all your data will be lost. Most of these devices can only cope with a single disk failure. In rough terms, a ReadyNAS NV+ or a Drobo with four 1TB drives will provide you with almost 3TB of data space and the ability to cope with a single disk failure.
The DroboPro can handle up to eight drives, and with 1TB drives would provide you with 7TB of data space. However it does have an optional configuration where it only provides you with 6TB of space (assuming the same set of eight 1TB drives) but is able to handle the failure of any two drives. It’s nice, but you do need to sacrifice a chunk of your capacity to provide the extra safety.
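The capacity figures above follow a simple rule of thumb, sketched below. It assumes all drives are the same size and ignores formatting and filesystem overhead (which is why the real number is “almost” 3TB rather than exactly 3TB). The function name `usable_tb` and its parameters are just for illustration.

```python
def usable_tb(drive_count, drive_tb, redundancy=1):
    """Rough usable space when `redundancy` drives' worth of
    capacity is set aside for protection."""
    if drive_count <= redundancy:
        raise ValueError("need more drives than the redundancy level")
    return (drive_count - redundancy) * drive_tb

print(usable_tb(4, 1))                 # 3: four 1TB drives, single-disk protection
print(usable_tb(8, 1))                 # 7: eight 1TB drives, single-disk protection
print(usable_tb(8, 1, redundancy=2))   # 6: eight 1TB drives, dual-disk protection
```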
So put simply: upgrading disks in a RAID system is a risky exercise. You won’t have any protection until it’s complete (except in the above DroboPro configuration). Too many people assume that once they’ve put their data onto a RAID drive such as a Drobo then they’re safe. Nothing’s ever that simple.
All this isn’t bad news though. It’s not at all a reason to avoid using RAID. If you’re using RAID it should be as part of your whole data management system (e.g. to protect the primary members of your data sets). You must also have regularly-updated backups of your data (possibly on another RAID device) and if you’ve updated a backup prior to doing a disk upgrade you shouldn’t lose anything if the upgrade of your RAID fails.

In my own environment I use a system of multi-stage backups, with a primary copy of my data on local Firewire-connected disks. I’m starting to add Firewire-connected Drobo units for improved speed and robustness, but I will continue to keep backup copies of all the data on external drives that get rotated off-site. Currently these backups are on a mixture of external USB drives and “bare” SATA drives which connect into a USB/SATA “dock” on my desk.
We do have a ReadyNAS NV+ connected to the gigabit Ethernet LAN (configured with “jumbo frames” on all the machines for maximum speed). It currently has only 900GB of RAID (four 320GB drives) although I am considering upgrading the disks soon. The file-transfer speed of this over the LAN is as good as most Firewire-connected drives, but due to the inherent restrictions of network filesystems I don’t currently store photo files on it. Lightroom catalogs can’t be locked properly on network filesystems: Lightroom insists that they be on local storage. Also things like folder searches are always slower over a LAN than to a local disk. At one stage most of my photo files were on the NAS box protected by RAID, but even though the file read/write speed is impressive, once I moved the photo sets to Firewire disks, Lightroom and Bridge “felt” a lot faster because there was so much directory-lookup activity. Waiting for the machine to catch up with you is VERY frustrating…

Lesser Flamingos
Ngorongoro Crater, Tanzania
EOS 30D, 100-400mm (F1_53A6)
The NAS machine’s major advantage is that it’s always accessible to all the machines on the network: for example rebooting the image workstation doesn’t affect other machines. As well as hosting shared “normal” files for the network machines, it also provides networked Time Machine storage for a wireless-connected Mac Mini.
Why use RAID?
RAID can be very useful, but it’s up to you to decide whether the risks it reduces are worth the extra cost for your system.
If you have your primary storage on “normal” (non-RAID) drives and a drive fails, your machine will grind to a halt and you’ll have to restore from your last backup onto a replacement drive. If your backups are frequent enough you shouldn’t lose much work, although the “down-time” while you restore the system will affect your productivity. For some people this is enough protection.
If you have your primary storage on RAID units and a drive fails, your machine will keep running and you should be able to replace the drive and continue with no interruption. But there is still a risk that the system will die before the drive is replaced and resynchronised, so you’d better still have access to frequent complete data backups.
Even if only because today’s “failure” was simply that you deleted the wrong file!
— David
