Linux Software RAID
Written by Kevin Korb
as a presentation for GOLUG
Presented on 2006-02-02
This document is available at http://www.sanitarium.net/golug/Linux_Software_RAID.html
- What is RAID?
RAID is a Redundant Array of Inexpensive/Independent Disks. The idea is to bundle multiple disks together in a way that provides increased speed, capacity, and redundancy to reduce the risk of data loss due to failed hard drives. The Linux kernel can provide RAID devices using the md driver, which is standard in the kernel. The user-level tools are mdadm (http://cgi.cse.unsw.edu.au/~neilb/mdadm) or raidtools (http://people.redhat.com/mingo/raidtools/). Note that the raidtools package is deprecated, so unless you have a special need for it you should be using mdadm.
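If you want a quick check that the md driver and the mdadm tool are available on your system, something like this will do (a minimal sketch):
cat /proc/mdstat    # lists the RAID personalities the kernel supports and any active arrays
mdadm --version     # confirms mdadm is installed and shows its version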
- What is RAID good for?
- Redundancy: All of the RAID levels other than RAID0 provide redundancy. Since hard drives have fragile moving parts, with motors and bearings that must spin at a constant speed, they are a common failure point in computers and their failures often lead to data loss. RAID provides redundancy so that the loss of a single drive doesn't destroy any data.
- Performance: RAID arrays tend to perform faster than a single disk drive (this depends on the level of RAID and is more true for reading than writing). Since hard drives contain moving parts there is a limit to how fast they can communicate with the computer. If you have multiple drives working in parallel you generally increase the overall performance.
- Capacity: With RAID you can bundle multiple drives together to allow a single filesystem to be bigger than any single drive could hold. If you need a filesystem that is bigger than 500GB, RAID is the way to go. Of course LVM can also accomplish this, but LVM is usually implemented on top of RAID since plain LVM gives you no redundancy or performance benefit.
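As a sketch of how LVM is typically layered on top of an md array (the volume group and logical volume names here are made up for illustration):
pvcreate /dev/md0                  # turn the RAID array into an LVM physical volume
vgcreate raidvg /dev/md0           # create a volume group on top of it
lvcreate -L 100G -n data raidvg    # carve out a logical volume to hold a filesystem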
- What can't RAID do?
- RAID is NOT a backup system. It does not protect you against accidental file deletion or corruption. If you accidentally delete a file from a RAID array it is just as gone as it would be from a single disk. Do NOT implement a RAID solution assuming you will no longer need to back up your data.
- RAID is NOT a guarantee of uptime. This is especially true of software based RAID. If a disk suddenly dies in a SCSI system, the entire bus will usually hang, sometimes for minutes, before the system gives up trying to access the disk; other times the SCSI controller will get stuck in an endless cycle of SCSI resets. I have seen the load average on a server go beyond 500 before the kernel finally kicked the disk from the array. On IDE systems a disk that suddenly dies can cause the entire system to lock up. Even though RAID isn't giving you guaranteed uptime, it still protects the data. Once the dead disk is pulled and the system is rebooted the data will still be there on the degraded array.
- What is the difference between hardware and software RAID?
In hardware RAID all of the RAID functionality is provided by a RAID controller (usually a PCI card or external chassis connected to a SCSI controller). You connect disks to the RAID controller and set up the array within its BIOS. The controller then exports a virtual disk to the OS that can be partitioned and formatted. You will need drivers for your OS to access the card and you will need user-level tools to monitor and maintain the array. These drivers and tools depend on which controller you are using. In software RAID you take a bunch of regular disks, partition them, and use the md driver in the Linux kernel to create a RAID array on a set of the partitions. The important difference is that in hardware RAID you partition the array, while in software RAID you RAID the partitions.
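Since software RAID works on partitions, each member partition is usually given the partition type fd (Linux raid autodetect) so that the kernel can recognize it at boot. A minimal sketch using the old-style sfdisk option (the device and partition number are just examples):
sfdisk --change-id /dev/sda 1 fd    # mark partition 1 on /dev/sda as Linux raid autodetect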
- What are the advantages of software RAID over hardware RAID?
- Cost: Software RAID is cheaper than hardware RAID because no extra controller is needed. RAID controllers tend to be somewhat expensive compared to regular "dumb" or "JBOD" controllers. You can probably use the controllers built into your motherboard unless you need additional ports. Note that for IDE you should only put one drive on each bus since the master/slave setup will completely destroy the performance of an array.
- Performance: If you are on a modern system with a good bus and a good CPU it may actually be faster to run software RAID than hardware RAID (unless you buy a really expensive hardware RAID). Even if your CPU isn't the latest it is still likely to perform as well as all but the most expensive hardware RAID controllers. Note that most of the motherboards that have integrated RAID controllers have chipsets that are so bad they are actually slower than software RAID (and they tend to only support RAID0 or RAID1). Most of the integrated RAID chipsets (like HighPoint) actually are software RAID in disguise.
- Disk Flexibility: Since software RAID is RAIDing the partitions instead of the disks it is easier to compensate for disks that are slightly different sizes. I generally put a swap partition on each RAID disk that is either 2% of the disk or 2GB, whichever is bigger. That way if I have to replace a failed disk with a slightly smaller one I can compensate by reducing the size of the swap partition. The alternative is to wipe out and recreate the entire array to match the smallest drive size.
- Partition Flexibility: Since software RAID is RAIDing the partitions instead of the disks it is possible to have different RAID levels existing on the same disks. For instance you could have 4 disks with the first 2 containing a RAID1 of /boot, the second 2 containing a RAID1 of /, and then all 4 containing a RAID5 of /home. Some hardware RAID controllers can also do this but only the most expensive of them and they usually do it in really weird and confusing ways.
- Disk Monitoring and Diagnostics: Since most hardware RAID controllers only export the virtual RAID volumes it isn't easy to debug disk problems. 3Ware controllers allow you to probe the individual disks for SMART diagnostics but none of the other controllers (AFAIK) do. It is virtually impossible to benchmark or test an individual disk without removing it from the system and testing it on a non-RAID computer. This of course must be done with the array shut down the entire time or you will have to rebuild the array after each drive.
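With software RAID, for example, you can check SMART health or run a quick benchmark against each member disk directly while the array is running (a sketch; it assumes smartmontools and hdparm are installed and uses an example device name):
smartctl -H /dev/sda    # quick SMART health verdict for one member disk
smartctl -a /dev/sda    # full SMART attributes and error log
hdparm -tT /dev/sda     # rough read benchmark of the individual disk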
- What are the advantages of hardware RAID over software RAID?
- Performance: If you have the money for a top of the line RAID controller the performance will be significantly better than software RAID. There will also be less CPU overhead. If you are planning on a huge database with lots of activity then you probably need to invest in a hardware RAID array. This is especially true with RAID5 due to its constant calculation of parity.
- OS Independence: Since the RAID controller is exporting a single virtual disk to the OS it doesn't really matter what OS you are running. If your OS has drivers to communicate with the RAID controller itself then you can use the RAID volume. With software RAID each OS has a completely different and incompatible implementation so there is no way the same software array could be used from multiple OSs.
- RAID Level Flexibility: Since the RAID controller is exporting a single virtual disk to the OS you are able to boot from any RAID level instead of just RAID1 as with software RAID. If you want to boot from a RAID5 this is the way.
- Error Checking: When the good high end RAID controllers (like 3Ware) are running RAID5 they are constantly checking the parity of the data since that is all the CPU on the controller has to do. This sometimes allows them to find and repair bad sectors on a disk before they become problems. Software RAID OTOH just relies on the disks themselves to detect and repair bad sectors. If the disk is unable to repair a bad sector it will return an IO error and the kernel will kick the drive from the array leaving you with a degraded array.
- What are the different levels of RAID and what are they for?
- RAID0: This is simple striping of data across multiple disks. There is NO redundancy at this level. If one disk dies the entire array is lost. RAID0 is only useful for pure speed when redundancy is not needed or when combined with RAID1 to make a RAID10 or RAID 0+1 array. RAID0 gives you the capacity of all of your disks since there is no space reserved for redundancy.
- RAID1: This is simple mirroring. 2 or more disks that are exact copies of each other. If one disk is lost the other continues in degraded mode until a replacement is added. RAID1 does (usually) offer a small speed increase for reading but not for writing. RAID1 is the only software RAID level that you can boot from because it is the only one that leaves a plain partition that the BIOS and boot loader can read. RAID1 gives you the capacity of one of your disks no matter how many disks are in the array.
- RAID5: RAID5 is a very popular RAID because it offers redundancy but uses parity to reduce the cost in terms of redundant disks. In RAID5 instead of mirroring data on dedicated mirror drives like RAID1 you have parity data that can be used to recreate the data that is stored on any one drive in the array. That means that if you have a 6 drive RAID5 array you have the capacity of 5 of the drives and if any one drive dies the array can continue in degraded mode until the dead drive is replaced. The array then reconstructs the missing data based on the remaining data and the parity data to complete the array. The parity data is striped across all of the disks so that there isn't a single dedicated parity drive doing all of the work like there is in RAID3 or RAID4. Here is a representation of a RAID5 array:
         | Disk1  | Disk2  | Disk3  | Disk4  | Disk5  | Disk6  |
Stripe01 | Parity | Data1  | Data2  | Data3  | Data4  | Data5  |
Stripe02 | Data1  | Parity | Data2  | Data3  | Data4  | Data5  |
Stripe03 | Data1  | Data2  | Parity | Data3  | Data4  | Data5  |
Stripe04 | Data1  | Data2  | Data3  | Parity | Data4  | Data5  |
Stripe05 | Data1  | Data2  | Data3  | Data4  | Parity | Data5  |
Stripe06 | Data1  | Data2  | Data3  | Data4  | Data5  | Parity |
Stripe07 | Parity | Data1  | Data2  | Data3  | Data4  | Data5  |
Stripe08 | Data1  | Parity | Data2  | Data3  | Data4  | Data5  |
Stripe09 | Data1  | Data2  | Parity | Data3  | Data4  | Data5  |
Stripe10 | Data1  | Data2  | Data3  | Parity | Data4  | Data5  |
Stripe11 | Data1  | Data2  | Data3  | Data4  | Parity | Data5  |
Stripe12 | Data1  | Data2  | Data3  | Data4  | Data5  | Parity |
RAID5 is fast for reading however it is slow for writing and has heavy CPU overhead due to the parity calculations. When RAID5 is degraded it is slow all the time for the same reason. You need at least 3 disks for a RAID5 array.
- RAID3/4: These were the first of the RAID levels that used parity data. Neither is used much anymore since RAID5 is a much better implementation of the same concept. In RAID3 and RAID4 you have parity data like in RAID5, but the parity data is all stored on one disk. That disk sees more activity than any of the other disks, which tends to make it fail more often. Like RAID5, RAID3 and RAID4 give you the capacity of the number of disks you have minus one. Also like RAID5, you need at least 3 disks for a RAID3 or RAID4 array.
- RAID6: RAID6 is essentially the same concept as RAID5 except there are 2 disks worth of parity data. This means you have the redundancy to lose 2 disks at the same time without losing the array, but at the cost of 2 disks worth of capacity. You need at least 4 disks for a RAID6 array; however, you will not see a benefit unless you have at least 5 disks. (A sample creation command appears with the other mdadm examples below.)
- RAID10 (1+0): This is the best of the RAID levels. It is very fast for reading and writing, and has the most redundancy. The idea of RAID10 is that you make a RAID0 array out of a set of RAID1 arrays. The RAID0 portion is not redundant by itself, but each of its components is. Here is a diagram of a RAID10 array:
RAID0     | Disk  | Disk  |
RAID1 (1) | Disk1 | Disk2 |
RAID1 (2) | Disk3 | Disk4 |
RAID1 (3) | Disk5 | Disk6 |
Here is a representation of the same array with a bad disk:
RAID0     | Disk  | Disk           |
RAID1 (1) | Disk1 | Disk2          |
RAID1 (2) | Disk3 | Disk4 (FAILED) |
RAID1 (3) | Disk5 | Disk6          |
As you can see, when disk #4 dies it leaves RAID1 #2 degraded; however, the other 2 RAID1 arrays are still redundant. That means that you could lose up to half of the disks in the RAID10, as long as no 2 of them are the same RAID1 pair.
- RAID0+1: This is the same concept as RAID10 except that it is a RAID1 of 2 RAID0 arrays:
RAID1 | RAID0 (1) | RAID0 (2) |
      | Disk1     | Disk2     |
      | Disk3     | Disk4     |
      | Disk5     | Disk6     |
The disadvantage of RAID0+1 can be easily seen when I kill disk #4 again:
RAID1 | RAID0 (1) | RAID0 (2)      |
      | Disk1     | Disk2          |
      | Disk3     | Disk4 (FAILED) |
      | Disk5     | Disk6          |
As you can see, when 1 drive dies it takes its whole RAID0 with it, leaving only a single non-redundant RAID0. The other disks that made up the failed RAID0 simply sit idle until the entire array is repaired. When the bad disk is replaced the rebuild will also take longer, since the second RAID0 is considered to be new and the entire thing must be synchronized. RAID0+1 should only be used on systems that do not support the much better RAID10.
- Hot Spares: Hot spares are simply extra disks attached to a redundant RAID array but not actively participating in it. When one of the drives in an array fails the array is rebuilt using the hot spare. The purpose of a hot spare is to minimize the amount of time where an array is running in degraded mode.
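A hot spare can also be added to an existing array after it has been created (a sketch; the device name is an example). A disk added to a healthy array simply sits idle as a spare until a member fails:
mdadm --manage /dev/md0 --add /dev/sdg1    # on a non-degraded array the new device becomes a hot spare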
- Setting up a RAID array
Note: These are all demo commands that create various RAID volumes on 6 partitions on 6 SCSI disks (/dev/sda1 through /dev/sdf1). You would normally substitute your own disks and partitions.
- RAID5 with 64k stripes and no spare:
mdadm --create /dev/md0 --chunk=64 --level=5 --raid-devices=6 /dev/sd[abcdef]1
- RAID 0+1:
mdadm --create /dev/md0 --chunk=64 --level=0 --raid-devices=3 /dev/sd[abc]1
mdadm --create /dev/md1 --chunk=64 --level=0 --raid-devices=3 /dev/sd[def]1
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md[01]
- RAID 1+0 (RAID10):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sd[cd]1
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sd[ef]1
mdadm --create /dev/md3 --chunk=64 --level=0 --raid-devices=3 /dev/md[012]
- RAID1 (only 2 disks):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
- RAID1 with a spare (only 3 disks):
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 /dev/sd[abc]1
- A degraded RAID1 with only 1 disk:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing
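- RAID6 with 64k stripes and no spare (a sketch; this assumes a kernel and mdadm version with RAID6 support):
mdadm --create /dev/md0 --chunk=64 --level=6 --raid-devices=6 /dev/sd[abcdef]1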
- Filesystem considerations
Different filesystems have options when you create them that can optimize them for a RAID array instead of a regular disk.
- ext3: With ext3 the stride option can greatly increase write performance on a RAID array by ensuring that the filesystem always writes data in even chunks equal to a stripe on the array. The syntax is mke2fs -E stride=n where n = stripe size / block size, so if you are using 64k RAID stripes and 4k filesystem blocks you would want a stride of 16 (see the example commands after this list).
- xfs: With xfs there is an option to control the number of allocation groups. This helps the filesystem write data in a more parallel way. The syntax is mkfs.xfs -d agcount=n. This option may or may not help your performance depending on how you are using the filesystem so you should do some testing before deciding what options to use.
- reiserfs and jfs: These two filesystems don't have any tuning options that make a difference for RAID.
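For the 64k-chunk arrays created above, the two commands might look something like this (a sketch; the device name, the ext3 journal option, and the agcount value are assumptions to test against, not tuned recommendations):
mke2fs -j -E stride=16 /dev/md0    # ext3 with a stride matching 64k chunks and 4k blocks
mkfs.xfs -d agcount=8 /dev/md0     # xfs with 8 allocation groups; benchmark to find a good value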
- Monitoring an array
A redundant RAID array must be monitored for degraded conditions. Since it is possible for a disk to fail without any interruption to the system, you may have a degraded array and not even know it until a second disk dies, destroying your data. Monitoring the status of your array is as important as backing it up. There are several ways to do this:
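One simple way is to look at /proc/mdstat by hand or from a cron job; another is to run mdadm in monitor mode so that it sends mail when an array degrades. A sketch (the mail address and polling interval are assumptions):
cat /proc/mdstat                                             # manual check: look for (F) devices or a _ in the [UU] status
mdadm --monitor --scan --daemonise --mail=root --delay=1800  # daemon that mails root when an array fails or degrades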
- Mounting an array from Knoppix
Knoppix (or whatever you use for rescue boots) may or may not detect your array at bootup. This is especially true of compound RAID levels (10 or 0+1) since the top level RAID volume is not on a partition. Once you have booted you can check /proc/mdstat to see if the array was detected. If it wasn't you can use mdadm --assemble with the same syntax as mdadm --create to activate the existing array. mdadm --assemble also supports some options for scanning the disks in the system to find existing arrays if you don't remember what was there. Run mdadm --assemble --help for more info.
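For example, to bring up the 6-disk RAID10 from the earlier examples by hand under Knoppix (a sketch; the device names are assumed to match the examples above):
mdadm --examine --scan                    # see what arrays the disks' superblocks claim to belong to
mdadm --assemble /dev/md0 /dev/sd[ab]1    # assemble the RAID1 components first
mdadm --assemble /dev/md1 /dev/sd[cd]1
mdadm --assemble /dev/md2 /dev/sd[ef]1
mdadm --assemble /dev/md3 /dev/md[012]    # then the RAID0 on top of them
cat /proc/mdstat                          # verify everything came up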
- Rebuilding a failed array
When a disk does die you must replace it. After you replace the disk you have to add it to the array so that the array can be rebuilt (this happens live) and recover from degraded mode. Here are the steps, with a combined example after the list:
- Replace the disk with a working one.
- Partition the disk to match the partitions of the one it replaced. Here is a trick to copy the partition table from one disk to another:
sfdisk -d /dev/olddisk | sfdisk /dev/newdisk
- Check dmesg to make sure the kernel saw the new partition table. If not you may have to reboot.
- Use mdadm --manage to add the partitions on the new disk to the arrays:
mdadm --manage /dev/md?? --add /dev/newpartition
- Repeat the previous step for each array on the disk.
- Check /proc/mdstat to make sure the array(s) are rebuilding.
- Check later to make sure the array(s) finished rebuilding and are no longer degraded.
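Putting the steps together for a hypothetical failed /dev/sde1 in /dev/md0 (a sketch; the device names are examples):
mdadm --manage /dev/md0 --fail /dev/sde1     # mark the disk failed if the kernel has not already done so
mdadm --manage /dev/md0 --remove /dev/sde1   # remove it from the array
# (physically swap the drive, then copy the partition table over with the sfdisk trick above)
mdadm --manage /dev/md0 --add /dev/sde1      # add the new partition; the rebuild starts automatically
cat /proc/mdstat                             # watch the rebuild progress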
- Migrating an existing filesystem to a RAID1 array
One thing that happens many times is that someone sets up a system with just a single hard drive and then later decides to make the system redundant by adding a second drive and setting up RAID1 mirroring. This used to be a rather complex procedure that required several reboots, but with live CD distros such as Knoppix it is now much easier. Here are the steps to migrate a system with a single /dev/sda to a RAID1 using a brand new /dev/sdb (a sketch of the copy and re-add commands follows the list):
- Shut down the system, add the new drive, and boot from Knoppix
- Duplicate the partition table on the old disk to the new disk using the sfdisk trick I showed earlier.
- Create degraded RAID1 arrays for each non-swap partition like this:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
- Create filesystems on the new raid arrays
- Mount both the old /dev/sda? partitions and the new /dev/md?? arrays
- Copy all data from the old partitions to the new arrays (The arrays will be a tiny amount smaller due to the md superblock)
- Unmount the old partitions
- Use mdadm --manage to add the old partitions to the array and let them sync up
- While the disks are syncing, modify /etc/fstab and the grub config files to use the new md device names
- If you are using lilo, reinstall it
- Once the resync is finished reboot the system from the disks and it should come up like normal with the array mounted.
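As a sketch of the copy and re-add steps for a single filesystem (the mount points are made up for illustration):
mkdir -p /mnt/old /mnt/new
mount /dev/sda1 /mnt/old                   # the existing filesystem
mount /dev/md0 /mnt/new                    # the new degraded RAID1
cp -a /mnt/old/. /mnt/new/                 # copy everything, preserving ownership and permissions
umount /mnt/old
mdadm --manage /dev/md0 --add /dev/sda1    # the old partition is overwritten as it syncs into the array
cat /proc/mdstat                           # watch the resync while you update fstab and the boot loader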
- Examples of /proc/mdstat entries
A good RAID5 with 6 disks:
md0 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
5019520 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
The same RAID5 with a failed disk:
md0 : active raid5 sdf1[5] sde1[4](F) sdd1[3] sdc1[2] sdb1[1] sda1[0]
5019520 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
The same RAID5 during recovery:
md0 : active raid5 sdf1[5] sde1[4](F) sdd1[3] sdc1[2] sdb1[1] sda1[0]
5019520 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
[>....................] recovery = 0.2% (2356/1003904) finish=6.9min speed=2356K/sec
Good RAID1 with a good spare:
md0 : active raid1 sdc1[2] sdb1[1] sda1[0]
1003904 blocks [2/2] [UU]
Good RAID10 with 6 disks:
md3 : active raid0 md2[2] md1[1] md0[0]
3011520 blocks 64k chunks
md2 : active raid1 sdf1[1] sde1[0]
1003904 blocks [2/2] [UU]
md1 : active raid1 sdd1[1] sdc1[0]
1003904 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
1003904 blocks [2/2] [UU]
The same RAID10 with a failed disk:
md3 : active raid0 md2[2] md1[1] md0[0]
3011520 blocks 64k chunks
md2 : active raid1 sdf1[1] sde1[0]
1003904 blocks [2/2] [UU]
md1 : active raid1 sdd1[1] sdc1[0](F)
1003904 blocks [2/1] [_U]
md0 : active raid1 sdb1[1] sda1[0]
1003904 blocks [2/2] [UU]
Good RAID0+1 with 6 disks:
md2 : active raid1 md1[1] md0[0]
3011648 blocks [2/2] [UU]
md1 : active raid0 sdf1[2] sde1[1] sdd1[0]
3011712 blocks 64k chunks
md0 : active raid0 sdc1[2] sdb1[1] sda1[0]
3019776 blocks 64k chunks