Linux Software RAID
Written by Kevin Korb
as a presentation for GOLUG
Presented on 2006-02-02
This document is available at http://www.sanitarium.net/golug/Linux_Software_RAID.html
- What is RAID?
RAID is a Redundant Array of Inexpensive/Independent Disks. The idea is to bundle multiple disks together in a way that provides increased speed, capacity, and redundancy to reduce the risk of data loss due to failed hard drives. The Linux kernel can provide RAID devices using the md driver, which is standard in the kernel. The user-level tools are mdadm (http://cgi.cse.unsw.edu.au/~neilb/mdadm) or raidtools (http://people.redhat.com/mingo/raidtools/). Note that the raidtools package is deprecated, so unless you have a special need for it you should be using mdadm.
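If you want a quick check that the md driver and the mdadm tool are available on your system, something like this will do (a minimal sketch):
cat /proc/mdstat    # lists the RAID personalities the kernel supports and any active arrays
mdadm --version     # confirms mdadm is installed and shows its version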
- What is RAID good for?
- Redundancy: All of the RAID levels other than RAID0 provide redundancy. Since hard drives have fragile moving parts, with motors and bearings that must spin at a constant speed, they are a common failure point in computers and their failures often lead to data loss. RAID provides redundancy so that the loss of a single drive doesn't destroy any data.
- Performance: RAID arrays tend to perform faster than a single disk drive (this depends on the level of RAID and is more true for reading than writing). Since hard drives contain moving parts there is a limit to how fast they can communicate with the computer. If you have multiple drives working in parallel you generally increase the overall performance.
- Capacity: With RAID you can bundle multiple drives together to allow a single filesystem to be bigger than any single drive could hold. If you need a filesystem that is bigger than 500GB, RAID is the way to go. Of course LVM can also accomplish this, but LVM is usually implemented on top of RAID since plain LVM gives you no redundancy or performance benefit.
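As a sketch of how LVM is typically layered on top of an md array (the volume group and logical volume names here are made up for illustration):
pvcreate /dev/md0                  # turn the RAID array into an LVM physical volume
vgcreate raidvg /dev/md0           # create a volume group on top of it
lvcreate -L 100G -n data raidvg    # carve out a logical volume to hold a filesystem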
- What can't RAID do?
- RAID is NOT a backup system. It does not protect you against accidental file deletion or corruption. If you accidentally delete a file from a RAID array it is just as gone as it would be from a single disk. Do NOT implement a RAID solution assuming you will no longer need to back up your data.
- RAID is NOT a guarantee of uptime. This is especially true of software based RAID. If a disk suddenly dies in a SCSI system, the entire bus will usually hang, sometimes for minutes, before the system gives up trying to access the disk; other times the SCSI controller will get stuck in an endless cycle of SCSI resets. I have seen the load average on a server go beyond 500 before the kernel finally kicked the disk from the array. On IDE systems a disk that suddenly dies can cause the entire system to lock up. Even though RAID isn't giving you guaranteed uptime, it still protects the data. Once the dead disk is pulled and the system is rebooted the data will still be there on the degraded array.
- What is the difference between hardware and software RAID?
In hardware RAID all of the RAID functionality is provided by a RAID controller (usually a PCI card or external chassis connected to a SCSI controller). You connect disks to the RAID controller and set up the array within its BIOS. The controller then exports a virtual disk to the OS that can be partitioned and formatted. You will need drivers for your OS to access the card and you will need user-level tools to monitor and maintain the array. These drivers and tools depend on which controller you are using. In software RAID you take a bunch of regular disks, partition them, and use the md driver in the Linux kernel to create a RAID array on a set of the partitions. The important difference is that in hardware RAID you partition the array, while in software RAID you RAID the partitions.
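Since software RAID works on partitions, each member partition is usually given the partition type fd (Linux raid autodetect) so that the kernel can recognize it at boot. A minimal sketch using the old-style sfdisk option (the device and partition number are just examples):
sfdisk --change-id /dev/sda 1 fd    # mark partition 1 on /dev/sda as Linux raid autodetect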
- What are the advantages of software RAID over hardware RAID?
- Cost: Software RAID is cheaper than hardware RAID because no extra controller is needed. RAID controllers tend to be somewhat expensive compared to regular "dumb" or "JBOD" controllers. You can probably use the controllers built into your motherboard unless you need additional ports. Note that for IDE you should only put one drive on each bus since the master/slave setup will completely destroy the performance of an array.
- Performance: If you are on a modern system with a good bus and a good CPU it may actually be faster to run software RAID than hardware RAID (unless you buy a really expensive hardware RAID). Even if your CPU isn't the latest it is still likely to perform as well as all but the most expensive hardware RAID controllers. Note that most of the motherboards that have integrated RAID controllers have chipsets that are so bad they are actually slower than software RAID (and they tend to only support RAID0 or RAID1). Most of the integrated RAID chipsets (like HighPoint) actually are software RAID in disguise.
- Disk Flexibility: Since software RAID is RAIDing the partitions instead of the disks it is easier to compensate for disks that are slightly different sizes. I generally put a swap partition on each RAID disk that is either 2% of the disk or 2GB, whichever is bigger. That way if I have to replace a failed disk with a slightly smaller one I can compensate by reducing the size of the swap partition. The alternative is to wipe out and recreate the entire array to match the smallest drive size.
- Partition Flexibility: Since software RAID is RAIDing the partitions instead of the disks it is possible to have different RAID levels existing on the same disks. For instance you could have 4 disks with the first 2 containing a RAID1 of /boot, the second 2 containing a RAID1 of /, and then all 4 containing a RAID5 of /home. Some hardware RAID controllers can also do this but only the most expensive of them and they usually do it in really weird and confusing ways.
- Disk Monitoring and Diagnostics: Since most hardware RAID controllers only export the virtual RAID volumes it isn't easy to debug disk problems. 3Ware controllers allow you to probe the individual disks for SMART diagnostics but none of the other controllers (AFAIK) do. It is virtually impossible to benchmark or test an individual disk without removing it from the system and testing it on a non-RAID computer. This of course must be done with the array shut down the entire time or you will have to rebuild the array after each drive.
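With software RAID, for example, you can check SMART health or run a quick benchmark against each member disk directly while the array is running (a sketch; it assumes smartmontools and hdparm are installed and uses an example device name):
smartctl -H /dev/sda    # quick SMART health verdict for one member disk
smartctl -a /dev/sda    # full SMART attributes and error log
hdparm -tT /dev/sda     # rough read benchmark of the individual disk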
- What are the advantages of hardware RAID over software RAID?
- Performance: If you have the money for a top of the line RAID controller the performance will be significantly better than software RAID. There will also be less CPU overhead. If you are planning on a huge database with lots of activity then you probably need to invest in a hardware RAID array. This is especially true with RAID5 due to its constant calculation of parity.
- OS Independence: Since the RAID controller is exporting a single virtual disk to the OS it doesn't really matter what OS you are running. If your OS has drivers to communicate with the RAID controller itself then you can use the RAID volume. With software RAID each OS has a completely different and incompatible implementation so there is no way the same software array could be used from multiple OSs.
- RAID Level Flexibility: Since the RAID controller is exporting a single virtual disk to the OS you are able to boot from any RAID level instead of just RAID1 as with software RAID. If you want to boot from a RAID5 this is the way.
- Error Checking: When the good high end RAID controllers (like 3Ware) are running RAID5 they are constantly checking the parity of the data since that is all the CPU on the controller has to do. This sometimes allows them to find and repair bad sectors on a disk before they become problems. Software RAID OTOH just relies on the disks themselves to detect and repair bad sectors. If the disk is unable to repair a bad sector it will return an IO error and the kernel will kick the drive from the array leaving you with a degraded array.
- What are the different levels of RAID and what are they for?
- RAID0: This is simple striping of data across multiple disks. There is NO redundancy at this level. If one disk dies the entire array is lost. RAID0 is only useful for pure speed when redundancy is not needed or when combined with RAID1 to make a RAID10 or RAID 0+1 array. RAID0 gives you the capacity of all of your disks since there is no space reserved for redundancy.
- RAID1: This is simple mirroring. 2 or more disks that are exact copies of each other. If one disk is lost the other continues in degraded mode until a replacement is added. RAID1 does (usually) offer a small speed increase for reading but not for writing. RAID1 is the only software RAID level that you can boot from because it is the only one that leaves a plain partition that the BIOS and boot loader can read. RAID1 gives you the capacity of one of your disks no matter how many disks are in the array.
- RAID5: RAID5 is a very popular RAID because it offers redundancy but uses parity to reduce the cost in terms of redundant disks. In RAID5 instead of mirroring data on dedicated mirror drives like RAID1 you have parity data that can be used to recreate the data that is stored on any one drive in the array. That means that if you have a 6 drive RAID5 array you have the capacity of 5 of the drives and if any one drive dies the array can continue in degraded mode until the dead drive is replaced. The array then reconstructs the missing data based on the remaining data and the parity data to complete the array. The parity data is striped across all of the disks so that there isn't a single dedicated parity drive doing all of the work like there is in RAID3 or RAID4. Here is a representation of a RAID5 array:
         | Disk1  | Disk2  | Disk3  | Disk4  | Disk5  | Disk6  |
Stripe01 | Parity | Data1  | Data2  | Data3  | Data4  | Data5  |
Stripe02 | Data1  | Parity | Data2  | Data3  | Data4  | Data5  |
Stripe03 | Data1  | Data2  | Parity | Data3  | Data4  | Data5  |
Stripe04 | Data1  | Data2  | Data3  | Parity | Data4  | Data5  |
Stripe05 | Data1  | Data2  | Data3  | Data4  | Parity | Data5  |
Stripe06 | Data1  | Data2  | Data3  | Data4  | Data5  | Parity |
Stripe07 | Parity | Data1  | Data2  | Data3  | Data4  | Data5  |
Stripe08 | Data1  | Parity | Data2  | Data3  | Data4  | Data5  |
Stripe09 | Data1  | Data2  | Parity | Data3  | Data4  | Data5  |
Stripe10 | Data1  | Data2  | Data3  | Parity | Data4  | Data5  |
Stripe11 | Data1  | Data2  | Data3  | Data4  | Parity | Data5  |
Stripe12 | Data1  | Data2  | Data3  | Data4  | Data5  | Parity |
RAID5 is fast for reading however it is slow for writing and has heavy CPU overhead due to the parity calculations. When RAID5 is degraded it is slow all the time for the same reason. You need at least 3 disks for a RAID5 array.
- RAID3/4: These were the first of the RAID levels that used parity data. Neither is used much anymore since RAID5 is a much better implementation of the same concept. In RAID3 and RAID4 you have parity data like in RAID5, but the parity data is all stored on one disk. That disk sees more activity than any of the other disks, which tends to make it fail more often. Like RAID5, RAID3 and RAID4 give you the capacity of the number of disks you have minus one. Also like RAID5, you need at least 3 disks for a RAID3 or RAID4 array.
- RAID6: RAID6 is essentially the same concept as RAID5 except there are 2 disks worth of parity data. This means you have the redundancy to lose 2 disks at the same time without losing the array, but at the cost of 2 disks worth of capacity. You need at least 4 disks for a RAID6 array; however, you will not see a benefit unless you have at least 5 disks. (A sample creation command appears with the other mdadm examples below.)
- RAID10 (1+0): This is the best of the RAID levels. It is very fast for reading and writing, and has the most redundancy. The idea of RAID10 is that you make a RAID0 array out of a set of RAID1 arrays. The RAID0 portion is not redundant by itself, but each of its components is. Here is a diagram of a RAID10 array:
RAID0     | Disk  | Disk  |
RAID1 (1) | Disk1 | Disk2 |
RAID1 (2) | Disk3 | Disk4 |
RAID1 (3) | Disk5 | Disk6 |
Here is a representation of the same array with a bad disk:
RAID0     | Disk  | Disk           |
RAID1 (1) | Disk1 | Disk2          |
RAID1 (2) | Disk3 | Disk4 (FAILED) |
RAID1 (3) | Disk5 | Disk6          |
As you can see, when disk #4 dies it leaves RAID1 #2 degraded; however, the other 2 RAID1 arrays are still redundant. That means that you could lose up to half of the disks in the RAID10, as long as no 2 of them are the same RAID1 pair.
- RAID0+1: This is the same concept as RAID10 except that it is a RAID1 of 2 RAID0 arrays:
RAID1 | RAID0 (1) | RAID0 (2) |
      | Disk1     | Disk2     |
      | Disk3     | Disk4     |
      | Disk5     | Disk6     |
The disadvantage of RAID0+1 can be easily seen when I kill disk #4 again:
RAID1 | RAID0 (1) | RAID0 (2)      |
      | Disk1     | Disk2          |
      | Disk3     | Disk4 (FAILED) |
      | Disk5     | Disk6          |
As you can see, when 1 drive dies it takes its whole RAID0 with it, leaving only a single non-redundant RAID0. The other disks that made up the failed RAID0 simply sit idle until the entire array is repaired. When the bad disk is replaced the rebuild will also take longer, since the second RAID0 is considered to be new and the entire thing must be synchronized. RAID0+1 should only be used on systems that do not support the much better RAID10.
- Hot Spares: Hot spares are simply extra disks attached to a redundant RAID array but not actively participating in it. When one of the drives in an array fails the array is rebuilt using the hot spare. The purpose of a hot spare is to minimize the amount of time where an array is running in degraded mode.
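A hot spare can also be added to an existing array after it has been created (a sketch; the device name is an example). A disk added to a healthy array simply sits idle as a spare until a member fails:
mdadm --manage /dev/md0 --add /dev/sdg1    # on a non-degraded array the new device becomes a hot spare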
- Setting up a RAID array
Note: These are all demo commands that create various RAID volumes on 6 partitions on 6 SCSI disks (/dev/sda1 through /dev/sdf1). You would normally substitute your own disks and partitions.
- RAID5 with 64k stripes and no spare:
mdadm --create /dev/md0 --chunk=64 --level=5 --raid-devices=6 /dev/sd[abcdef]1
- RAID 0+1:
mdadm --create /dev/md0 --chunk=64 --level=0 --raid-devices=3 /dev/sd[abc]1
mdadm --create /dev/md1 --chunk=64 --level=0 --raid-devices=3 /dev/sd[def]1
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md[01]
- RAID 1+0 (RAID10):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sd[cd]1
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sd[ef]1
mdadm --create /dev/md3 --chunk=64 --level=0 --raid-devices=3 /dev/md[012]
- RAID1 (only 2 disks):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
- RAID1 with a spare (only 3 disks):
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 /dev/sd[abc]1
- A degraded RAID1 with only 1 disk:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing
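- RAID6 with 64k stripes and no spare (a sketch; this assumes a kernel and mdadm version with RAID6 support):
mdadm --create /dev/md0 --chunk=64 --level=6 --raid-devices=6 /dev/sd[abcdef]1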
- Filesystem considerations
Different filesystems have options when you create them that can optimize them for a RAID array instead of a regular disk.
- ext3: With ext3 the stride option can greatly increase write performance on a RAID array by ensuring that the filesystem always writes data in even chunks equal to a stripe on the array. The syntax is mke2fs -E stride=n where n = stripe size / block size, so if you are using 64k RAID stripes and 4k filesystem blocks you would want a stride of 16 (see the example commands after this list).
- xfs: With xfs there is an option to control the number of allocation groups. This helps the filesystem write data in a more parallel way. The syntax is mkfs.xfs -d agcount=n. This option may or may not help your performance depending on how you are using the filesystem so you should do some testing before deciding what options to use.
- reiserfs and jfs: These two filesystems don't have any tuning options that make a difference for RAID.
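For the 64k-chunk arrays created above, the two commands might look something like this (a sketch; the device name, the ext3 journal option, and the agcount value are assumptions to test against, not tuned recommendations):
mke2fs -j -E stride=16 /dev/md0    # ext3 with a stride matching 64k chunks and 4k blocks
mkfs.xfs -d agcount=8 /dev/md0     # xfs with 8 allocation groups; benchmark to find a good value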
- Monitoring an array
A redundant RAID array must be monitored for degraded conditions. Since it is possible for a disk to fail without any interruption to the system, you may have a degraded array and not even know it until a second disk dies, destroying your data. Monitoring the status of your array is as important as backing it up. There are several ways to do this:
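One simple way is to look at /proc/mdstat by hand or from a cron job; another is to run mdadm in monitor mode so that it sends mail when an array degrades. A sketch (the mail address and polling interval are assumptions):
cat /proc/mdstat                                             # manual check: look for (F) devices or a _ in the [UU] status
mdadm --monitor --scan --daemonise --mail=root --delay=1800  # daemon that mails root when an array fails or degrades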
- Mounting an array from Knoppix
Knoppix (or whatever you use for rescue boots) may or may not detect your array at bootup. This is especially true of compound RAID levels (10 or 0+1) since the top level RAID volume is not on a partition. Once you have booted you can check /proc/mdstat to see if the array was detected. If it wasn't you can use mdadm --assemble with the same syntax as mdadm --create to activate the existing array. mdadm --assemble also supports some options for scanning the disks in the system to find existing arrays if you don't remember what was there. Run mdadm --assemble --help for more info.
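For example, to bring up the 6-disk RAID10 from the earlier examples by hand under Knoppix (a sketch; the device names are assumed to match the examples above):
mdadm --examine --scan                    # see what arrays the disks' superblocks claim to belong to
mdadm --assemble /dev/md0 /dev/sd[ab]1    # assemble the RAID1 components first
mdadm --assemble /dev/md1 /dev/sd[cd]1
mdadm --assemble /dev/md2 /dev/sd[ef]1
mdadm --assemble /dev/md3 /dev/md[012]    # then the RAID0 on top of them
cat /proc/mdstat                          # verify everything came up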
- Rebuilding a failed array
When a disk does die you must replace it. After you replace the disk you have to add it to the array so that the array can be rebuilt (this happens live) and recover from degraded mode. Here are the steps, with a combined example after the list:
- Replace the disk with a working one.
- Partition the disk to match the partitions of the one it replaced. Here is a trick to copy the partition table from one disk to another:
sfdisk -d /dev/olddisk | sfdisk /dev/newdisk
- Check dmesg to make sure the kernel saw the new partition table. If not you may have to reboot.
- Use mdadm --manage to add the partitions on the new disk to the arrays:
mdadm --manage /dev/md?? --add /dev/newpartition
- Repeat the previous step for each array on the disk.
- Check /proc/mdstat to make sure the array(s) are rebuilding.
- Check later to make sure the array(s) finished rebuilding and are no longer degraded.
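Putting the steps together for a hypothetical failed /dev/sde1 in /dev/md0 (a sketch; the device names are examples):
mdadm --manage /dev/md0 --fail /dev/sde1     # mark the disk failed if the kernel has not already done so
mdadm --manage /dev/md0 --remove /dev/sde1   # remove it from the array
# (physically swap the drive, then copy the partition table over with the sfdisk trick above)
mdadm --manage /dev/md0 --add /dev/sde1      # add the new partition; the rebuild starts automatically
cat /proc/mdstat                             # watch the rebuild progress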
- Migrating an existing filesystem to a RAID1 array
One thing that happens many times is that someone sets up a system with just a single hard drive and then later decides to make the system redundant by adding a second drive and setting up RAID1 mirroring. This used to be a rather complex procedure that required several reboots, but with live CD distros such as Knoppix it is now much easier. Here are the steps to migrate a system with a single /dev/sda to a RAID1 using a brand new /dev/sdb (a sketch of the copy and re-add commands follows the list):
- Shut down the system, add the new drive, and boot from Knoppix
- Duplicate the partition table on the old disk to the new disk using the sfdisk trick I showed earlier.
- Create degraded RAID1 arrays for each non-swap partition like this:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
- Create filesystems on the new raid arrays
- Mount both the old /dev/sda? partitions and the new /dev/md?? arrays
- Copy all data from the old partitions to the new arrays (The arrays will be a tiny amount smaller due to the md superblock)
- Unmount the old partitions
- Use mdadm --manage to add the old partitions to the array and let them sync up
- While the disks are syncing, modify /etc/fstab and the grub config files to use the new md device names
- If you are using lilo, reinstall it
- Once the resync is finished reboot the system from the disks and it should come up like normal with the array mounted.
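As a sketch of the copy and re-add steps for a single filesystem (the mount points are made up for illustration):
mkdir -p /mnt/old /mnt/new
mount /dev/sda1 /mnt/old                   # the existing filesystem
mount /dev/md0 /mnt/new                    # the new degraded RAID1
cp -a /mnt/old/. /mnt/new/                 # copy everything, preserving ownership and permissions
umount /mnt/old
mdadm --manage /dev/md0 --add /dev/sda1    # the old partition is overwritten as it syncs into the array
cat /proc/mdstat                           # watch the resync while you update fstab and the boot loader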
- Examples of /proc/mdstat entries
A good RAID5 with 6 disks:
md0 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
5019520 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
The same RAID5 with a failed disk:
md0 : active raid5 sdf1[5] sde1[4](F) sdd1[3] sdc1[2] sdb1[1] sda1[0]
5019520 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
The same RAID5 during recovery:
md0 : active raid5 sdf1[5] sde1[4](F) sdd1[3] sdc1[2] sdb1[1] sda1[0]
5019520 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
[>....................] recovery = 0.2% (2356/1003904) finish=6.9min speed=2356K/sec
Good RAID1 with a good spare:
md0 : active raid1 sdc1[2] sdb1[1] sda1[0]
1003904 blocks [2/2] [UU]
Good RAID10 with 6 disks:
md3 : active raid0 md2[2] md1[1] md0[0]
3011520 blocks 64k chunks
md2 : active raid1 sdf1[1] sde1[0]
1003904 blocks [2/2] [UU]
md1 : active raid1 sdd1[1] sdc1[0]
1003904 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
1003904 blocks [2/2] [UU]
The same RAID10 with a failed disk:
md3 : active raid0 md2[2] md1[1] md0[0]
3011520 blocks 64k chunks
md2 : active raid1 sdf1[1] sde1[0]
1003904 blocks [2/2] [UU]
md1 : active raid1 sdd1[1] sdc1[0](F)
1003904 blocks [2/1] [_U]
md0 : active raid1 sdb1[1] sda1[0]
1003904 blocks [2/2] [UU]
Good RAID0+1 with 6 disks:
md2 : active raid1 md1[1] md0[0]
3011648 blocks [2/2] [UU]
md1 : active raid0 sdf1[2] sde1[1] sdd1[0]
3011712 blocks 64k chunks
md0 : active raid0 sdc1[2] sdb1[1] sda1[0]
3019776 blocks 64k chunks