RAID

Excellent websites:

http://www.pcguide.com/ref/hdd/perf/raid/levels/single.htm

http://www.acnc.com/04_01_00.html

http://www.staff.uni-mainz.de/neuffer/scsi/what_is_raid.html

The Origin of RAID

Files produced by some applications may be extremely large, gigabytes or terabytes in size. Disks large enough to hold such files can be extremely expensive or even unavailable, so it would be convenient to break up files or file systems into pieces that could be placed on inexpensive, modestly sized disks. This solution generates a problem of its own. Disks have moving parts and are therefore prone to failure. If a single file or file system is spread over 10 disks, the failure of any one disk means the loss of data, so the probability of data loss is roughly 10 times that of a single disk. As a result, the use of multiple disks to support a single file system requires some kind of redundancy. Redundancy means that all data is written in a way that allows reconstruction of the original data in the event of the failure of one or more disks.
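To make the risk concrete, here is a minimal Python sketch. It assumes that disk failures are independent and that every disk has the same probability of failing over some period; the function name and the 3% figure are illustrative assumptions, not measurements. For small per-disk probabilities the result stays close to the rule of thumb that ten disks carry about ten times the risk of one.

```python
def volume_failure_probability(p_disk, n_disks):
    """Probability that at least one of n independent disks fails."""
    return 1.0 - (1.0 - p_disk) ** n_disks

# Hypothetical 3% chance that any one disk fails during the period.
single = volume_failure_probability(0.03, 1)
ten = volume_failure_probability(0.03, 10)
print(f"1 disk: {single:.3f}  10 disks: {ten:.3f}")
```

The exact figure is slightly below ten times the single-disk value, because the simple multiplication counts the cases where several disks fail together more than once.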

This need for large amounts of moderately sized, redundant storage was the origin of the concept of RAID, which originally stood for Redundant Array(s) of Inexpensive Disks. The term RAID is now commonly interpreted as Redundant Array(s) of Independent Disks. In a RAID configuration, individual disks or disk partitions are used as components in constructing a volume. Volumes may, in turn, be used as components in other RAID configurations. The volume is presented to the Solaris operating system as a single large disk and may be used for any file system or by an application.

The original definition of RAID included five configurations, RAID 1 through 5, described below. All included redundancy, the "R" in "RAID." A sixth definition, RAID 0, was added later; it is not true RAID because it is not redundant. Like true RAID, it is used to place large file systems or large files on multiple disks, but because of the risk of disk failure, it should be used only as a component of a true RAID configuration.

Performance Issues

On a single disk, a file is always written sequentially, one part of the file following another. With multiple disks, a file may be written in parallel, one sequential chunk of the file going to each disk in the volume at the same time. This process is called striping. Writes to a striped volume will probably be faster than writes to a single disk. RAID therefore has the potential to improve disk write performance, but only if the disks are attached to different controllers. If all data has to flow through the same controller, the disks will not be written in parallel. Throughout this discussion, we will assume that disks are on different controllers, so that there is no wait for controller access when writing data. If disks are on the same controller, there may be some latency between writes, possibly a lot, and performance will suffer.

A RAID configuration implies some kind of redundancy, which means that additional disk space must be used to hold the redundant data, and also that additional writes, and perhaps calculations, must be made. This slows writes.

Using a RAID configuration always implies trade-offs between the security provided by redundancy, the performance implied by multiple simultaneous writes, and cost. The more calculations must be performed, the slower the writes, but the less additional disk space must be used to write the redundant information. The more disks are written in parallel, the faster the performance, but the greater the risk of disk failure. Designing a RAID always requires careful planning of resource use because of this kind of trade-off. The rule of thumb is "Low Cost, Good Performance, File Security - Choose two." It is possible to design a very fast, very secure, easily recovered RAID configuration, but it will not be cheap. A cheap configuration may have either good file security or it may be fast, but not both. Much thought and effort therefore goes into getting the best performance and best possible security out of inexpensive RAID configurations.

RAID Implementations

RAID configurations may be implemented in two ways: through a RAID controller on a hardware RAID array, or as software RAID on a JBOD (Just a Bunch Of Disks), which is a simple array with no hardware RAID controller. JBODs plus a software RAID application are much less expensive than hardware RAID, but software RAID requires all data to be handled by the software RAID program in addition to any other applications, and is generally slower than hardware RAID. The cost is low, and file security is good, but performance suffers.

There are two software RAID programs commonly used on Solaris systems: the market-dominating Veritas Volume Manager, and Sun Microsystems' Solaris Volume Manager (SVM), which is called Solstice DiskSuite (SDS) in Solaris 8 and earlier. SVM/SDS is free with the Solaris Operating System.

Redundancy

Redundancy is handled in two ways: with mirroring or with parity calculations. Each type of redundancy requires additional disk space to record the redundant information. In mirroring, identical data is simultaneously written to two or more disks, so at least two complete copies of the data always exist. This form of redundancy is fast and secure but expensive, as it requires at least twice as much disk space as data.

Parity is used to provide redundancy to striped volumes. When a volume is striped, the same sized chunk of data is written to each disk at the same time. A calculation is then performed on all bits written to the same place on each disk in the RAID. The resulting single value is written to a parity disk. The need to compute the parity slows the write operation, but only one disk is used to write the redundant information, so parity is less expensive than mirroring.
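The difference in cost can be put into numbers. The following is a rough sketch that assumes equal-sized disks and counts capacity in whole disks; the function names are illustrative only.

```python
def usable_disks_mirror(total_disks, copies=2):
    """Disks' worth of data space when every block is stored `copies` times."""
    return total_disks / copies

def usable_disks_parity(total_disks):
    """Disks' worth of data space when one disk's worth holds parity."""
    return total_disks - 1

for n in (4, 6, 10):
    print(n, "disks: mirror", usable_disks_mirror(n), "parity", usable_disks_parity(n))
```

With six disks, a two-way mirror leaves three disks' worth of space for data, while a single-parity stripe leaves five.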

Calculating the Parity

Parity is calculated using a logical operation called "XOR" (exclusive OR). The XOR operation compares two bits and yields 1 if they differ and 0 if they are the same:

0 XOR 0 = 0

1 XOR 1 = 0

1 XOR 0 = 1

0 XOR 1 = 1

Each time a write occurs on a striped disk, all bits written in the same place on the disks are XOR'ed with each other, and the result is written to a parity disk. So if five disks have the following write in the same place:

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5

1 0 1 0 1

The parity will be calculated as:

1 (Disk 1) XOR 0 (Disk 2) = 1; 1 XOR 1 (Disk 3) = 0; 0 XOR 0 (Disk 4) = 0; 0 XOR 1 (Disk 5) = 1

If any one bit is lost, the remaining bits and the parity can be used to recover the value of the lost bit. If a disk is lost, the remaining disks and the parity disk can therefore be used to reconstruct the lost disk.
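The five-disk example above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not how a volume manager is implemented; the names parity, data and survivors are illustrative.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """XOR all of the bits together; the result is the parity bit."""
    return reduce(xor, bits, 0)

data = [1, 0, 1, 0, 1]           # Disk 1 through Disk 5
p = parity(data)                 # value written to the parity disk

# Simulate losing Disk 3: XOR the surviving bits with the parity bit.
survivors = data[:2] + data[3:]
recovered = parity(survivors + [p])
print("parity:", p, "recovered Disk 3 bit:", recovered)
```

The recovery works because XOR'ing a value in twice cancels it out, so XOR'ing the parity with every surviving bit leaves only the missing bit.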

RAID Configuration Definitions

RAID 0 - In RAID 0 configurations a large virtual disk, or volume, is formed from multiple physical disks. The volume management software "sees" the physical disks, and presents them to the operating system as one disk. The RAID 0 volume may be written in two ways: stripe or concatenation.

RAID 0 is fundamentally unreliable because there is no built-in redundancy and because the risk of data loss is multiplied by the number of disks. If one physical disk fails, all the data in the file system may be lost. RAID 0 is cheap and fast, because no disk space is used for redundancy and no parity calculations must be made, but it is very risky. As a result, RAID 0 volumes are commonly used as components in other RAID volumes.

RAID 0 – Striping - Files are striped when a block of data in the file is broken up into N pieces and written in parallel to N disks. This is very fast and very inexpensive, but there is no redundancy, so RAID 0 striping is not a true RAID configuration. If a striped RAID 0 volume must be made larger (grown), the file system must be unmounted and the volume reconfigured. RAID 0 striped volumes therefore cannot be "grown on the fly."

Characteristics: Write speed increases, read speed increases. The entire disk is used. Striped volumes may not be grown on the fly.
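As an illustration of how striping places data, the sketch below maps a logical chunk number to a disk and a position on that disk. It assumes simple round-robin placement with zero-based numbering; the function name is hypothetical and real volume managers add details such as configurable stripe unit sizes.

```python
def stripe_location(logical_chunk, n_disks):
    """Return (disk index, chunk offset on that disk), both zero-based."""
    return logical_chunk % n_disks, logical_chunk // n_disks

three_disk = [stripe_location(c, 3) for c in range(6)]
four_disk = [stripe_location(c, 4) for c in range(6)]
print(three_disk)
print(four_disk)
```

Comparing the two lists shows why a striped volume cannot simply be grown in place: adding a disk changes where most existing chunks are expected to be.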

RAID 0 - Concatenation - A concatenated volume is written sequentially, one physical disk partition at a time, moving to the next partition only when the current one is full. This is no faster than writing to a single disk, and it may be a long time before the last disk actually contains any data. Concatenated volumes may be grown "on the fly," that is, additional partitions may be added to the definition of the virtual disk while it is mounted and in use.

Characteristics: Reads may be faster since multiple disks can hold data and may be read in parallel. Write performance is the same as for a simple disk. Up until the disks become full, the last disks to be written are unused. Concatenated volumes may be grown on the fly.
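The address mapping for a concatenated volume can be sketched in the same spirit. The partition sizes (in blocks) and the function name below are made up for illustration.

```python
def concat_location(logical_block, partition_sizes):
    """Return (partition index, block offset) in a concatenated volume."""
    for index, size in enumerate(partition_sizes):
        if logical_block < size:
            return index, logical_block
        logical_block -= size
    raise ValueError("block is beyond the end of the volume")

sizes = [100, 100, 50]
before = [concat_location(b, sizes) for b in (0, 150, 249)]

sizes.append(200)                 # grow the volume by adding a partition
after = [concat_location(b, sizes) for b in (0, 150, 249)]
print(before, after, concat_location(250, sizes))
```

Existing blocks keep their locations after the new partition is appended, which is what makes growing a concatenation possible while it is in use.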

RAID 1 – Mirroring - In a RAID 1 volume, data is simultaneously written to two (or more) mirrors, where a mirror can be a simple disk, a striped volume or a concatenated volume. It provides immediate and complete protection against the failure of a single device because all data is written to multiple disks nearly simultaneously. If a disk fails, the other disk can be used for reads and writes while the mirror is reconstructed, so all files are available at nearly full performance even in the case of failure. RAID 1 is the most expensive form of RAID, since two or more complete copies of data are required. RAID 1 is the only true RAID that is not striped, although striped RAID 0 volumes may be used to construct the RAID 1 volume.

Characteristics: Writes are slower (although not half as fast). Typically mirroring degrades write performance by about 15 percent, although a three-way mirror can degrade write performance by up to 44 percent. Reads may be faster, since both disks are available for reads, but only for multi-threaded read requests or when multiple users are reading from the disks.

RAID 2 – Data is striped at the bit level, and redundancy is provided by an ECC (Error Correction Code, typically a Hamming code) rather than by simple parity. Rarely used.

RAID 3 – Data is striped by chopping up a data block into chunks of a specified number of bytes each. A parity value is calculated for each stripe and written to a dedicated parity disk. Every time data is written, the stripe chunks must be allocated and the parity disk must be accessed, so access to the parity disk is a bottleneck in this kind of RAID, and it is therefore quite slow. It is fairly inexpensive, however, since only one disk need be used as a parity disk. Requires a hardware RAID controller. Rarely used.

RAID 4 – In RAID 4, data is written to disks one data block at a time, then a parity value is calculated across multiple disks and written to a single dedicated parity disk. The parity disk is a bottleneck, as in RAID 3, but data blocks do not need to be broken up for byte-level striping, so RAID 4 is faster than RAID 3. Requires a hardware RAID controller. Rarely used.

RAID 5

RAID 5 offers the most reliability for the least money, but it is not very fast, as it requires the calculation of a parity value with every write. Because writes have become so extremely large, RAID 5 is nonetheless the most popular type of RAID.

RAID 5 is a sort of rotating stripe, in which data is striped across all but one disk; the parity value for the data just written is then written to the remaining disk. The next write also places data on all but one disk, but this time the parity is written to a different disk. This form of parity is called distributed parity, since the parity is written to all disks in turn. For a four disk RAID 5 configuration, the first write is striped on disks 1, 2 and 3, and the parity is written to disk 4. The next write is striped on disks 4, 1 and 2, and the parity is written to disk 3. The following write is striped on disks 3, 4 and 1, and the parity is written to disk 2. Each subsequent write begins on the same disk to which parity was last written.

Distributing the parity writes makes the RAID no safer, but increases its speed, an important factor in an otherwise fairly slow storage configuration. In RAID 5, every time there is a write to a disk, there must also be a parity write to a disk. If all parity writes were made to the same disk, that disk would be much busier than the others and the parity writes would become a bottleneck. RAID 3 and RAID 4 are seldom used simply because they employ only one disk as the parity disk and therefore have this bottleneck, which RAID 5 avoids. Write performance is often still quite slow. If more than 20% of disk accesses are writes, RAID 5 is not a good choice.
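The rotation described above for a four disk configuration can be written down as a small function. This sketch follows the order given in this text (parity on disk 4, then 3, then 2, and so on); actual products may rotate parity in a different order, and the function name is hypothetical.

```python
def raid5_stripe_layout(stripe, n_disks):
    """Return (data disks in write order, parity disk), numbered from 1.

    `stripe` is the zero-based number of the full-stripe write.
    """
    parity_disk = n_disks - (stripe % n_disks)
    # Data starts on the disk after the parity disk and wraps around.
    data_disks = [(parity_disk + k - 1) % n_disks + 1 for k in range(1, n_disks)]
    return data_disks, parity_disk

for s in range(5):
    data_disks, parity_disk = raid5_stripe_layout(s, 4)
    print("write", s + 1, "data on", data_disks, "parity on", parity_disk)
```

Over any four consecutive writes each disk holds parity exactly once, which is how the parity load is spread evenly.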

Reads are as fast as they would be on any striped volume, since there are multiple disks which may be simultaneously read if they are on separate controllers and because parity is not used in reads.

Characteristics: RAID 5 requires at least three disks. Can be done using software only, but performs very poorly in that case. Practically speaking, it requires a hardware RAID controller, since hardware RAID does the parity calculations internally.

RAID 5 I/O

Full stripe I/O - Parity is computed from the data, and then data and parity both are written in one full stripe.

Partial stripe I/O - Partial stripe writes are more complicated than full stripe writes. To improve performance, a partial stripe write is handled in two ways:

1: Read-Modify-Write: This type of I/O handling is used when the size of the update is less than half the stripe length. The buffers in stripe units that are going to be overwritten are read, then the parity is computed by XOR'ing the old data, the old parity and the new data. Read-modify-write operations slow the write process.

This is the reason that the parity is distributed in RAID-5, instead of being on a dedicated parity disk. When full-stripe-writes occur, it makes no difference if parity is written on one disk or distributed. However, a dedicated parity disk becomes a bottleneck when multiple simultaneous read-modify-write operations occur, as each must access the parity disk twice, once to read the parity, and once to write the new parity. Spreading the parity over all disks eliminates the bottleneck.

2: Reconstruction-Writes: This type of I/O handling is used when the size of the update is more than half the stripe length. Stripe units that are not going to be written are read in from the disk. The parity is computed by XOR'ing the data read and the data to be written. The new parity and data are written in a similar way to Read-Modify-Write I/O.
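Both partial stripe methods arrive at the same parity; they differ only in which chunks must be read first. A minimal sketch with made-up two-byte chunks on a four disk RAID 5 (three data chunks plus parity per stripe); xor_bytes is an illustrative helper, not part of any volume manager.

```python
from functools import reduce

def xor_bytes(*chunks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

old = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]    # existing data chunks
old_parity = xor_bytes(*old)
new_chunk0 = b"\xaa\xbb"                         # replaces old[0]

# Read-modify-write: read the old data being replaced and the old parity.
rmw_parity = xor_bytes(old_parity, old[0], new_chunk0)

# Reconstruct-write: read the chunks that are NOT being replaced.
recon_parity = xor_bytes(new_chunk0, old[1], old[2])

print(rmw_parity == recon_parity)
```

Read-modify-write always reads the chunks being replaced plus the parity, while a reconstruct-write reads the untouched chunks, so the cheaper method depends on how much of the stripe is being updated.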

If a disk fails, it is still possible (though not advisable) to keep using a RAID 5 volume. Data that was on the failed disk can still be read, albeit more slowly, since it must be recomputed from the surviving data and the parity. Reads of data on any other disk are unaffected. If the failed drive would have held the parity for a stripe, the write is done without parity. If the failed drive held data for the stripe, the old data and the parity must be read from the surviving drives and the parity recalculated before the write can be done (a read-modify-write), regardless of how much is written, because the new parity must still take the failed drive's data into account. Simply overwriting the parity without accounting for the failed disk's contents would result in the loss of all data on the failed disk. It is also possible to write to the failed drive's part of a stripe: the data on the surviving drives is read, and the parity is recalculated to reflect what the new data would have been.

The failed disk can be recovered by reading the surviving stripes, XOR'ing the data, and writing to the new disk.
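The sketch below walks through one stripe in degraded mode: a read of the lost chunk, a write to a surviving disk, and the final rebuild. The chunk contents and helper name are invented for illustration, and a real rebuild repeats the last step for every stripe on the volume.

```python
def xor_bytes(*chunks):
    """XOR equal-length byte strings together, position by position."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

# One stripe: data chunks on disks 0-2, parity on disk 3.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_bytes(*data)

lost = data[1]        # disk 1 fails; the copy is kept only to check results

# Degraded read of disk 1's chunk from the survivors and the parity.
degraded_read = xor_bytes(data[0], data[2], parity)

# Degraded write to disk 0: old data, old parity and new data give new parity.
new0 = b"ZZZZ"
parity = xor_bytes(parity, data[0], new0)
data[0] = new0

# Rebuild disk 1's chunk onto a replacement disk.
rebuilt = xor_bytes(data[0], data[2], parity)
print(degraded_read == lost, rebuilt == lost)
```

Because the parity was updated rather than recomputed from the surviving data alone, it still encodes the failed disk's chunk, and the rebuild returns the original contents.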

I/O Size – In sequential access, data is written or read continuously in sequence to or from the disk. The I/O size of the application generally determines the other I/O size parameters for RAID 5. The application's I/O size should match the stripe width, and the stripe width should also be a multiple of the disk allocation unit (DAU) of the file system. Thus the I/O issued by the application is packaged into stripes that write complete chunks to the file system.

Example: A five disk RAID 5 - The file system writes 8k blocks, while the application uses 64k requests. Since four disks will hold data, make the stripe width 64k. Thus each stripe will be a complete request, and each disk will have 16k, or two data blocks, on it.

In random access, data is read from or written to scattered parts of the disk. Ideally an entire request will be served off one disk. Since random access requests are usually small, this should not be too difficult.
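The arithmetic in this example can be captured in a short helper, shown here as a sketch. The function name and the decision to reject sizes that do not divide evenly are assumptions made for illustration.

```python
def stripe_unit_kb(request_kb, total_disks, fs_block_kb):
    """Return (KB per data disk, file system blocks per data disk)."""
    data_disks = total_disks - 1      # one disk's worth of each stripe is parity
    unit, remainder = divmod(request_kb, data_disks)
    if remainder or unit % fs_block_kb:
        raise ValueError("request does not split into whole file system blocks")
    return unit, unit // fs_block_kb

unit, blocks = stripe_unit_kb(64, 5, 8)
print(unit, "KB per disk,", blocks, "blocks per disk")
```

A request size that fails this check would leave partial stripes on every write, which forces the slower partial stripe handling described earlier.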

RAID 6 – Data is striped one data block at a time across all disks, and parity is also distributed over all disks, but two independent parity blocks are written for each stripe. RAID 6 will survive two disk failures, but takes more disk space, requires a minimum of four disks, and is somewhat slower than RAID 5. Requires a specialized and expensive hardware controller.

RAID 7 - Provided only by Storage Computer Corporation, and very expensive. RAID 7 uses hardware calculation of parity and caching to get around the problem of the parity disk bottleneck in RAID 4. It is fast on all reads and writes.

RAID 0+1: In this type of RAID, a striped volume (0) is constructed, then the entire striped volume is mirrored (1) to a second striped volume. In RAID 0+1 volumes, if a single disk is lost, one half of the mirror is disabled, since a striped volume cannot function with even one disk missing.

Advantages of RAID 0+1: You can break the mirror, take one mirror offline, and back up from that mirror. That isn't possible with RAID 1+0. RAID 0+1 is as secure as any mirrored volume, but is fast because the striping spreads the write/read load across numerous disks. The probability that the loss of two disks is catastrophic (the two disks are in different stripes) decreases slightly with increasing numbers of disks.

Disadvantages: After a disk fails and must be replaced, RAID 0+1 volumes sync very slowly. Every subdisk in the stripe must be resync'ed since the whole stripe is mirrored, although Dirty Region Logging (DRL, a feature of Veritas Volume Manager) can speed this process.

For the RAID 0+1 configuration shown in the figures, the probability of losing the entire mirrored device with the loss of two disks is 9/15 or 60%.

Possible combinations of two-disk loss, assuming two mirrored stripes, A and B, each with three disks, 1-3:

Results in total loss of the volume:

1. loss of disk 1 from stripe A and loss of disk 1 from stripe B

2. loss of disk 1 from stripe A and loss of disk 2 from stripe B

3. loss of disk 1 from stripe A and loss of disk 3 from stripe B

4. loss of disk 2 from stripe A and loss of disk 1 from stripe B

5. loss of disk 2 from stripe A and loss of disk 2 from stripe B

6. loss of disk 2 from stripe A and loss of disk 3 from stripe B

7. loss of disk 3 from stripe A and loss of disk 1 from stripe B

8. loss of disk 3 from stripe A and loss of disk 2 from stripe B

9. loss of disk 3 from stripe A and loss of disk 3 from stripe B

Will not result in total loss of the volume:

1. loss of disk 1 and disk 2 from stripe A

2. loss of disk 1 and disk 2 from stripe B

3. loss of disk 2 and disk 3 from stripe A

4. loss of disk 2 and disk 3 from stripe B

5. loss of disk 1 and disk 3 from stripe A

6. loss of disk 1 and disk 3 from stripe B
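The two lists above can be reproduced by enumerating every pair of disks. This sketch assumes that any pair of failed disks is equally likely; the variable names are illustrative.

```python
from itertools import combinations

# Two stripes, A and B, with three disks each.
disks = [(stripe, n) for stripe in "AB" for n in (1, 2, 3)]
pairs = list(combinations(disks, 2))

# The volume is lost when the two failed disks are in different stripes.
fatal = [pair for pair in pairs if pair[0][0] != pair[1][0]]
survivable = [pair for pair in pairs if pair[0][0] == pair[1][0]]

print(len(fatal), "of", len(pairs), "pairs are fatal")
```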

RAID 1+0

RAID 1+0 is called mirror-stripe RAID or RAID 10; Veritas calls it StripePro (a marketing term). In RAID 1+0, mirrors are constructed, then the mirrors are striped, the inverse of the RAID 0+1 design. This is very reliable and very fast, and exactly as costly as RAID 0+1: you will require twice as much disk space as you have data.

The advantage of RAID 1+0 is that it re-syncs much more quickly than RAID 0+1. When a disk fails, only the disks in the affected mirror must be re-synced, rather than the entire stripe. The other mirrors are not affected at all.

You cannot break a RAID 1+0 device and back up the offline mirror. This is a disadvantage compared to RAID 0+1. However, this is not a recommended method of backing up in any case, since any data written to the active half of the mirror during the backup will be unprotected.

RAID 1+0 offers greater availability than does RAID 0+1. If one disk of a six disk/three mirror RAID 1+0 is lost, the device will continue to function, since the loss of half a mirror will not cause the loss of the mirror. The loss of any two disks of a six disk/three mirror RAID 1+0 has a 20% probability (3 of 15 combinations, enumerated below) of causing the loss of the entire device. As the number of disks increases, the probability of the loss of an entire mirror decreases more rapidly than it does in RAID 0+1.

RAID 1+0 must be constructed using a layered volume format. Such a volume is simple to construct using VMSA or vxassist, but can be quite complex to understand. A layered volume consists of two or more actual volumes which are then treated as subdisks in the construction of another, overlying volume. When a RAID 1+0 volume is created, the individual mirror volumes are constructed and started, and then objects are created in Volume Manager which identify those volumes to VM as being a type of subdisk called a subvolume. VM then treats the subvolumes as subdisks in the construction of another, overlying striped volume. Thus a set of mirrored volumes is used in the construction of a striped volume.

Possible combinations for the loss of two disks for the RAID 1+0 configuration consisting of three striped mirrors, A, B, and C, each with two disks:

Will result in total loss of the volume (both halves of one mirror are lost):

1. loss of disk 1 from mirror A and disk 2 from mirror A

2. loss of disk 1 from mirror B and disk 2 from mirror B

3. loss of disk 1 from mirror C and disk 2 from mirror C

Will not result in total loss of the volume:

1. loss of disk 1 from mirror A and disk 2 from mirror B

2. loss of disk 1 from mirror A and disk 2 from mirror C

3. loss of disk 2 from mirror A and disk 1 from mirror B

4. loss of disk 2 from mirror A and disk 1 from mirror C

5. loss of disk 1 from mirror B and disk 2 from mirror C

6. loss of disk 2 from mirror B and disk 1 from mirror C

7. loss of disk 1 from mirror A and disk 1 from mirror B

8. loss of disk 1 from mirror A and disk 1 from mirror C

9. loss of disk 1 from mirror B and disk 1 from mirror C

10. loss of disk 2 from mirror A and disk 2 from mirror B

11. loss of disk 2 from mirror A and disk 2 from mirror C

12. loss of disk 2 from mirror B and disk 2 from mirror C
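The same counting can be done for larger configurations to compare the two layouts. The sketch below assumes that any pair of failed disks is equally likely, that RAID 1+0 uses two-disk mirrors, and that RAID 0+1 uses two stripes; the function name and its parameters are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def two_disk_loss(n_groups, disks_per_group, fatal_when_same_group):
    """Fraction of two-disk failures that destroy the whole volume."""
    disks = [(g, d) for g in range(n_groups) for d in range(disks_per_group)]
    pairs = list(combinations(disks, 2))
    fatal = sum((a[0] == b[0]) == fatal_when_same_group for a, b in pairs)
    return Fraction(fatal, len(pairs))

for total in (6, 8, 10):
    raid10 = two_disk_loss(total // 2, 2, True)    # fatal: both disks of one mirror
    raid01 = two_disk_loss(2, total // 2, False)   # fatal: one disk in each stripe
    print(total, "disks: RAID 1+0", raid10, " RAID 0+1", raid01)
```

For six disks this gives 1/5 for RAID 1+0 and 3/5 for RAID 0+1. As disks are added, the RAID 1+0 figure keeps falling toward zero, while the RAID 0+1 figure only approaches one half.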