SAM-FS and QFS File System Tuning Issues

This work benefited from discussions with Lance Evans to whom I owe thanks for his generosity, knowledge and patience.

Introduction

The QFS file system is a high performance file system designed for use in a SAN environment. This paper will discuss performance tuning the QFS file system and QFS file system configurations. These performance issues also apply to a lesser extent to the basic file system SAM-FS. All parts of this discussion apply equally to archived (SAM-QFS) and unarchived (QFS) file systems and I will make no further distinction between them.

If you want to get the best possible performance out of your QFS file system you have to understand three aspects of the file system environment: 1) file system usage, 2) application I/O, and 3) hardware topology. We will consider each of these issues in turn, then discuss tuning the QFS file system.

Performance Tuning Issues

1. File system usage

File system usage above all determines how you should set up a QFS file system, and the more uniform the characteristics of files in a file system, the more successful your performance tuning. There is no performance penalty in QFS for having numerous file systems, so you can segregate files according to their properties into separate file systems. Applications should write to dedicated file systems, and if the application creates multiple file types, those should be sent to separate file systems. User home directories should similarly be placed in their own file system. The better you can segregate your files, the better your performance will be.

The following file characteristics should be considered when determining which files should be placed into which file systems:

Size: We have already looked at DAUs and disk device equipment types and the importance of tuning these to the size of files. Small files should be written in small DAUs so that file system space is not wasted. Larger files can be written in larger DAUs sized to optimize performance. The md devices are best used for file systems of small or mixed size files, like user home directories, because of their dual DAU scheme. The mr devices have a single DAU scheme, so work best with large files of consistent size where the dual DAU scheme would waste system overhead without improving disk usage.

Type of I/O: random or sequential. Do writes to the file system typically occur randomly within the file, or is data written in an ordered sequence at the end of the file? An x-ray image would almost certainly be sequentially written and read, beginning to end. A database of customer orders might be sequentially written but randomly read.

Read and write characteristics: Are files mostly written? Mostly read? Streaming video would be written once and read repeatedly, most likely sequentially. Files that are mostly written will get the biggest performance boost from data disk tuning. Files that are mostly read may benefit from tuning metadata as well as data.

2. Application I/O

You must know whether I/O is sent to disk in a continuous stream (direct I/O) or in defined writes of a specific size (paged I/O) by the application that writes to your file system. In either case, you must know the size of the writes produced. You must also know how the application handles file locking. Most applications use the UNIX system call lockf() to handle file locking but distributed databases handle file locking within the application itself.

3. Hardware Topology

On QFS file systems, the hardware used for data and metadata will be configured very differently, so we will discuss data disks and metadata disks separately.

Data storage: Each QFS file system is made up of disk devices. Those disk devices may be RAID-5, RAID-1 or even RAID-0 volumes. They may also be as simple as disk partitions. QFS makes no distinctions among these and has no awareness of how the device itself is written; the Unix Virtual File System (VFS) handles that. QFS stripes writes by default across all disk devices in a file system, but this striping includes no redundancy. Redundancy in QFS is provided exclusively by the underlying RAID-1 or RAID-5 disk devices. QFS file systems typically use RAID-5 volumes as disk devices. For some solutions using QFS, disk devices may be mirrored (RAID-1) volumes, but this is not common. In reality you can use any kind of LUN as a disk device for a QFS file system with two exceptions. You absolutely must not use a Solaris virtual LUN, and in a traditional array the physical disks that make up the file system cannot have any portion in use by any other file system, or performance will be affected.

For simplicity, in this discussion we will assume RAID-5 disk devices made up entirely of whole disks. We can then speak of the disks that make up the RAID-5 volume and the disk devices that make up the QFS file system.

In the case of a data storage system like the Sun StorEdge 9960 or 9990 (SE9990), many performance tuning issues are eliminated because the array efficiently handles distribution of writes (but not reads) whether those writes are to dedicated disks or “plaid” configurations. The extent to which this is true depends on the array configuration and its usage, but generally as long as you do not “blow through” your cache, that is, write to the cache faster than the SE9990 can distribute those writes to disk, it will perform optimally regardless of which physical disks are included in QFS disk devices. For reads the SE9990 is less efficient than for writes. If your files are mostly read, you will need to pay close attention to the physical disks where your data is located. As the usage of a SE9990 array increases, it is also possible that even write performance may suffer because some physical disks are being too heavily used, so even on a data storage array you cannot ignore traditional RAID-5 issues as you manage and grow your QFS file systems.

Metadata storage: The separation of data and metadata is a major part of why QFS is a high performance file system. Metadata disk devices are never seeking or reading data, so they are always available for retrieval of metadata. The more physical disks dedicated to metadata, and the faster those disks, the faster retrieval will be; the more inodes and directories on each disk, the slower it will be. Metadata should therefore be sent to many small, fast, dedicated disks for best performance. Exactly how much metadata should be placed on a single metadata disk device depends on file system characteristics. The more frequently files are read, the less metadata should be stored on each disk. Sun recommends that metadata disk devices each hold between 350 and 750 Mbytes of metadata. Ideally the metadata devices are small solid state disks, mirrored for improved availability and performance, and separate from the data storage array containing data disk devices. This ideal is rarely implemented because it is too expensive. It is much more common for metadata to be placed on a small, separate RAID-1 array.

Increasingly metadata disk devices are configured from arrays such as the StorEdge 9990 that include large amounts of cache. Metadata disks are usually read much more than they are written, so the cache will probably not improve metadata performance much. Disks containing metadata must still be immediately accessed any time metadata is read. Unless you pin frequently used metadata in your cache, the physical disks used for metadata storage in such an array should be dedicated.

We will now turn to the means of tuning disk devices to the file system usage, application I/O and hardware topology of your file system. Every file system will be different, and so it is impossible to list hard and fast rules for tuning. Some guidelines to file system tuning and management will be covered, but you must customize your tuning to your own situation.

I will begin by defining the terms used in describing array configuration and behavior. Disk performance terminology is prolific and confusing so I have arbitrarily selected the terms used in this paper; I make no claim of having chosen the most correct.

Review of Standard Striping

Conventionally the term stripe width is used to describe the number of disks to which data is written in a striped volume (such as a RAID-5). Similarly, the total data written in a single write across all the devices in a striped volume is called the stripe size and is equal to the stripe width times the chunk size/interlace/stripe unit size/stripe depth, where all these terms refer to the amount of data written to a single device. We will use the term stripe unit size to refer to the amount of data written to a single device in a striped volume. The stripe unit size and stripe width are configured into the hardware in a hardware RAID, and configured into the software in a software RAID. For best performance, one full stripe worth of data should be written to a striped volume at a time, as this avoids a read-modify-write operation (see end of handout for definition). For example, if a RAID-5 volume (which would be used as a single disk device by SAM-FS/QFS) has a total of 9 disks with 1 of the 9 being the parity disk, the number of data disks or stripe width, is 8. If the RAID stripe unit size is 64k, then the stripe size is 64 kbytes * 8 = 512 kbytes and the optimal write is also 512 kbytes. It is also efficient to write an integral multiple of 512 kbytes such as 1024 kbytes.
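The arithmetic in the example above can be checked with a short Python sketch. The function names are mine, chosen only for illustration, and the full-stripe test assumes the write begins on a stripe boundary.

```python
KB = 1024

def stripe_size(total_disks, parity_disks, stripe_unit_bytes):
    """Return (stripe width, stripe size in bytes) for a parity-striped volume."""
    width = total_disks - parity_disks      # only data disks count toward the width
    return width, width * stripe_unit_bytes

def is_full_stripe_write(write_bytes, stripe_bytes):
    """True if the write covers a whole number of stripes (assumes aligned start)."""
    return write_bytes > 0 and write_bytes % stripe_bytes == 0

# The 9-disk RAID-5 from the text: 8 data disks, 64 kbyte stripe unit.
width, size = stripe_size(total_disks=9, parity_disks=1, stripe_unit_bytes=64 * KB)
print(width, size // KB)                        # 8 512
print(is_full_stripe_write(1024 * KB, size))    # True: two full stripes
print(is_full_stripe_write(768 * KB, size))     # False: leaves a partial stripe
```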

Striping in QFS

In SAM-FS/QFS, the use of the term stripe width is completely different from the conventional use. Instead it refers to the multiplier of the DAU required to produce the data write that goes to each disk device in the SAM-FS/QFS file system. We will refer to these writes as chunks to distinguish them from the stripe units written to RAID-5 disks. The size of the chunk written to each device in a SAM-FS/QFS file system is the DAU multiplied by the SAM-FS/QFS "stripe width".

The chunk of data written to the disk device by the kernel module samfs will be distributed across the disks that make up that disk device by the software or hardware RAID controller. The same amount of data is written to each disk device in a file system regardless of the device type, so if you configure a QFS file system from RAID-5 volumes with varying stripe sizes, you will have poor performance on at least some of them.

For SAM-FS file systems and for QFS file systems composed of md and mr data devices, these "stripe widths" are set by default so that each chunk written to a device is about 128 kbytes. If the DAU is larger than 128 kbytes, the stripe width is set to 1 so that one DAU is written to each disk device. The size of the default chunk written to each disk device is the same as the default value of the UNIX maxphys kernel variable, which controls the maximum request size that the kernel will pass through to the device drivers. Thus, for a QFS file system with a DAU of 64 kbytes, the default stripe width will be 2 (because 64x2=128). The stripe width can also be set as an option to the mount command with -o stripe=n for data disk devices and -o mm_stripe=n for metadata disk devices; “n” is replaced by the value of the QFS stripe width. The maximum stripe width is 255.
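As a rough illustration of the default described above, the following Python sketch computes the stripe width that brings the chunk to about 128 kbytes. The helper names are hypothetical, and rounding down for DAUs that do not divide 128 evenly is my assumption, not documented QFS behavior.

```python
DEFAULT_CHUNK_KB = 128   # default chunk size described in the text

def default_stripe_width(dau_kb):
    """Stripe width that brings DAU * width to about 128 kbytes (never below 1)."""
    if dau_kb >= DEFAULT_CHUNK_KB:
        return 1                                 # one DAU per disk device
    return max(1, DEFAULT_CHUNK_KB // dau_kb)    # rounding down is an assumption

def chunk_kb(dau_kb, stripe_width):
    """Size of the chunk written to each disk device."""
    return dau_kb * stripe_width

for dau in (16, 32, 64, 128, 256):
    w = default_stripe_width(dau)
    print(dau, w, chunk_kb(dau, w))
```

For a 64 kbyte DAU this gives a stripe width of 2, matching the example in the text.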

It is also possible to round robin file writes among the devices in a file system by setting the option to mount -o stripe=0 or -o mm_stripe=0. If the QFS stripe width is set to zero, the chunk written to each disk device is one file, regardless of the file’s size. If a file is larger than a single disk device, the write continues onto the next disk device in the file system.

Be careful because we are discussing two different levels of writes: a) Writes of chunks by the SAM-FS/QFS file system to the disk devices that make up the file system followed by b) writes in which the chunk of data is distributed by the RAID controller over the physical devices that make up the disk device. The chunk size is equal to the SAM-FS/QFS stripe width times the DAU. The stripe unit size is set in the RAID configuration.

The default striping behavior for QFS file systems is listed in the table below:

File system type        Metadata    Data
QFS standalone          Striped     Striped
QFS (striped groups)    Striped     Round-robin
QFS shared file system  Striped     Round-robin

QFS File System Data Writes

SAM-FS/QFS stripes have no parity so no read-modify-write operations are performed. For a file system whose constituent disk devices are disk partitions or simple mirrors, there is therefore no reason to tune writes, other than setting the DAUs so that disk space and overhead are conserved.

If writes to a file system composed of RAID-5 volumes are mostly random, or if files are small, it is also not generally useful to tune writes. The largest improvement in performance in QFS comes when you tune QFS writes to the output of an application that produces paged sequential output, and also to the stripe size of the RAID-5 volumes that constitute QFS file system disk devices. You want each write to each disk device in your QFS file system to match the stripe size of your underlying RAIDs, thereby avoiding read-modify-write operations and fully exploiting parallel writes. You also want the writes to equal the output of the application so that system overhead is not wasted. Some applications allow you to determine the size of writes, and you can usually set your RAID-5 stripe size, within some limits. The RAID-5 stripe size should be set equal to the application write size (the application output may also be an integral multiple of the stripe size). Then the QFS write (usually the DAU of the file system) should be matched to the application output.

In SAM-FS/QFS file systems a write of some particular size can be achieved by a combination of varying the SAM-FS/QFS stripe width, and varying the DAUs. For example, assuming that the RAID-5 disk devices on a QFS file system are configured with a 512 kbyte stripe size, you could simply set the DAU to 512 kbytes, and leave the stripe width at the default of 1, so that each write is equal to the stripe size. You could also set the DAU to 64 kbytes and the stripe width to 8 and get the same result - each disk device in the file system will be written in chunks of 512 kbytes, which is then distributed to the disks in the underlying RAID-5 volume.
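The trade-off between DAU and stripe width can be enumerated mechanically. This Python sketch lists the combinations that produce a given chunk; the candidate DAU list is hypothetical and should be replaced with the DAU sizes your device type actually allows.

```python
MAX_STRIPE_WIDTH = 255   # maximum stripe width given in the text

def stripe_width_options(target_chunk_kb, candidate_daus_kb):
    """Return (DAU, stripe width) pairs whose product equals the target chunk."""
    options = []
    for dau in candidate_daus_kb:
        if target_chunk_kb % dau == 0:
            width = target_chunk_kb // dau
            if 1 <= width <= MAX_STRIPE_WIDTH:
                options.append((dau, width))
    return options

# Target: a 512 kbyte chunk to match a 512 kbyte RAID-5 stripe size.
print(stripe_width_options(512, [16, 32, 64, 128, 256, 512]))
# [(16, 32), (32, 16), (64, 8), (128, 4), (256, 2), (512, 1)]
```

The pairs (64, 8) and (512, 1) are the two configurations discussed in the text.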

You may need to configure stripe width on file systems composed of md disk devices, where the DAUs are restricted to 64 kbytes. That may be less than your stripe size, so you can increase the size of the chunks you write by increasing the stripe width. You might also configure the stripe width if you have RAID-5 stripe sizes less than 64 kbytes or if you have a few very large files in a file system with mostly small files. In the latter case you could leave your DAU at a small value so as not to waste disk space on writes of small files, but set the stripe width to a value such that large files are written in full stripes.

Write performance may also be fine-tuned by varying the stripe width value. So far we have discussed setting the DAU (or DAU times stripe width) equal to the RAID-5 stripe size. This yields a large increase in performance by avoiding read-modify-write operations. You may also be able to improve performance somewhat by setting the DAU*(stripe width) to an integral multiple of the RAID-5 stripe size instead. There is no formula for determining what the stripe width should be in this case. You will have to try out different values of the stripe width and see how performance responds under conditions simulating those of production. On Sun StorEdge 9990s and similar arrays I have commonly seen the best performance with stripe width values in the range of 8-12 times the RAID-5 stripe, but the only way to tell is by testing. The stripe width is valuable in this kind of tuning even if you have not needed it to set your DAU equal to the RAID-5 stripe size because you do not have to reinitialize your file system every time you want to vary the size of the writes transferred to the disk device.

QFS file systems allow mr devices to be written in DAUs of up to 64 Mbytes, and so the DAU on an mr device should be set equal to the stripe size of the RAID, and the stripe width set at 1. Hypothetically you could multiply the maximum DAU of 64 Mbytes by the maximum QFS stripe width of 255 to produce a maximum chunk of around 16 Gbytes. According to Sun Support, however, 64 Mbytes is the largest write you should attempt to configure (they have not seen a stripe larger than 12 Mbytes, and that was data fed from a satellite).

Make sure that your files are larger than your DAU. If you set the DAU on a QFS file system to 512 kbytes because that matches your stripe size, but the typical file is about 32 kbytes, you will not improve performance - you'll just waste a lot of disk space. Similarly files that are written randomly (as opposed to sequentially) will not be improved by matching stripe sizes to writes because you will usually not be writing to a complete stripe. Such files should be placed in their own file system and round robined.

Striping improves performance when you have very large writes of large files or with databases. In these cases, having parallel writes to multiple devices provides better performance. Otherwise striping is less efficient than round robin, particularly for shared file systems where you may have many clients accessing different files at once. If each client must tie up all the disk devices to read one file, read performance deteriorates.

QFS File System Metadata Writes

Metadata is mostly random and mostly read. It consists of small directories and one very large .inodes file. Writes are usually small and the DAU is set at 16 Kbytes. Metadata performance is therefore very dependent on the hardware topology and there is not much you can do to tune it by varying stripe widths. Metadata will usually perform best on a RAID-1 configured to allow reads from both submirrors. The faster the underlying disk devices and the smaller and more numerous the disks, the better the read and write performance.

The single writer, multiple reader configuration of QFS requires that metadata be striped with a stripe of 1, and unless you are using small solid state disks, metadata should generally be striped so no one disk gets too full. If you are using solid state disks, metadata should be “round robined” by using the -o mm_stripe=0 option to the mount command. The metadata disk actually contains only one major item, the .inodes file, so the result of “round robining” the metadata disk is really concatenation. The first disk is filled, then the next is used. Round robining improves performance for the random I/O that is typical of metadata, since only one disk must be accessed to read or write the inode for one file. All metadata disks can then be read simultaneously to provide metadata access for multiple files. For large disks, however, round-robining is impractical. Since solid state disks are rarely implemented as metadata disk devices, most metadata is left at the default stripe of 1.

QFS Performance and mount options

Solaris limits the number of bytes that the kernel can pass to a device driver in a single request. The maxphys kernel parameter controls the amount of space in RAM that the kernel will fill with data prior to passing that data to a device driver. The kernel must locate this much space in RAM each time it passes data, so the larger the maxphys value, the greater the performance overhead. As a result, the default value of maxphys is set at the smallest reasonable value: 128 kbytes (the same as the default chunk written by QFS). If you set the chunk size to a value larger than 128 kbytes for any of your QFS file systems, you should increase maxphys correspondingly, or you will not get the benefit of the larger chunk; only 128 kbytes will be transferred to disk at a time. Keep in mind that there is a small performance penalty for increasing this value. It can be adjusted (in this example to 8 Mbytes) by adding a line to the file /etc/system:

set maxphys=0x800000 (or maxphys=8388608 if you prefer decimal)

If you are experimenting with different values of the stripe width, remember that the total chunk must be less than or equal to maxphys.
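A small Python sketch of this check follows. It computes the chunk for a given DAU and stripe width and formats the matching /etc/system line. The function names are hypothetical; only the "set maxphys=" syntax and the 128 kbyte default come from the text.

```python
KB = 1024
DEFAULT_MAXPHYS = 128 * KB   # default maxphys described in the text

def required_maxphys(dau_kb, stripe_width, current=DEFAULT_MAXPHYS):
    """Smallest maxphys (bytes) that lets one whole chunk through; never lowers it."""
    chunk = dau_kb * KB * stripe_width
    return max(chunk, current)

def etc_system_line(maxphys_bytes):
    """Format the /etc/system entry in hexadecimal, as in the example above."""
    return "set maxphys=0x%x" % maxphys_bytes

# A 1 Mbyte DAU with a stripe width of 8 gives an 8 Mbyte chunk.
print(etc_system_line(required_maxphys(dau_kb=1024, stripe_width=8)))
# set maxphys=0x800000
```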

The parameters sd_max_xfer_size and ssd_max_xfer_size determine the maximum size of writes to SCSI and fibre channel connected disks, respectively. They are set at 1 Mbyte by default but may be tuned to allow larger transfers in /kernel/drv/sd.conf or /kernel/drv/ssd.conf. These parameters act as a choke on the amount of data transferred to disks, so that it does not overrun the disk buffer. The default value is simply a “safe” value; unlike maxphys, increasing (s)sd_max_xfer_size improves performance as long as the size of the buffer is not exceeded.

To set sd_max_xfer_size to 8 Mbytes, add the following line to /kernel/drv/sd.conf:

sd_max_xfer_size=0x800000;

The entry for ssd_max_xfer_size has a parallel format. The semi-colon at the end of the line is required, and in Solaris 9 and 10, this line must be placed at the beginning of the file.

If you are using software RAID, there may be I/O request size limits in the volume manager. Make sure these are equal to or greater than your desired request size. For example, in older versions of VxVM, vxio:vol_maxio is only 256k by default. Some fibre channel HBAs do not allow transfers of greater than 8M. This is not tunable.

QFS also has two mount options that affect file transfer performance: writebehind and readahead. The writebehind parameter determines the amount of data accumulated in a buffer by the file system before it will be passed to storage. The default is 512 kbytes. For applications that produce paged I/O of a size that can be matched to a RAID-5 stripe size, this parameter is not useful; the application itself provides paging. But if the application produces paged I/O of an odd size, setting the writebehind value to an integral multiple of the stripe size/DAU allows you to accumulate one full stripe worth of data to send to the array. A RAID-5 with a stripe size of 512 kbytes should have the writebehind set to 512, 1024, etc. kbytes, so that 512 kbytes can be written into each stripe at a time, avoiding read-modify-write operations.
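One way to pick a writebehind value is sketched below in Python. Choosing the smallest multiple of the stripe size that holds at least one application write is my assumption; the text only requires an integral multiple. The function name is hypothetical.

```python
def writebehind_kb(app_write_kb, stripe_kb):
    """Smallest multiple of the stripe size that holds one application write."""
    multiples = -(-app_write_kb // stripe_kb)    # ceiling division
    return max(1, multiples) * stripe_kb

# Odd-sized application writes against a 512 kbyte RAID-5 stripe.
print(writebehind_kb(300, 512))   # 512
print(writebehind_kb(700, 512))   # 1024
```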

An application that does not cache data internally and which reads data sequentially from the disk may benefit from readahead. Readahead is the amount of data accumulated in the kernel buffer that will be passed as paged I/O to the application. The default is 1024 kbytes, but it can be set to any value that is a multiple of 8 kbytes. Readahead should be set to a value larger than the application request size, so that data is ready to read in when the application requests it. Reasonable values are 2-4 times the RAID-5 stripe size. If the application uses small files, or has small, random reads, readahead can deteriorate performance. The default value is suitable for applications that require large pieces of sequential data. If the application requires small pieces of data, or randomly read data, set readahead to the typical request size of the application. If you are short on RAM and have an application with multiple streams, you should be careful that you do not end up using all your memory on readahead.
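The guideline of 2-4 times the RAID-5 stripe size, combined with the 8 kbyte granularity, can be expressed as a short Python sketch. Rounding up to the next multiple of 8 kbytes is my assumption, and the function name is hypothetical; treat the result as a starting range for testing, not a recommendation.

```python
def readahead_range_kb(stripe_kb, low=2, high=4, step_kb=8):
    """Return (min, max) readahead in kbytes, each rounded up to a multiple of step_kb."""
    def round_up(value):
        return -(-value // step_kb) * step_kb    # ceiling to the next multiple
    return round_up(low * stripe_kb), round_up(high * stripe_kb)

print(readahead_range_kb(512))   # (1024, 2048)
print(readahead_range_kb(50))    # (104, 200)
```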

If an application produces well-formed direct data streams, the forcedirectio mount option can be used to allow direct I/O. By default direct I/O is disabled unless you have also set the mount option mh_write. If mh_write (discussed in the shared QFS paper) is set, I/O is mostly direct.

Stripe Groups

A stripe group is a set of disk devices grouped together in the mcf file. Every write to a stripe group is always striped across all devices in the group, one DAU to a device. Multiple stripe groups make up a single QFS file system. A write to a single stripe group in a file system is always striped across the stripe group; writes to the file system may then be striped across the stripe groups that make up the file system or round robined between stripe groups. Stripe groups are therefore a way of layering stripes within QFS. They are designed to be used to store extremely large sets of files, such as paired audio and video feeds.

The minimum allocation of disk space for a stripe group is not one DAU (by default, 64 kbytes) as it is with other disk types. Instead, for stripe groups the minimum allocation of disk space is one DAU multiplied by the number of disk devices in the stripe group (note that this is disk devices in the stripe group, not disks in the RAID). This allocation is written across all devices in the stripe group, so each disk device in a stripe group is written with one DAU, the same as a typical QFS file system. This improves efficiency for file systems that require large writes because only one bit in the bit map is used to represent the write to all the disk devices in the stripe group, but it can lead to major wastes of disk space if file systems are not set up correctly. A write of one byte to an 8-device stripe group with a DAU of 128 bytes will waste 7 * 128 + 127 bytes of disk space. Thus stripe groups are most efficient for extremely large files of predictable, large write size. The DAU can be set such that it is equal to some integral multiple of the application I/O, and such that one minimum allocation (the DAU times the number of disk devices) is an integral multiple of the stripe size. By default, files are round robined between stripe groups in a file system.
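The wasted-space arithmetic above generalizes easily. The following Python sketch is illustrative only: the function names are mine, and all sizes are expressed in the same unit as the DAU.

```python
def min_allocation(dau, devices_in_group):
    """Minimum allocation for a stripe group: one DAU on every device in the group."""
    return dau * devices_in_group

def wasted_space(file_size, dau, devices_in_group):
    """Space allocated but not used when a file of file_size is written."""
    alloc = min_allocation(dau, devices_in_group)
    allocations = max(1, -(-file_size // alloc))   # ceiling division, at least one
    return allocations * alloc - file_size

# The example from the text: one byte written to an 8-device group, DAU of 128.
print(wasted_space(1, 128, 8))    # 1023
print(7 * 128 + 127)              # 1023, the same figure written out by hand
```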

Striping across a stripe group creates a file system little different from an ordinary QFS file system using mr devices. The idea behind stripe groups is to permit streaming data in parallel to different sets of disk devices, so the default is round-robin, not stripe.

Disk cache devices for stripe groups are set in the mcf file using the same group name for each device in the group. Thus a simple, 2-group file system called stripefs might look like this in the mcf file:

stripefs            100  ma  stripefs
/dev/dsk/c0t1d0s0   101  mm  stripefs
/dev/dsk/c0t2d0s0   102  g0  stripefs
/dev/dsk/c0t3d0s0   103  g0  stripefs
/dev/dsk/c0t4d0s0   104  g1  stripefs
/dev/dsk/c0t5d0s0   105  g1  stripefs

Stripe groups are always QFS file systems, never SAM-FS file systems. The first two entries in the mcf file reflect this - they are standard QFS file system declarations for the file system itself and a metadata device. The four data devices in the stripe groups differ from those we have already seen for a QFS file system only in the equipment type identifier. Rather than md or mr, the device type identifier begins with a g, and is followed by a number that allows QFS to identify all devices in the stripe group. The devices labeled g0 belong to one stripe group, while the g1 devices belong to the second. Both groups belong to file system stripefs. As with other disk cache device type identifiers, the manufacturer, size, etc. of the disks are irrelevant to the device type identifier. Only the intended use determines the identifier.

Mismatched stripe groups

Stripe groups are generally matched for size. However, mismatched stripe groups may be used for any application that provides multiple data streams of very different size. They are often used to write streamed audio and video simultaneously. Since video files are much larger than audio files, they can be streamed to a larger stripe group, at the same time that audio files go to a smaller group. Such mismatched stripe groups must be round robined. They do not support striping across the stripe groups as matched stripe groups do.

With mismatched stripe groups it is essential that files be written to a specific stripe group, though this can be done for any stripe group with the setfa command. In the example above the video stripe group is g0, and the audio stripe group is g1. If you send audio files to a directory called "/stripefs/audio" and video files to a directory called "/stripefs/video" you can make sure they go to the audio and video stripe groups respectively with these commands, executed in /stripefs:

# setfa -g0 video

# setfa -g1 audio

You might use this with formats, such as that used on DVDs, that require audio and video files to be in separate directories.

As always, performance considerations for any striped array apply to stripe groups. Devices should be on different controllers, or striping will provide little or no performance advantage. Striped devices should be identical in size to each other, or disk space will be wasted. Arrays with larger numbers of disks are faster and less reliable, etc.

Definitions used in this paper:

Stripe width- usual: The number of devices over which a stripe is written, not including the parity device. (Some writers also use this term to refer to the number of bytes sent to a set of contiguous blocks on a disk in a striped device to which data is written in one operation.)

Stripe width- SAM-FS: The integral number of DAUs written to each disk device in a file system, where one disk device is ordinarily a RAID-5 LUN or RAID-1 LUN. The default value sets the chunk (the DAU times the stripe width) to about 128 kbytes; if the DAU is greater than 128 kbytes the stripe width defaults to 1. It can be reset with the -o stripe=n option to mount. Allowable range of n is 0 to 255 where stripe=0 is round robin (no striping, each file is completely written to one disk device).

Chunk / interlace (Solaris Volume Manager) / stripe unit size (Veritas VM) / stripe depth - The number of bytes written to a single disk device in one operation.

Stripe size - the total amount of data sent to all devices in a striped volume in a single write operation. Conventionally stripe size = stripe width * chunk. In SAM-FS/QFS each disk device in a file system (which may also be a striped volume in its own right) is written with an amount of data = SAM-FS/QFS stripe width * DAU.

Data alignment - matching the DAU of the SAM-FS/QFS file system to the stripe size of the RAID.

Read-modify-write operation - When a write covers only part of a stripe, for example when a partially written stripe is later completed (fewer records were originally written than the stripe size), a function called "read-modify-write" must take place. A read-modify-write operation reads the old data and the old parity block, uses them with the new data to calculate the new parity block, and writes the new data and the new parity block. This requires additional disk I/O and processing.

This is also the reason that the parity is distributed in RAID-5, instead of being on a dedicated parity disk. When full-stripe-writes occur, it makes no difference if parity is written on one disk or distributed. However, a dedicated parity disk becomes a bottleneck when multiple simultaneous read-modify-write operations occur, as each must access the parity disk twice, once to read the parity, and once to write the new parity. Spreading the parity over all disks eliminates the bottleneck.
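The difference between a full-stripe write and a read-modify-write can be seen in a toy Python model of XOR parity. This is a sketch of the general RAID-5 parity technique, not of any particular controller; the block contents and the function name are made up for illustration.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, as RAID-5 parity does."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Full-stripe write: parity is computed from the new data alone, with no reads.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(*data)

# Partial write to stripe unit 1: read old data and old parity, then update both.
new_unit = b"\xaa\xbb"
new_parity = xor_blocks(parity, data[1], new_unit)   # read-modify-write
data[1] = new_unit

# The updated parity matches what a full recomputation would give.
assert new_parity == xor_blocks(*data)
```

The partial write costs two reads and two writes, and both the read and the write of parity land on whichever disk holds the parity for that stripe.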