Configuring the QFS Shared File System

 

Introduction

 

We have looked at QFS standalone file systems and the single writer, multiple reader QFS file system. The third QFS configuration is the shared QFS file system, which was originally called the multiple writer, multiple reader file system. The shared QFS file system is a distributed file system in which data disks are directly connected to all servers and clients, and metadata disks are connected only to servers. At any time a single server provides metadata to clients over the network, while clients read and write data directly over fibre channel connections. It is possible to set up one or more secondary servers to which metadata sharing services may be failed over, but only one server can be active at a time. No load balancing is possible with this configuration of the QFS file system.

 

As of release 4.5 of the SAM-FS/QFS software, it is also possible to create shared file systems of type SAM-FS.

 

The QFS shared file system may be archived as SAM-QFS or used without archiving. The current metadata server performs all archiving as it would if it were part of a standalone QFS configuration. There is no difference in the configuration of archiving for any of the three configurations of the QFS file system, so in this paper all discussions apply equally to shared QFS and to shared SAM-QFS.

 

The shared QFS file system allows much faster file sharing than NFS, in part because data goes directly to the client from the data disks, and in part because the data does not have to be interpreted by a protocol stack as it comes in over the network. NFS typically transfers data at around 70 Mbytes per second; QFS may transfer as much as 3 Gbytes per second for small numbers of clients. On the other hand, metadata transfer rates are no better with QFS than with NFS, so for small files QFS provides little performance benefit, and for operations requiring only metadata seeks, performance may be quite poor. QFS also requires client access to disks, so installations with large numbers of clients all accessing the same disks will see performance deteriorate as usage increases. NFS scales well with the number of clients, performing nearly as well for one hundred as it does for one, and is the better choice in that case. It is fairly common for customer installations to use QFS shared file system clients as NFS servers for large numbers of NFS client systems, taking advantage of the benefits of both distributed file system protocols. Sun estimates that, compared to the standalone QFS file system, the shared QFS file system shows a slight performance decline, on the order of 10%.

 

It is essential that all hosts on a particular platform run the same release of the QFS software. It is best for all hosts to also have the same PROM (or BIOS) firmware release and the same OS release with the same patches, although mixed environments do work. All UIDs, usernames and GIDs should be the same on all hosts, and time stamps should be synchronized with NTP. All mount options must be the same on both metadata servers.

 

Details of supported configurations change with every release of the software. Check the Release Notes for more information.

 

 

 


 

Shared QFS daemon sam-sharefsd

 

All shared file system processes are controlled by sam-sharefsd. This stateless daemon establishes the socket and handles validation, block allocation, metadata operations, and locking. The sam-sharefsd daemon communicates with inetd on the server system, requesting the service defined in /etc/services by a port number added when the software is installed. By default, QFS uses port 7105, but this can be changed to any free port as long as the same value is used on all systems.

 

sam-qfs                        7105/tcp                      # SAM-QFS

 

As the above entry from /etc/services shows, QFS uses the TCP protocol of the transport layer.

 

On the client, sam-fsd starts one sam-sharefsd daemon for each shared file system when the mcf file is read.  On the metadata server system the master daemon sam-fsd also starts one sam-sharefsd daemon per shared file system. 

 

In release 4.6 of the software, the daemon sam-sharefsd is started when the file system is mounted. Troubleshooting the shared QFS file system always requires reading the sam-sharefsd trace file /var/opt/SUNWsamfs/trace/sam-sharefsd.
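For example, to confirm that the per-file-system daemons are running and to watch the trace file while you reproduce a problem, you might use commands such as the following (standard Solaris commands; adjust for your site):

# ps -ef | grep sam-sharefsd

# tail -f /var/opt/SUNWsamfs/trace/sam-sharefsd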

 

Controlling File Access

 

The shared QFS file system adds a level of complexity to that of the single writer, multiple reader file system, because it allows writes by clients. For this reason, the metadata is controlled by a single server which locks files during client reads and writes. The server grants a read, write or append lease on a file to a client, and for the duration of the lease, the server locks the file, and the client has unlimited read, write or append use of the file.

 

File locking background: Unix applications open files with functions such as fopen(), to which they pass a file name and a read, write or append mode. Read opens the file for reading, write truncates an existing file or creates a new one for writing, and append opens the file for writing at its end. Any application can use any of these modes when it opens a file. For example, vi uses only read and write, never append, no matter what you do to the file. When you open a file in vi, the application performs a read operation and reads the file into a buffer. You edit the file in the buffer, and when you issue the command ":w" vi opens the file for writing and overwrites the entire file. During the write operation the file is locked so that no other process can write to it at the same time and corrupt it. Both the read and the write take only a fraction of a second, and the only time the file is locked is while it is actively being written, not while you are editing it in the buffer. A file may also be opened in append mode: "tee -a" appends its input to the end of the file until you end the input with Ctrl-D. An append operation also requires file locking. A Unix application locks a file with system calls such as lockf(), as vi does. The lock prevents more than one user from performing a write or append operation on the file for an unspecified period, until the first user to open the file has finished with it.

 

This mechanism by itself does not work for files shared over a network. Network file servers must provide daemon processes that can take a lock request from a client and issue a lockf() or other system call to lock the file during the write operation, preventing corruption from simultaneous access by multiple users. NFS versions 2 and 3 use a daemon, lockd, that handles file locking. The lockd daemon on the client requests a lock on a file; the lockd daemon on the server actually performs the locking until the client's lockd tells the server it has finished writing. Leases may also be used in this way, as they are in NFS version 4. When a file is requested, the server passes a file handle and a lease to the client. Read leases are usually non-exclusive, so multiple clients can read a file simultaneously; write and append leases are typically exclusive of all other leases. The lease specifies that the client can read, write or append to the file until a specified time. During the lease the file is locked. When the lease expires the lock is released unless the client renews the lease.

 

QFS uses leases to control file locking for data writes. Metadata writes require no special locking protocol because only one server writes metadata. Clients trigger file locking by requesting metadata from the server for a file. The lease is passed along with the metadata and allows the client to open a file in read, write or append mode, as appropriate, for a period of time granted by the server. When the client gets a write or append lease it knows the server has locked the file and begins the data write. On the server, the daemon sam-sharefsd locks the file until the lease is up or until the client relinquishes the lease.

 

Multiple clients (and processes) can have read leases on the same file at the same time because no data blocks are modified by the read operation and no lock is required on the file. A write operation allows modification of an existing block, while append requires allocation of new blocks.

 

By default leases last 30 seconds; clients can request a renewal of an existing lease. Leases terminate when the operation completes, or when the lease time is up and another client needs the lease (for write or append) or when the server is unable to extend the lease.

 

Normally only one client can write or append to a file at a time. Database software that manages file write access itself can allow simultaneous writes to the same file if the mh_write option is set when the file system is mounted; the mh_write option should only be used with such an application. Unless mh_write is set and the application performing the writes controls its own file locking, only one host can write to a file, and under any circumstances only one host at a time can append to a file. If multiple hosts can write (i.e. mh_write is set), then all I/O will be direct, so if you set both the writebehind and mh_write mount options, writebehind will be ignored. Paged I/O transfers data through the page cache a page at a time, which considerably speeds up data access; direct I/O bypasses the page cache and transfers each request directly between the application buffer and the disk. If there are multiple writers on a file, direct I/O must be used for write operations because transferring a whole page could overwrite a write being made by another client.

 

You must still use the forcedirectio mount option with mh_write if you want to ensure direct I/O, because I/O from a single node to a particular file will be paged. Only when another node tries to write that file does I/O become direct: the metadata server forces all nodes to flush their data for that file, and the file is then accessed in directio mode. This process repeats for each file; directio does not occur until two nodes want to write the same file at the same time. This applies to individual nodes, not to threads within one node. Multiple threads on one node can write to a file if the qwrite mount option is specified; file locking is not an issue in that case because Solaris manages contention between intra-node threads as it usually does.
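As a rough sketch, the mount options discussed above could be combined on the command line; the file system name qfs1 and mount point /qfs1 are illustrative, and mh_write should only be set when the application manages its own write locking:

# mount -F samfs -o shared,mh_write,qwrite,forcedirectio qfs1 /qfs1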

 

The lease duration is configurable using the mount options -o rdlease=n, wrlease=n, and aplease=n on the server (for read, write and append leases where n is 15-600 seconds).  Shorter leases generate extra traffic on the network, especially for applications which write or read a lengthy stream of data. Streaming data from video or a satellite might be written to a file system where write and append leases were set to 600s. Longer leases may cause another client to have to wait longer for access; unless your file system is used primarily for data written over long periods of time, the default lease is reasonable.  All leases must expire before failover can occur, so the lease term must be added to any other time required for failover. 
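For example, a file system that holds long streaming writes might be mounted with lengthened write and append leases; the values and names below are illustrative, not a recommendation:

# mount -F samfs -o shared,wrlease=600,aplease=600 qfs1 /qfs1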

 

 


 

Authentication

 

In the single writer, multiple reader QFS file system, there are no authentication issues. This file system is very secure because only the writer can change files, and the writer can be isolated from the main network. In the shared QFS file system, authentication must be handled by the metadata server. On this server, one file called /etc/opt/SUNWsamfs/hosts.<file system name> is set up for each shared file system. Each “hosts” file contains the names and IP addresses of all clients permitted to access the specified file system. Different file systems may support different client systems. There is no system-wide authentication. Metadata will be transferred only to clients whose hostnames and IP addresses are included in the “hosts” file, after which clients read data directly.  The data in the /etc/opt/SUNWsamfs/hosts.<file system name> file is written into the shared file system superblock at the time the file system is initialized or at the time the superblock is updated using the samsharefs command; the file is not consulted during normal functioning of the file system.  Clients of a shared QFS file system therefore do not use the /etc/opt/SUNWsamfs/hosts.<file system name> file, but it may be helpful to have a copy of the server’s file on the client system as a reference for the administrator.

 

This system of authentication permits very flexible file sharing. Multiple “hosts” files allow a single QFS metadata server to authenticate different clients for different file systems, or a host can be a server for one file system and a client for another. Clients know which system to use as a server because that information is written to the file system’s superblock when the file system is initialized.

 

QFS Shared File System Network Issues

 

The QFS shared file system functions at the application layer of the network protocol stack. This has enormous implications for configuring and troubleshooting the file system: all packet handling is done by the implementation of the networking protocol, not by the QFS software, so troubleshooting and configuration issues can be worked out using standard networking knowledge.

 

The following sequence of events illustrates a server-client exchange for the shared QFS file system “pxcr”. It shows how the QFS file system functions fit into the TCP/IP protocol implementation on Solaris:

 

1) The client sends a packet to the server with a request for metadata. In the Internet header of the packet is a source IP address and a destination IP address.  The source address will be the IP address of the adapter through which the client’s Internet layer decided to send the packet. The destination address must be the IP address associated with the server in the /etc/opt/SUNWsamfs/hosts.pxcr file and written into the superblock at the time the shared QFS file system “pxcr” was initialized. The client obtains this address from the superblock of the file system.

 

2) The server processes the packet and decides whether to respond to the request. It looks at the source IP address of the packet and checks it against the list of client IP addresses originally listed in the “hosts.pxcr” file and written into the superblock. If the IP address is in the list, it responds to the request. If the IP address is not in the list, it does not respond.

 

 


 

3) If the server responds, it generates data which it passes to the transport layer of the TCP/IP protocol for encapsulation. The data will be sent to the source IP address of the original packet. The connection is established and further exchange of packets occurs.
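One way to observe this exchange, assuming the default port 7105 and an interface named ce0 (the interface name is site-specific), is to capture the traffic on the metadata server with snoop:

# snoop -d ce0 port 7105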

 

Given this information, how would you configure automatic adapter failover for a QFS file system server? The server’s IP address is encoded in the QFS shared file system superblock. That superblock can be changed only by using the samsharefs command, which means you cannot change the server IP address quickly or without operator intervention. You must find a way to fail over the server’s existing IP address to a backup adapter on the subnet, which is exactly what IPMP is designed to do. The QFS shared file system functions at the application layer of the TCP/IP model, so it has no awareness of IPMP, which functions at the Internet layer. Your QFS IPMP group can therefore be set up as it would for any other application. Packets sent by clients will continue to flow to the server; the protocol stack will ensure that the packets are passed to the application.
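A minimal sketch of such a configuration on Solaris 10, using link-based IPMP and assuming a metadata server hostname of qfsmeta and interfaces ce0 and ce1 (these names, and the choice of link-based rather than probe-based failure detection, are site-specific assumptions): the file /etc/hostname.ce0 would contain

qfsmeta netmask + broadcast + group qfs_ipmp up

and the file /etc/hostname.ce1 would contain

group qfs_ipmp standby up

With this in place, the address known to the clients (and recorded in the superblock) moves to the standby adapter if the active one fails.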

 

This is one example of how, with a good understanding of your network model, you can solve shared QFS file system problems knowing only that the information in the "hosts" file is encoded in the superblock.

 

Configuration of Shared QFS:

 

Check that software packages are installed, that the system is running Solaris 9 or above, that all disk devices are visible to metadata servers, and that all data disk devices are visible to clients. As always, run a backup in case something goes horribly wrong.

 

The format command is an easy way to check the World Wide Numbers (WWNs) of disks to verify that the same ones are being seen on every system in the network. You may also find it helpful to use the volname subcommand of format, which allows you to give the disk a short name. The name will be visible on all systems attached to the disk when they run format.
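A session might look roughly like this; the disk selection and volume name are illustrative:

# format

(select the disk from the numbered list)

format> volname qfs1data

format> label

format> quit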

 

The rule of thumb with the QFS file system: If in doubt, make it the same.  In the mcf file, use the same family set name for the file system and the same equipment ordinals on servers and clients. Include the same data disk devices, although their logical device names will vary.  All the UIDs and GIDs of users on each system must be identical. If user maryann on server1 has the UID 2451, user maryann must also have the UID 2451 on client2.  Time stamping must also be coordinated because leases expire at a set time, so all systems on the network must be running NTP off the same time server.
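Two quick checks, using the example user above and standard Solaris tools (run them on every host and compare the output):

# getent passwd maryann

# ntpq -p

The UID and GID fields returned by getent must match on all hosts, and ntpq -p should show every host synchronized to the same time server.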

 

Metadata Server Configuration:

 

1. Create the QFS file system in the mcf file. All entries are the same as for an unshared file system except you will add the keyword "shared" to the Additional Parameters column for the first (file system) entry only. Device entries do not have the “shared” keyword:

 

Sample mcf File: File System Declarations for a Shared File System

 


 

 

# Equipment            Eq    Eq      Family   Device   Addl
# Identifier           Ord   Type    Set      State    Params
# ----------           ---   ----    ------   ------   ------
qfs1                   20    ma      qfs1     -        shared
/dev/dsk/c1t2d2s1      21    mm      qfs1
/dev/dsk/c6t1d0s0      22    mr      qfs1

 

 

2. Create the file /etc/opt/SUNWsamfs/hosts.<file system name>. This file will be written into the superblock when the file system is initialized. It specifies the names and IP addresses of the adapters on client hosts permitted to request metadata for the shared file system. Make sure entries for the name and IP address of each adapter on client systems are also included in the /etc/inet/hosts file or in whatever name resolution you use. In the "hosts" file you must list the names as they are returned by the name resolver.

 

In the file hosts.<file system name> primary server entries will look like:

 

<servername>             <IP>                1                      -           server

 

secondary server entries are:

 

<servername>             <IP>                2                      -

 

client entries are:

 

<clientname>              <IP>                -                       -

 

 

If your name resolution comes from NIS or /etc/inet/hosts, a complete file might look like:

 

# File /etc/opt/SUNWsamfs/hosts.pcrx
# Host      Host IP            Server     Not     Server
# Name      Addresses          Priority   Used    Host
# ----      ---------          --------   ----    ------
psca        108.197.142.2      1          -       server
pscb        108.197.142.3      2          -
psde        108.197.142.4      -          -
psdf        108.197.142.5      -          -

 

If your name resolution comes from DNS, a complete file might look like:

 

 

 

 

 


 

# File /etc/opt/SUNWsamfs/hosts.pcrx
# Host            Host IP            Server     Not     Server
# Name            Addresses          Priority   Used    Host
# ----            ---------          --------   ----    ------
psca.sun.com      108.197.142.2      1          -       server
pscb.sun.com      108.197.142.3      2          -
psde.sun.com      108.197.142.4      -          -
psdf.sun.com      108.197.142.5      -          -

 

3. Inform sam-fsd of the changes:

 

# samd config

 

4. Initialize the file system:

 

# sammkfs -S -a <allocation unit> <file system name>

 

The superblock created by the -S option to sammkfs designates the server for the file system and includes the list of authorized clients. If you change that list of clients, you must either re-initialize the file system or use the command samsharefs to update the superblock. See the man page for samsharefs for details.
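For example, using the qfs1 family set name from the sample mcf file and a hypothetical 64-kilobyte disk allocation unit:

# sammkfs -S -a 64 qfs1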

 

5. Verify that sam-sharefsd is running. Do not proceed until it is.

 

# ps -ef | grep sam

 

6. Make a mount point for the file system and add its entry to /etc/vfstab. This entry will have the format:

 

<file system name>   -    /<mount point>   samfs       -     yes        shared,bg

 

Sun recommends these file systems be mounted in the background so they do not stop the boot if there is a problem with the server. Any mount options can be included in the vfstab or in the file /etc/opt/SUNWsamfs/samfs.cmd.
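Using the sample file system qfs1 and an illustrative mount point of /qfs1, the entry might read:

qfs1     -     /qfs1     samfs     -     yes     shared,bg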

                                                           

7. Mount the file system.

 

# mount <file system name>

 

8. Run the command samsharefs to verify the configuration:

 

# samsharefs   <file system name>

 

Configuring the client:

The server must be configured and the shared file systems created and mounted on the server before you configure the client.

 

 


 

1. Configure the mcf file with entries for the file system. The format for the entry is the same as that for the server’s mcf file except that metadata device entries look like:

 

nodev              <equipment ordinal>  mm      <family set name>      on       

 

nodev is the stand-in for the device name of the metadata device in the client's mcf file, because only the metadata server has access to the metadata devices.

 

Secondary (potential metadata) servers, by contrast, include the logical device names of the metadata devices in their mcf files.

 

With the exceptions noted above, everything in this file must be the same as it is in the server's mcf file. Use the same equipment ordinals, the same family set name, etc.
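Continuing the sample configuration above, a client mcf file might look like the following; the data device name is illustrative and may well differ from the path seen on the server:

# Equipment            Eq    Eq      Family   Device   Addl
# Identifier           Ord   Type    Set      State    Params
# ----------           ---   ----    ------   ------   ------
qfs1                   20    ma      qfs1     -        shared
nodev                  21    mm      qfs1     on
/dev/dsk/c4t1d0s0      22    mr      qfs1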

 

2. Activate the changes:

 

# samd config

 

3. Make a mount point for the file system, edit /etc/vfstab as above, and mount the file system.

 

Mount and unmount order:

 

The server system must be the first to mount the file system and the last to unmount it. Always ensure all clients unmount the file system before unmounting it on the server.

 

Adding a client

 

This procedure allows you to add a new client, change IP addresses, or add secondary servers.

 

Edit the file /etc/opt/SUNWsamfs/hosts.<file system name> on the server to add the new client, change IP addresses or make any other change to the configuration of the file system.

 

Update the binary hosts file on the server. If the file system is mounted:

# samsharefs -u <file system name>

If the file system is unmounted, add the -R option (counterintuitive, but correct):

# samsharefs -u -R <file system name>

 

Removing a client requires that you unmount the file system on the server, which means you must first unmount all clients. It is possible to unmount and unconfigure the client, then to do the server unconfiguration during scheduled downtime. Leaving the client in the configuration is a security hole, however, so it should be removed as soon as possible.

 


 

Failing over the QFS Shared File System Metadata Services

 

Depending on the software release, failover of a QFS shared file system metadata server may be automated and may also be Sun Cluster certified. Check release notes for details. You may perform a manual failover any time you want to perform maintenance on the server. To perform manual failover:

 

1. On the original server issue the command:

 

# samsharefs -s <new server> <file system name>

 

To perform manual failover for a failed QFS shared file system metadata server:

 

1. Make sure the server is down and will stay down. If it comes up, it will try to flush any data in its buffers to the file system because it does not know that failover has occurred. That could cause file system corruption.

2. Wait until the maximum lease time is up.

3. On the new server issue the command:

 

# samsharefs -R -s <new server> <file system name>

 

Failing over archiving:

 

If you want to be able to fail over archiving you must make sure that the secondary server has all the configuration files that are on the primary server. You can simply send a copy over any time you change the files, or you can write a script to send them over once per day to the secondary server. The library catalog must also be available to the primary and any secondary servers. It should be placed on an NFS mounted file system, preferably on an ACSLS server so that the catalog server does not become a single point of failure.  The mcf file for both the primary and secondary server must have complete configuration information for the tape library and drives.
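A minimal sketch of such a copy job, assuming a secondary server named pscb and the usual configuration file locations (the exact file list depends on your site); it could be run from cron or whenever the files change:

#!/bin/sh
# Copy the archiving configuration from the primary metadata server
# to the secondary so that archiving can fail over.
for f in archiver.cmd stager.cmd recycler.cmd releaser.cmd; do
    [ -f /etc/opt/SUNWsamfs/$f ] && \
        scp -p /etc/opt/SUNWsamfs/$f pscb:/etc/opt/SUNWsamfs/
done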

 

As of release 4.6 of the software, the server system behaves like a standalone system with respect to accessing released files. If staging is allowed, the file stages. If the “never stage” attribute has been set, the file is accessed directly from an archive copy, but never written to disk. The client system’s behavior is different. If it tries to access a file and staging is allowed, the file stages to disk cache. If staging is not allowed, it stages anyway. (Jonathan Kennedy and Spencer McEwen).
