Quorum

Failure Fencing, Cluster Amnesia and the Trade-offs of Clustering

Sun Cluster Concepts document:

http://192.18.109.11/819-2969/819-2969.pdf

Documentation on failure fencing:

http://docs.sun.com/app/docs/doc/819-2969/6n57kl13o?a=view

Sun Cluster 3.2 documentation

http://docs.sun.com/app/docs/doc/820-0335/6nc35dge2?a=view

Sun Cluster 3.2 Agent documentation

http://docs.sun.com/app/docs/coll/1574.1

Sun Cluster 3.2 QFS documentation

http://docs.sun.com/source/819-2758-10/chapter6.html

Failure Fencing

Sun guarantees that a Sun cluster can survive any one failure of either hardware or software without interrupting data services. If you have a three-node cluster, for example, and one node fails to boot, the cluster will function with the remaining two nodes. Sun does not guarantee the cluster if two elements of the cluster fail. If that happens, you may lose all data services on the cluster until the failed elements are repaired. Sun clusters can be designed so that they are likely to survive more than one failure, but this is never guaranteed.

If multiple elements in the cluster transport fail so that communication ceases between two or more groups of nodes, the result may be a cluster divided into partitions whose members have no way of knowing the state of nodes in other partitions. If isolated partitions of the cluster were to try to provide the same data services, multiple nodes could end up writing to the same files, causing data corruption. This is the split-brain scenario. To prevent split-brain, there must be a mechanism in the cluster that ensures that failure of multiple elements of the cluster transport does not result in multiple independent instances of a data service. In other words, there must be a mechanism that implements failure fencing. Failure fencing exists solely to guard against the risk of data corruption as the result of the formation of multiple independent partitions.

This document will discuss the elements of failure fencing, starting with the simplest possible solutions and progressing to more complex implementations. The same theme guides the entire discussion: clusters are a high-availability solution, so they must be designed to deliver as much availability as possible at the lowest cost, without compromising performance.

Quorum Votes

Failure fencing is implemented by assigning “quorum votes” to elements of the cluster, then requiring that a node be able to “count” more than 50% of the defined votes in the cluster before it can continue in the cluster or boot into the cluster.

A node can count the following quorum votes:

· Its own quorum vote

· The quorum vote of any node with which it can communicate over the cluster transport

· A quorum vote assigned to a device other than a node, if certain conditions are met:

o The node is directly attached to the quorum device and its “quorum reservation key” is written on or otherwise assigned to the device

o The node is in communication with a node that has the right to count the device’s vote.

Quorum devices will be discussed in a later section of this paper.

Failure Fencing Using Node Quorum Voting

In the simplest implementation of failure fencing, each node is assigned one quorum vote by the cluster. When a node boots, or determines whether it can continue in the cluster in the wake of a failure, it counts its own vote, plus the vote of any other node with which it can communicate over the cluster transport. As long as a node can see more than 50% of the quorum votes, it and any other nodes with which it can communicate over the cluster transport can form a cluster. A node that cannot see more than 50% of the quorum votes panics if it is running, or hangs if it is booting. It is arithmetically impossible for two mutually isolated groups of nodes to each see more than 50% of the quorum votes. At most one cluster can form.
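The majority rule above can be sketched in a few lines of Python. This is an illustration only, not Sun Cluster code; the function names has_quorum and both_groups_can_have_quorum are hypothetical, and the sketch assumes one vote per node and no quorum devices.

```python
def has_quorum(votes_counted: int, votes_configured: int) -> bool:
    """Return True if strictly more than 50% of the configured votes are counted."""
    return 2 * votes_counted > votes_configured


def both_groups_can_have_quorum(total_nodes: int) -> bool:
    """Try every split of the nodes into two mutually isolated groups.

    Each node holds one vote. Returns True only if some split would let
    both groups count a majority at the same time.
    """
    for group_a in range(total_nodes + 1):
        group_b = total_nodes - group_a
        if has_quorum(group_a, total_nodes) and has_quorum(group_b, total_nodes):
            return True
    return False


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, "nodes: two clusters possible?", both_groups_can_have_quorum(n))
```

The comparison is strict, so a group holding exactly half of the votes does not have quorum. That is why no split can give both groups a majority.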

For example, assume that one node of a four-node cluster loses contact with the rest of the cluster when both of its cluster transport adaptors fail. The single isolated node would see only 25% of the quorum votes, and would kernel panic. The other nodes would see 75% of the quorum votes and so would continue to form the cluster. Any data services running on the isolated node would be failed over to a functioning node. In this example, if a subsequent node lost contact with the cluster, no node would be able to count more than 50% of quorum votes, and all nodes would panic. The cluster would completely cease to function. In this case the complete loss of the cluster is an acceptable result, because the two major requirements of the cluster have been met: The cluster survived the failure of a component, and data was not corrupted.

So far we have discussed quorum voting when only nodes have votes, which will not work for a two-node cluster. If one node in a two-node cluster shuts down, the remaining node sees only 50% of the node quorum votes, and must kernel panic. That violates the first rule of clustering: that the cluster will survive any single failure. For the two-node cluster, then, an additional quorum vote must exist outside the nodes themselves. For clusters with more than two nodes, additional quorum votes outside the nodes will increase the cluster’s availability.

Quorum Device Votes

We can allow a two-node cluster to survive the complete failure of one node, and increase the probability that a larger cluster will survive more than one failure, by assigning quorum votes to quorum devices. The rule of quorum still applies, but with more potential votes; as long as a node can count more than 50% of the quorum votes, it can remain in the cluster or boot into the cluster, whether the votes it counts belong to nodes or to quorum devices.

A quorum device is a shared disk, a NAS device, or an unclustered host on the public network to which the cluster assigns one or more quorum votes. Conventionally a quorum device is a shared disk, so this discussion will assume quorum disk devices unless otherwise specified; the discussion of quorum disk devices can be generalized to NAS devices and hosts on the network. A quorum disk should be used for data so that it is regularly accessed. That way any failure will be noticed long before the disk is needed for the quorum vote.

A quorum disk is always shared by at least two nodes. The designation of a disk as a quorum device and the number of votes assigned to the device are held only in the Cluster Configuration Repository (CCR); neither is written to the disk itself. For highest availability, a cluster has one fewer quorum votes assigned to devices than there are nodes in the cluster. In documentation this optimal number of quorum votes is usually written “N-1 quorum votes, where N is the number of nodes in the cluster.”

Most Sun Clusters include just two nodes. Such a cluster is required to have one quorum device assigned one quorum vote. No other configuration is acceptable or supported.

Only a two-node cluster is absolutely required to have a quorum device vote. Clusters with three or more nodes may have N-1 quorum device votes, where N is the number of nodes in the cluster. They may have N-2 quorum device votes. They may have no quorum device votes at all. The only requirements for quorum device votes are:

1. A two-node cluster must have one quorum device vote.

2. No cluster may have more than N-1 quorum device votes.

3. Each quorum disk must be assigned one fewer quorum votes than the number of nodes to which it is attached. This number is also written “N-1,” where N is the number of nodes attached to any particular quorum disk.

In a four-node cluster, quorum disk A might be attached to nodes 1 and 2, quorum disk B to nodes 2 and 3, and quorum disk C to nodes 3 and 4. Each of those three disks has one quorum vote (N-1 where N is the number of attached nodes) and the total quorum votes are 3 (N-1 where N is the total of all nodes).

These rules can be confusing when a quorum device is attached to more than two nodes. For highest availability, a four-node cluster will have three quorum device votes (N-1, where N is 4). If that cluster has one quorum disk attached to all four nodes, that disk will be assigned three quorum votes (N-1, where N is the number of nodes attached to the disk). The total number of quorum votes needed for highest availability and the number of votes assigned to the disk device are the same because the disk is attached to all the nodes in the cluster. If that quorum disk had been attached to only three of the four nodes in the cluster, it would have been assigned two quorum votes (N-1, where N is the number of attached nodes). This is a supported configuration, but will not provide maximum availability. For maximum availability, another quorum disk attached to two nodes would have to be configured.
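The two uses of “N-1” can be kept apart with a small Python sketch. It is a sketch of the rules as stated in this paper, not of any Sun Cluster tool; the function names device_votes and check_device_votes are hypothetical.

```python
def device_votes(attached_nodes: int) -> int:
    """Votes for one quorum disk: one fewer than the nodes attached to it."""
    if attached_nodes < 2:
        raise ValueError("a quorum disk is shared by at least two nodes")
    return attached_nodes - 1


def check_device_votes(total_nodes: int, attached_counts: list[int]) -> int:
    """Return the total quorum device votes, or raise if a rule is violated.

    attached_counts holds, for each quorum disk, the number of nodes
    attached to that disk.
    """
    total = sum(device_votes(n) for n in attached_counts)
    if total_nodes == 2 and total != 1:
        raise ValueError("a two-node cluster must have one quorum device vote")
    if total > total_nodes - 1:
        raise ValueError("no cluster may have more than N-1 quorum device votes")
    return total


if __name__ == "__main__":
    print(check_device_votes(4, [2, 2, 2]))  # disks A, B and C, two nodes each
    print(check_device_votes(4, [4]))        # one disk attached to all four nodes
    print(check_device_votes(4, [3]))        # supported, but below the N-1 maximum
    print(check_device_votes(4, [3, 2]))     # add a two-node disk to reach N-1
```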

A two-node cluster is a special case of the situation in which the quorum device is attached to all nodes; it must have one quorum device assigned one quorum vote. In a two-node cluster, the total number of quorum device votes is N-1, where N, the number of nodes in the cluster, is 2, so N-1 equals one. The number of votes assigned to the quorum disk in a two-node cluster is also N-1, where N is the number of nodes in the cluster attached to the quorum disk. This value of N is also 2 so N-1, the number of votes assigned to the single quorum disk, is therefore also one.

A node can count the vote(s) of a quorum device under two conditions: 1) It is attached to the quorum device, and its quorum reservation key (discussed below) is written on the disk OR 2) It can communicate with a node in the cluster whose quorum reservation key is written on the disk. If a quorum device is attached to two nodes in a four-node cluster, and those attached nodes fail, the quorum device vote will be lost.

Quorum votes should not be assigned randomly. The topology of the cluster must be analyzed, and every possible failure scenario considered. A well laid-out cluster can often survive the failure of all but one or two nodes, if that is desired.

Improving Availability with Quorum Device Votes

A two-node cluster is required to have a quorum device vote; it cannot survive the complete failure of a single node without one. If one node fails, the other node can count its own quorum vote and the vote of the quorum device. That is two votes out of three, which is enough to continue to form the cluster. Clusters with more than two nodes can survive a single failure without quorum device votes and are not required to have any, but usually do because they improve availability.

As we have seen, a four-node cluster without quorum device votes can survive the failure of only one node. If a four-node cluster has a single quorum device, attached to all four nodes, the quorum device will be assigned three quorum votes for a total of seven. That cluster can lose three nodes and still continue to function, because the surviving node can count its own quorum vote and the three votes assigned to the attached quorum device for a total of four. A four-node cluster with three quorum disks, each attached to two nodes, can survive the loss of any two nodes, but the loss of three nodes will result in the loss of access to one of the quorum devices, and the cluster will fail. This is less expensive to implement than the previous example, but, predictably, results in decreased availability.
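The arithmetic for the single-quorum-device case can be checked with a short Python sketch. It assumes that every surviving node is attached to the quorum device, still has its key on it, and can communicate with the other survivors; the function names survives and max_failures are hypothetical.

```python
def survives(surviving_nodes: int, total_nodes: int, device_votes: int) -> bool:
    """True if the survivors count more than 50% of all configured votes.

    Assumes every survivor can count all of the quorum device votes.
    """
    if surviving_nodes == 0:
        return False
    counted = surviving_nodes + device_votes
    configured = total_nodes + device_votes
    return 2 * counted > configured


def max_failures(total_nodes: int, device_votes: int) -> int:
    """Largest number of failed nodes the cluster can tolerate."""
    return max(
        failed
        for failed in range(total_nodes)
        if survives(total_nodes - failed, total_nodes, device_votes)
    )


if __name__ == "__main__":
    print("4 nodes, no device votes:", max_failures(4, 0))
    print("4 nodes, one device with 3 votes:", max_failures(4, 3))
    print("2 nodes, no device votes:", max_failures(2, 0))
    print("2 nodes, one device with 1 vote:", max_failures(2, 1))
```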

A detailed failure analysis of a four-node cluster with three quorum devices is at the end of this document.

Quorum device votes allow improved availability in the case of node failure, but they greatly increase the complexity of implementing failure fencing. So far this paper has only covered failure fencing without quorum devices. The next section discusses failure fencing in a cluster that has quorum device votes. It covers:

1. Writing to disks that implement the SCSI-2 and SCSI-3 standards

2. Writing to disks in Sun Cluster using reservation keys

3. Quorum disks and quorum reservation keys

4. Racing for quorum

Failure Fencing With Quorum Devices

Background: Writing to SCSI-2 and SCSI-3 Disks

Multi-hosted SCSI-2 disks are “non-shared,” so one host at a time can reserve the disk and write to it using the SCSI-2 Reserve/Release command set. Until that host completes its write and releases the reservation, no other host can access the disk, even to place data in the disk’s cache. SCSI-2 reservations are written into the disk cache, not to the disk itself, and data written to the cache does not persist across power failure or reboots of attached hosts.

The SCSI-3 standard allows multiple hosts to concurrently load data into the disk cache so writes to the disk are more efficient. For example, if multiple hosts are writing to the same database file, a SCSI-3 disk can organize writes by their location on the disk, regardless of which hosts originated the writes. A SCSI-3 disk does not require reservations, and many do not implement any form of reservations other than the SCSI-2 Reserve/Release command set, which is present for backwards compatibility with software using the SCSI-2 command set.

If a SCSI-3 disk does implement reservations, each host using the disk writes a host-specific 64-bit reservation key onto a private area in the disk cache prior to its first data write to the disk. These keys are never removed during the course of normal operations, and data written to SCSI-3 cache survives power failures and reboots of attached nodes. SCSI-3 reservations are therefore referred to as Persistent Reservations (PR) or Persistent Group Reservations (PGR). These terms are interchangeable.

The first four bytes of the SCSI-3 reservation key are randomly generated by the cluster; all reservation keys written on one disk will have this prefix. The last four bytes are host-specific and are generated by the software using the disk. In the case of Sun Cluster, the host portion of the reservation key is the node number, padded with zeros to 32 bits. A reservation key for Node 1, written in hexadecimal might be 0x4225ef3100000001, while Node 2’s reservation key on the same disk would be 0x4225ef3100000002.
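The key layout described above can be illustrated with a short Python sketch. The function names make_key and split_key are hypothetical; the sketch models only the layout given in this paper, a 32-bit prefix followed by the node number padded to 32 bits.

```python
def make_key(prefix: int, node_id: int) -> int:
    """Build a 64-bit reservation key from a 32-bit prefix and a node number."""
    if not 0 <= prefix < 2**32:
        raise ValueError("prefix must fit in four bytes")
    if not 0 <= node_id < 2**32:
        raise ValueError("node number must fit in four bytes")
    return (prefix << 32) | node_id


def split_key(key: int) -> tuple[int, int]:
    """Split a 64-bit reservation key into (prefix, node number)."""
    return key >> 32, key & 0xFFFFFFFF


if __name__ == "__main__":
    key = make_key(0x4225EF31, 1)
    print(f"0x{key:016x}")
    prefix, node = split_key(0x4225EF3100000002)
    print(f"prefix=0x{prefix:08x} node={node}")
```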

Reservation Keys in Sun Cluster

In a Sun cluster, each attached node writes a unique reservation key on a disk when the disk is first used for data writes or as a quorum device. For disks attached to more than two nodes, a reservation key is written using the standard SCSI-3 PGR implementation, so multi-hosted disks used with Sun Cluster must implement SCSI-3 persistent reservations if they will be attached to more than two nodes. If you write a reservation key using SCSI-3 PGR to a SCSI-2 disk, it damages the disk cache and renders the disk unusable, although Sun Cluster 3.2 now checks for compatibility prior to writing keys. Multi-hosted disks sold by Sun Microsystems, Hitachi and EMC generally implement SCSI-3 persistent group reservations, and may be attached to more than two nodes in a Sun Cluster, but it is critically important to check with Sun Sales or Sun Support prior to purchasing storage for a cluster. Not all arrays are supported for all Sun Cluster configurations.

For a disk attached to only two nodes, by default, reservation keys are written to a private area on the disk using a Sun Microsystems proprietary technology called Persistent Group Reservation emulation or PGRe, because it closely emulates the PGR system implemented under the SCSI-3 standard. It may also be called SCSI-2 Persistent Reservation Emulation or PRE. Sun does not make details of this technology generally available, and it will not be described here. It is not part of any SCSI standard. (From the behavior of hosts in Sun cluster, it is apparent that PGRe does directly call SCSI-2 Reserve and Release commands, since a node that does not have a PGRe reservation key written on a SCSI-2 disk is unable to write to that disk even if it never starts the cluster software. The PGRe implementation is obviously able to prevent a node from placing a SCSI-2 reservation on a disk, even without the node’s co-operation.)

The PGRe method of writing reservation keys is designated using the term “pathcount”. It is also possible to configure a device attached to the cluster to use SCSI-3 PGR keys, by resetting the reservation type to “prefer3.” Resetting the reservation type from “pathcount” to “prefer3” will have no effect unless the disk implements the SCSI-3 standard.

The underlying standard of a disk device used with Sun Cluster may be either SCSI-2 or SCSI-3. Which standard was used in the design of the disk is not relevant to the type of reservation key placed on the disk. If the disk is attached to two nodes, PGRe is used to write keys on the disk. If the disk is attached to three or more nodes, SCSI-3 PGR keys are written on the disk. Of course, the disk attached to three or more nodes must support SCSI-3 PGR, or it will not be qualified for use in the cluster.
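The default choice of reservation mechanism described above can be summarized in a small Python sketch. It models only the default “pathcount” behavior and ignores the “prefer3” setting; the function name reservation_mechanism is hypothetical.

```python
def reservation_mechanism(attached_nodes: int, supports_scsi3_pgr: bool) -> str:
    """Return the mechanism used by default to write reservation keys.

    Two attached nodes use PGRe regardless of the disk's SCSI standard.
    Three or more attached nodes use SCSI-3 PGR, which the disk must support.
    """
    if attached_nodes < 2:
        raise ValueError("a shared disk is attached to at least two nodes")
    if attached_nodes == 2:
        return "PGRe"
    if not supports_scsi3_pgr:
        raise ValueError(
            "a disk attached to three or more nodes must support SCSI-3 PGR"
        )
    return "SCSI-3 PGR"


if __name__ == "__main__":
    print(reservation_mechanism(2, supports_scsi3_pgr=False))
    print(reservation_mechanism(2, supports_scsi3_pgr=True))
    print(reservation_mechanism(4, supports_scsi3_pgr=True))
```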

The presence of its key on a disk allows the node to submit writes to that disk. If the disk implements the SCSI-2 standard, prior to writing to the disk the node will have to place a normal SCSI-2 reservation on the disk as well, but the use of the SCSI-2 Reserve/Release command set to write to a disk is independent of the Sun Cluster requirement that a PGR or PGRe key be present on the disk before it may be written by a node. A node’s reservation key never leaves the disk unless another node removes it. Once a node’s key has been removed from a disk, the node cannot write to that disk, even to add its reservation key. Another node must return the key to the disk.

Disks may be in one of three conditions with respect to reservation keys:

1. Unused disks have no reservation keys.

2. When a quorum disk is designated by the cluster, standard PGR or PGRe reservation keys are written on the disk and recorded by the cluster in the file /etc/cluster/ccr/infrastructure.

3. All other disks have standard reservation keys written to the disk when the disk is registered with the cluster.

Both quorum and non-quorum disks can and should be used for data storage. The only difference between a quorum and a non-quorum disk is that the reservation key of the quorum disk is recorded in the CCR.

The commands to view PGR and PGRe keys are at the end of this document.

Reservation Keys and Quorum Votes

A node can count the vote(s) of a quorum device if its quorum reservation key is written on the disk. If its key is absent, a node cannot write to a quorum disk or count the quorum disk’s vote(s). Quorum reservation keys follow the rules for all reservation keys; how the key is written depends on whether the disk device is attached to just two nodes, or to more than two nodes. If a quorum disk is attached to more than two nodes, the SCSI-3 Persistent Group Reservation (PGR) quorum reservation keys written on the disk indicate which nodes can write to the disk and can count the quorum device’s vote. If a quorum disk is attached to just two nodes, the PGRe mechanism is used.

Failure Fencing Using Quorum Devices

A cluster that implements quorum device votes must use a different mechanism to prevent split-brain than a cluster that implements only node votes. If a node in such a cluster is isolated from the other node(s) of the cluster, it may still be able to count enough quorum votes to form a cluster, even though it cannot communicate with any other nodes. For example, if three nodes are attached to a quorum device with two votes, and the cluster transport completely fails, each of the three nodes will still see three of the five quorum votes: its own and the two quorum device votes. None of the nodes will know if the other nodes have failed, or if the problem is in the transport. By the rules of the cluster, only one of the nodes can form a cluster, and that must be a node that can see at least three quorum votes. In this case, two of those votes will be on quorum devices, and those votes can be counted by any of the nodes, as long as the node’s PGR reservation key is written on the disk. The quorum reservation keys of two nodes must be removed from the quorum disk so that only one of the three isolated nodes can form a cluster.

When a cluster node realizes that another node is no longer communicating with the cluster, it tries to remove the isolated node’s reservation key from any quorum devices. That way, if the isolated node is still active, it will be unable to count quorum device votes, and will panic. At the same time, the isolated node is trying to remove the reservation key(s) of nodes with which it cannot communicate. The result is a race to the quorum disk. The first node to access the disk will win the race and can form a cluster. If all of the nodes in the cluster are isolated from each other, all will race to the quorum disk to reserve it. The first node to gain write access to the disk will dismiss the keys of the other nodes. Those nodes will arrive too late, see that their keys are gone, and panic. The first node alone will form the cluster.
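The race can be modeled with a toy Python simulation. It assumes a complete transport failure, so that every node is isolated from every other, and it assumes that each access to the disk is atomic and happens in the given arrival order. The class QuorumDisk and the function run_race are hypothetical names, and nothing here reflects how Sun Cluster actually talks to the disk.

```python
class QuorumDisk:
    """Toy model of a quorum disk holding one reservation key per node."""

    def __init__(self, node_ids):
        self.keys = set(node_ids)

    def race(self, node_id, unreachable_nodes):
        """Attempt to fence the unreachable nodes.

        Returns True if the caller's key is still on the disk, in which case
        the keys of the unreachable nodes are removed. Returns False if the
        caller's own key has already been removed by an earlier arrival.
        """
        if node_id not in self.keys:
            return False
        self.keys -= set(unreachable_nodes)
        return True


def run_race(arrival_order, all_nodes):
    """Simulate mutually isolated nodes reaching the quorum disk in order."""
    disk = QuorumDisk(all_nodes)
    outcome = {}
    for node in arrival_order:
        others = [n for n in all_nodes if n != node]
        won = disk.race(node, others)
        outcome[node] = "forms cluster" if won else "panics"
    return outcome


if __name__ == "__main__":
    print(run_race(arrival_order=[2, 1, 3], all_nodes=[1, 2, 3]))
```

Whatever the arrival order, exactly one node keeps its key, so at most one cluster forms.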

In the case of a two-node cluster, if the cluster transport fails, both nodes will race for the quorum disk. The first to gain write access will dismiss the key of the other node, which was written on the disk using Sun’s proprietary PGRe technology. The winning node can count the disk’s vote and form a cluster. The losing node must panic and drop out of the cluster. If a node fails, surviving nodes race for the quorum device anyway. There is only one possible outcome to such a race, but the surviving nodes have no way of knowing that. They only know that one node is no longer in contact.

Only nodes attached to storage race for the quorum device, and only if one or more attached nodes are isolated. Consider a four-node cluster with two nodes attached to storage, plus two non-storage nodes. If one or both of the non-storage nodes fail, the storage-attached nodes will not race for the quorum disk because there are no reservation keys to remove. Only attached nodes write reservation keys on shared disks.

Multiple-Node Partitions and the Race for the Quorum Device

If a cluster transport fails so that two or more partitions are formed, each with multiple nodes that are aware of each other, the lowest numbered node in each partition races for any attached quorum devices. If the partitions are unevenly divided, for example if one partition contains two nodes and the other, one node, any partition that includes fewer than half the nodes in the cluster will delay racing for the quorum devices. That way a small partition cannot seize control of the cluster when a large partition is available.
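The selection of the racing node and the decision to delay can be sketched as follows. The function name racing_plan is hypothetical, and the sketch does not model the length of the delay, which this paper does not specify.

```python
def racing_plan(partitions, total_nodes):
    """For each partition, pick the racing node and decide whether it delays.

    partitions is a list of sets of node numbers. The lowest-numbered node in
    each partition races. A partition holding fewer than half of the cluster's
    nodes delays before racing.
    """
    plan = []
    for members in partitions:
        plan.append(
            {
                "racer": min(members),
                "delayed": 2 * len(members) < total_nodes,
            }
        )
    return plan


if __name__ == "__main__":
    # Three-node cluster split into a two-node and a one-node partition.
    for entry in racing_plan([{1, 2}, {3}], total_nodes=3):
        print(entry)
```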

Amnesia Prevention using PGR and PGRe

When a node is down for some period of time, changes may be made to the CCR that will not be propagated to the failed node. If all other nodes in the cluster are subsequently brought down and the failed node rebooted, it will have an obsolete version of the CCR. This problem is called amnesia.

When the node initially went down, its reservation keys were removed from quorum disks by an active node. When it boots, it will see that it cannot count the votes of the quorum disks, and therefore that it cannot form a cluster. The boot will hang, waiting for a node with a reservation key on the quorum device(s) (and therefore a current version of the CCR) to join the cluster. That node will write the reservation key of the first node back on all disks, and will also pass it the current version of the CCR.
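A minimal Python sketch of the boot-time check follows, assuming a node that boots alone and can therefore count only its own vote plus the votes of quorum disks that still carry its key. The function name can_form_cluster_alone and the data layout are hypothetical.

```python
def can_form_cluster_alone(node_id, quorum_disks, total_votes):
    """Decide whether a node booting alone may form a cluster.

    quorum_disks maps a disk name to a tuple (votes, keys), where keys is the
    set of node numbers whose reservation keys are present on that disk.
    """
    counted = 1  # the booting node's own vote
    for votes, keys in quorum_disks.values():
        if node_id in keys:
            counted += votes
    return 2 * counted > total_votes


if __name__ == "__main__":
    # Two-node cluster with one quorum disk worth one vote (three votes total).
    # Node 1 went down earlier, so node 2 removed node 1's key from the disk.
    disks = {"d1": (1, {2})}
    print("node 1 alone:", can_form_cluster_alone(1, disks, total_votes=3))
    print("node 2 alone:", can_form_cluster_alone(2, disks, total_votes=3))
```

In this model the node with the stale CCR cannot reach a majority alone, so it waits until a node that still holds a key joins.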

One Last Problem

If a node fails and reboots, it may try to flush any cached writes to disk. Since data services will have been failed over to other nodes, this would likely result in corruption of data. Once again reservation keys prevent this. Surviving nodes purge the quorum reservation key of a failed node immediately upon realizing that the node is no longer communicating with the rest of the cluster. Those nodes then purge all other reservation keys belonging to the failed node from the remaining shared disks. Until that node completely boots and rejoins the cluster, it cannot write to any shared disks owned by the cluster. Once it has rejoined the cluster, the node recognizes that it is no longer running any data services and discards cached writes. For this reason, boot disks cannot be on shared disks; otherwise a failed node would be unable to boot, because surviving nodes would have accessed its boot disk and purged its reservation key from that disk.

Examples, Commands, Definitions

Counting Votes

Consider a four-node cluster, consisting of Nodes 1, 2, 3 and 4 with three quorum devices: A, B and C. Nodes 1 and 2 are attached to quorum device A. Nodes 2 and 3 are attached to quorum device B. Nodes 3 and 4 are attached to quorum device C. Here are some possible node failure scenarios:

Node(s) failed	Surviving node votes (of 1, 2, 3, 4)	Device votes counted (of A, B, C)	Total votes counted (of 7)	Cluster survives?
1	2,3,4	A,B,C	6	Yes
2	1,3,4	A,B,C	6	Yes
3	1,2,4	A,B,C	6	Yes
4	1,2,3	A,B,C	6	Yes
1,4	2,3	A,B,C	5	Yes
2,3	1,4	A,C	4	Yes
1,2	3,4	B,C	4	Yes
1,2,4	3	B,C	3	No
1,2,3	4	C	2	No

In this example the cluster can survive the failure of any two nodes, but not of three. This kind of analysis should be done for each proposed cluster. Depending on the use of the nodes in the cluster, it may not be desirable for the cluster to continue to function if one or more particularly important nodes are lost. In that case quorum devices can be designated such that the entire cluster fails if those specific nodes fail.
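The analysis in the table can be reproduced with a short Python sketch, which may be a convenient starting point for other topologies. It assumes that all surviving nodes can communicate with each other and that their keys remain on the quorum devices; the names count_votes and survives_all_failures_of are hypothetical.

```python
from itertools import combinations

NODES = {1, 2, 3, 4}
# Quorum device -> nodes attached to it. Each device has (attached - 1) votes.
DEVICES = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
TOTAL_VOTES = len(NODES) + sum(len(a) - 1 for a in DEVICES.values())


def count_votes(failed):
    """Return (survivors, devices counted, total votes, cluster survives?)."""
    survivors = NODES - set(failed)
    # A device vote counts if at least one attached node survives.
    devices = sorted(d for d, attached in DEVICES.items() if attached & survivors)
    votes = len(survivors) + sum(len(DEVICES[d]) - 1 for d in devices)
    return sorted(survivors), devices, votes, 2 * votes > TOTAL_VOTES


def survives_all_failures_of(k):
    """True if the cluster survives every possible failure of k nodes."""
    return all(count_votes(f)[3] for f in combinations(sorted(NODES), k))


if __name__ == "__main__":
    for failed in [(1,), (1, 4), (2, 3), (1, 2), (1, 2, 4), (1, 2, 3)]:
        print(failed, count_votes(failed))
    for k in (1, 2, 3):
        print(k, "failures always survivable:", survives_all_failures_of(k))
```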

Commands

To view SCSI-2 PGRe keys on a disk attached to just two nodes:

# /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d1s2
key[0]=0x42bac6c500000001.
key[1]=0x42bac6c500000002.

To view SCSI-3 PGR keys on a disk attached to three or more nodes:

# /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d1s2
Reservation keys(3):
0x4225ef3100000001
0x4225ef3100000002
0x4225ef3100000003

You can use the DID disk device names or Solaris logical device names with these commands. The result shows the reservation keys assigned to each disk.
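If the output of either command is captured as text, the keys can be extracted with a few lines of Python. This sketch assumes only the two output formats shown above; it does not run the commands itself, and the function names parse_keys and nodes_with_keys are hypothetical.

```python
import re

# A reservation key is printed as 0x followed by 16 hexadecimal digits.
KEY_RE = re.compile(r"0x([0-9a-fA-F]{16})")


def parse_keys(output: str) -> list[int]:
    """Return every 64-bit reservation key found in the command output."""
    return [int(match.group(1), 16) for match in KEY_RE.finditer(output)]


def nodes_with_keys(output: str) -> list[int]:
    """Return the node numbers (low four bytes) of the keys in the output."""
    return sorted(key & 0xFFFFFFFF for key in parse_keys(output))


if __name__ == "__main__":
    pgre_output = "key[0]=0x42bac6c500000001.\nkey[1]=0x42bac6c500000002.\n"
    scsi_output = (
        "Reservation keys(3):\n"
        "0x4225ef3100000001\n"
        "0x4225ef3100000002\n"
        "0x4225ef3100000003\n"
    )
    print(nodes_with_keys(pgre_output))
    print(nodes_with_keys(scsi_output))
```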

Definitions

amnesia - A scenario in which: 1) one node leaves the cluster 2) changes are made to the cluster, resulting in changes to the CCR 3) the rest of the cluster is brought down 4) the node without the current CCR boots. The cluster metadata on that node is not in agreement with the volume managers or with file system assignments, and is corrupt.

CCR - Cluster Configuration Repository - the collection of files in /etc/cluster/ccr used to control the cluster communications. Changes to the CCR made on one node are automatically propagated over the cluster transport to the CCR on other nodes.

dual-hosted array - an array attached to two hosts.

multi-hosted array - an array attached to two or more hosts.

failure fencing - the method by which split brain is avoided. A cluster node must be able to locate a majority of all configured quorum votes or it will kernel panic if it is running, or hang if it is booting.

partition - a subgroup of all nodes in the cluster which are in communication with each other. Partitions form when multiple elements of the cluster transport fail, thereby isolating single nodes or groups of nodes.

PGR/PGRe reservation key - a 64-bit key written on disks by the cluster. It consists of a 4-byte prefix generated by the cluster, followed by a 4-byte host-specific suffix consisting of the node number padded with zeroes. The presence of its key allows a node to write data to a disk. If the disk is further designated a quorum disk, the presence of its key allows a node to count the disk’s quorum vote or votes.

quorum device - a cluster element other than a node that is assigned one or more quorum votes in the CCR. Quorum devices are typically disks but a host on the public network can also serve as a quorum device as can a NAS device.

quorum reservation key - A 64-bit reservation key written on a quorum device, and only on a quorum device. Quorum reservation keys are held in the file /etc/cluster/ccr/infrastructure. Non-quorum disks also hold reservation keys, but those are not recorded by the cluster.

SCSI-2 - An ANSI standard, formally released in 1994, that governs the design of hardware used in computer peripherals such as hard drives, cables, scanners, printers and host bus adapters (HBAs). For the first time, it included an extensive set of standard commands called the Common Command Set (CCS). Sun Cluster employs the Reserve/Release SCSI-2 command set to permit multi-host disk access.

The Reserve/Release command set allows an attached host to place an exclusive write reservation on any disk that implements that SCSI-2 standard. Only the reserving host may write to the disk as long as the reservation is present. When the write is finished, the reserving host must release the disk.

SCSI-3 - A set of ANSI standards for computer peripherals. The first architecture document was released in 1996. It governs the design of hardware used in computer peripherals such as hard drives, cables, scanners, printers and host bus adapters (HBAs). Sun Cluster employs the SCSI-3 Persistent Reservations command set which allows multiple hosts to place reservations on a disk simultaneously.

split brain scenario - A condition in which the cluster transport fails so that isolated nodes or partitions can no longer coordinate, and each may try to provide the same data services, risking data corruption. Avoiding a split brain scenario is the goal of the quorum voting system, which ensures that only one partition continues to provide data services.