Planning for Recycling

 

At any time, the space on a given tape or online disk archive consists of current archive copies, stale archive copies, expired archive copies, and free space. If files are frequently modified, their expired archive copies may consume a large proportion of the space on archive media, and the media in a library or assigned to an archive set eventually fill with expired archive copies. Archiving can then no longer function properly, even though the volume of current archives is far below the total capacity of the library or of the online disk used for archive copies.

 

The recycling process rearranges current archive copies on tape so that one or more mostly‑written tapes contain no current archive copies, that is, they are drained. Space on those tapes can then be reclaimed by relabeling the tape or exporting the tape and replacing it with a new, blank tape.

 

Recycling Background: Archive Copy States

 

A current archive copy contains the same data as the file on disk, and its location is recorded in the file’s inode. When a previously archived file is modified, its archive copies immediately become stale and are marked with the flag “S” in the flags portion of the output of sls -D. Even though the data in a stale archive copy is no longer the same as that in the file on disk, its metadata cannot be deleted from the inode, because the stale archive copy represents the only backup available until the modified file is archived again. Once a stale file is archived again, the metadata from the new archive copies overwrites the old metadata, and the file system no longer has any metadata pointing to the older archive copies. Archive copies for which the file system has no metadata are referred to as expired or obsolete. The terms “expired” and “obsolete” mean the same thing.
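
The three states described above can be illustrated with a small Python model. This is only a sketch: the Inode class, the copy_state function and the location strings are hypothetical names invented for illustration, not part of SAM-QFS. The point it shows is that the state of a copy is derived entirely from inode metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Toy stand-in for a file's inode (hypothetical, not the real layout)."""
    version: int = 1                             # version of the data on disk
    copies: dict = field(default_factory=dict)   # copy number -> (location, archived version)

def copy_state(inode, location, archived_version):
    """Classify one archive copy using inode metadata only."""
    for loc, ver in inode.copies.values():
        if loc == location and ver == archived_version:
            # The inode still points at this copy.
            return "current" if ver == inode.version else "stale"
    return "expired"  # no inode entry references this copy

inode = Inode()
inode.copies[1] = ("VSN001:pos10", 1)       # file archived for the first time
state_after_archive = copy_state(inode, "VSN001:pos10", 1)

inode.version = 2                           # file modified on disk
state_after_modify = copy_state(inode, "VSN001:pos10", 1)

inode.copies[1] = ("VSN002:pos3", 2)        # archived again, old metadata overwritten
state_after_second_archive = copy_state(inode, "VSN001:pos10", 1)
```

In this model the first copy moves from current to stale to expired without anything being written to the tape that holds it.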

 

A current archive copy may be remade in a new location in the process of rearchiving, performed by the archiver. Do not use the word “rearchive” to describe the process that occurs when a file is modified and is archived again. Rearchiving only occurs when a current archive copy is remade in a new location. The location of the original archive copy in the file metadata is replaced with the new location, so that the archive copy in the original location is expired. Recycling rearchives current archives from a target VSN to other VSNs so that the target VSN is drained and the space on it can be reclaimed.

 

Recycling Concepts: The Nature of Expired Archive Copies

 

An expired archive copy is simply one that is unknown to the file system metadata; there is no reference to that archive copy in any inode in the file system. No mark is placed on the tape to indicate that the archive copy in that location is expired, and if you were to remove a tape cartridge from the tape library and read the tar files on it, there would be no difference between expired, stale and current archives.

 

Expired archive copies result from modification and archiving of files, from deletion of files, or from rearchiving or unarchiving of an archive copy. Each of these processes overwrites (or zeroes, in the case of unarchiving) an archive copy’s metadata so it is unknown to the file system in the file system’s current state. The state of an archive copy is determined solely by file system metadata, which means that archive copies that are expired for the file system state backed up in a particular metadata dump might be current in another metadata dump. If you restore a file system’s metadata from a metadata dump, the set of archive copies that the restored metadata considers current, stale or expired will be different from those that the metadata restored from a different metadata dump would consider current, stale or expired.

 

If you are planning to guarantee file recovery for some period of time, you will have to take into account that some of the archive copies your current file system considers expired may be needed for recovery purposes. You can deal with this by using a special recycling process, sam-nrecycler, discussed in detail later in this document, which looks at a directory containing metadata dumps and treats every file current in any of those dumps as current for the recycling run. Alternatively, you could run the regular recycler, sam-recycler, but never on VSNs containing a particular archive copy: archive copy 3, for example. Those VSNs, which would normally hold the archive copy sent to offsite storage, will be excluded from recycling, and can be used to recover lost data. You could also recycle all archive copies in the tape library using only the current file system metadata to determine which archive copies are current and which expired, and subsequently export all recycled VSNs, so they can be re-imported to the tape library later and the data on them recovered. Whatever method you choose to manage file recovery, it must be in place prior to putting your system into production.

 

Recycling Planning Decisions

 

Recycling can be configured in very different ways and your success in implementing recycling depends heavily on archiving configuration, so you must plan for recycling as part of the initial SAM-QFS system design.  If you wait to plan for recycling until the tape library begins to fill up, your choices will already have shrunk drastically. In particular, prior to putting a system into production, decide whether you will recycle by archive set copy or by tape library, and whether you will relabel VSNs or export and replace them.

 

Recycling Planning:  Recycling by Archive Set Copy or Tape Library

 

Recycling of tape media may be performed in two ways: by tape library or by archive set copy. Recycling of online disk archives may only be performed by archive set copy.

 

When you recycle by tape library, the recycler looks through the entire contents of one tape library at a time and selects VSNs to recycle, based on parameters configured for the library in the file recycler.cmd, which is located in /etc/opt/SUNWsamfs. When you recycle tape by archive set copy, the recycler works only on the VSNs assigned to one particular archive set copy at a time, using parameters configured for that archive set copy in the file archiver.cmd. Recycling by archive set copy requires that the “-reserve set” parameter is configured for all archive set copies that might be involved in recycling, because you cannot mix archive set copies on VSNs that will be recycled this way. You can still mix archive set copies on VSNs that are not going to be recycled; for example, you can send copy 3 for all archive sets to the same VSNs if copy 3 is your offsite copy. Only VSNs resident in the tape library can be recycled, and only they need to be.

 

Which of these forms of tape recycling is better for you will depend on the characteristics and modification behavior of your files. Most sites recycle tape by archive set copy, simply because it lets them easily tailor recycling to the different characteristics of their archive sets. If necessary, recycling can be performed just on the VSNs associated with a subset of your archive set copies, thereby limiting the effect of recycling and simplifying troubleshooting. Files that are never modified or deleted will never have expired archive copies on tape. They can be archived in a single archive set, and the VSNs associated with their archive set copies can be ignored during recycling runs.

 

It is possible to combine recycling by archive set copy and recycling by tape library, but it is never necessary, it greatly complicates recycling, and it is not commonly done.

 

Recycling Planning: Relabeling or Exporting

 

The end result of recycling is VSNs that contain only expired archive copies. Reclaiming space in the tape library requires you to either relabel these tapes or export them and replace them with new, blank tapes. The same amount of space is recovered either way. If you want the recycler to automatically export or relabel tapes, copy the file recycler.sh to /etc/opt/SUNWsamfs, and configure it according to the directions contained in the file.

 

Relabeling

 

Each file written to tape is followed by an end-of-data (EOD) mark. That mark allows tape reader utilities like mt to figure out where a file ends, even if the utility doesn’t recognize the file format and cannot identify its header or trailer. The last file on a tape is followed by two EOD marks. The second mark indicates that there are no more files on the tape, and the tape reader utility should not look further. When a file is appended to a tape, the write begins at the second EOD mark, overwriting that mark. There are only two ways to write to a tape that already has data on it: continue from the second EOD mark, or overwrite the existing data starting from the beginning of the tape. It is impossible to overwrite just one file on a tape with a new version of the same file, because, for example, the new version might take more space than the original file.
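
The write restrictions just described can be sketched as a toy Python class. The Tape class and its method names are hypothetical and model only the two permitted kinds of write; they do not correspond to any real tape driver interface.

```python
class Tape:
    """Toy model of a tape: an ordered list of files ending at a final mark."""

    def __init__(self):
        self.files = []

    def append(self, name):
        # A new write can only begin where the existing data ends.
        self.files.append(name)

    def overwrite_from_start(self, name):
        # The only alternative: start at the beginning and lose everything after it.
        self.files = [name]

    def replace_in_place(self, index, name):
        # Rewriting one file in the middle of the tape is not possible.
        raise NotImplementedError("tape cannot rewrite a file in place")
```

Appending grows the list, and the only way to shrink it is to start again from the beginning of the tape.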

 

SAM is equally restricted by the requirements of tape. Each tar archive is written to tape beginning at the second EOD mark. No existing tarfile can be rewritten once that final EOD mark has been placed on the tape. If space on a particular cartridge must be reclaimed then a new final EOD mark must be written on the tape immediately after the tape label, a process that occurs when tape is relabeled. Relabeling employs the same command used by SAM (or the administrator) to label tape in the first place: tplabel:

 

# tplabel -old 1LYNCH -vsn 1LYNCH 100:10

 

The above command will update the time of labeling and other tape metadata in the tape label, but the VSN written to the tape must not be changed. The tape specified has a paper label showing the barcoded VSN “1LYNCH” and that is also the VSN that must be placed in the label.

 

Relabeling tape is the most economical choice, since you do not have to buy a new cartridge. Like most cost-conscious choices, it has a price in availability. When you relabel a tape in SAM, you abandon any files on the tape. Relabeling is destructive and permanent, and will compromise your ability to recover files from a metadata dump. A relabeled tape is also more prone to physical failure, since it has already been used once.

 

Exporting

 

Exporting the VSN and replacing it with a new VSN is more expensive, but it means you can always recover older versions of your file system from metadata dumps, simply by re-importing drained VSNs. Exporting drained VSNs may also be necessary for sites that have regulatory requirements that they keep all data for some period of time. If you export VSNs, you will have to find some way to store them such that they are available for recovery. Some sites that do not need to recover data from drained VSNs do not relabel them and instead export and destroy them to avoid the risks inherent in reusing tape.

 

Recycling Planning: Special Problems

 

If a tape cartridge contains a stale archive copy or a removable media file (see Disaster Recovery for a discussion of removable media files), it cannot be recycled. The “stale” state is transitory, so a cartridge containing a stale archive copy will be recyclable as soon as the modified file is archived. If files are changed so frequently that stale copies are commonly present on all VSNs, it may become difficult to locate VSNs to recycle. Directories change so often that their archive copies are frequently stale, so if you have directories distributed among your VSNs, you greatly reduce your available pool of VSNs for recycling. If you archive directories, they must go to dedicated VSNs so that the presence of stale directories on most of your VSNs does not interfere with recycling. There is really no reason to archive directories, and the best way to deal with this problem is to set the “archivemeta=off” directive in the archiver.cmd file.

 

If a removable media file is archived, the tape on which it is located can never be recycled. There is no value whatever to archiving removable media files, so they should be created in directories flagged “archive -n” and removed as soon as they are no longer needed.

 

Planning Issues for the Recycling Process

 

This section discusses the recycler process and the configuration of recycling using entries in the recycler.cmd and archiver.cmd files. It also covers the differences in recycling various types of archive media, and the sam-nrecycler process.

 

Default Recycling Requirements and Recycling Parameters

| Parameter | Description | Parameter for Recycling by Library | Parameter for Recycling by Archive Set | Default |
|---|---|---|---|---|
| high water mark | Percentage of the total storage space in the library that must be utilized before recycling will occur | -hwm | -recycle_hwm | 95% |
| minimum gain | Percentage of the VSN that must be expired archives before it can be recycled | -mingain | -recycle_mingain | 50% |
| VSN count | Maximum number of VSNs recycled in one recycler run | -vsncount | -recycle_vsncount | 1 |
| data quantity | Maximum amount of current data that can require rearchiving | -dataquantity | -recycle_dataquantity | 1 Gbyte |
| ignore | The inclusion of this parameter allows you to perform a recycling dry run; the recycler runs and writes output to the recycling log file, but does not perform any recycling. It can also be used to disable recycling by tape library. | -ignore | -recycle_ignore | N/A |

 

 

 

The tape recycling process

 

When the command sam-recycler is run, the recycler selects and flags VSNs to recycle and flags any current archives on those VSNs with the rearchive flag. The rearchive flag shows up as the letter “r” in the flags portion of the archive copy entry in the output of sls -D. This flag indicates to the archiver that the archive copy present on the selected VSN must be rearchived.  Once all current archive copies on the tape have been flagged for rearchiving, the recycler then runs the /etc/opt/SUNWsamfs/recycler.sh script and exits.

 

During a recycling run, no flags are placed on the physical tape itself, nor does the recycler actually load or read any tapes. All necessary information about tape utilization is contained in the tape library catalog or in file inodes.
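
How the four limits interact can be sketched in Python. This is a simplified reading of the parameter descriptions, not the recycler's actual algorithm: the Vsn record, the select_candidates function, the "largest gain first" ordering and the sample figures are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Vsn:
    """Hypothetical catalog record for one volume (sizes in Gbytes)."""
    name: str
    capacity_gb: float
    current_gb: float          # data still referenced by inodes
    expired_gb: float          # data no inode references
    has_stale: bool = False    # a stale copy blocks recycling

def select_candidates(vsns, hwm=95, mingain=50, vsncount=1, dataquantity_gb=1.0):
    """Pick VSNs to drain, applying the limits in a simplified way."""
    total = sum(v.capacity_gb for v in vsns)
    used = sum(v.current_gb + v.expired_gb for v in vsns)
    if total == 0 or 100.0 * used / total < hwm:
        return []  # library utilization is below the high water mark
    eligible = [v for v in vsns
                if not v.has_stale
                and 100.0 * v.expired_gb / v.capacity_gb >= mingain]
    # Assumption: prefer the volumes that free the most space.
    eligible.sort(key=lambda v: v.expired_gb, reverse=True)
    chosen, budget = [], dataquantity_gb
    for v in eligible:
        if len(chosen) == vsncount:
            break
        if v.current_gb <= budget:   # current data must fit the rearchive budget
            chosen.append(v.name)
            budget -= v.current_gb
    return chosen

library = [
    Vsn("A", 100, 5, 95),
    Vsn("B", 100, 60, 40),
    Vsn("C", 100, 2, 98, has_stale=True),
    Vsn("D", 100, 10, 90),
]
with_defaults = select_candidates(library)
tuned = select_candidates(library, vsncount=2, dataquantity_gb=9999)
```

With the default values nothing is selected in this example, because even the best candidate holds more than 1 Gbyte of current data. Raising the data quantity and VSN count lets volumes A and D be chosen.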

 

Subsequent to the recycler run, the archiving daemons rearchive the marked archive copies to other volumes, staging released files as necessary. Files marked “stage -n” will be rearchived directly from existing archive copies, usually from Copy 1. Although rearchiving such files uses no space on disk, reading archive copies from tape does require the use of tape drives. Eventually rearchiving is complete, and the marked tape is drained. The first time that the recycler.sh script is invoked after a tape is drained, the script can export the drained tape, relabel it, or do nothing, depending on how the script is configured.

 

Recycling Disks

 

The only way to recycle online disk archive copies is to recycle by archive set copy. The recycler treats each disk VSN configured in diskvols.conf as though it were an archive media library consisting of a single VSN. The high water mark, minimum gain and current data quantity are all applied to the file system on which the online disk archive resides.

 

Online disk archives are placed in the directory specified in the file diskvols.conf. That directory may share a file system with other data or it may have exclusive use of the file system. In either case the recycler views the capacity of the online disk archive as equal to the total space in the file system. For online disk archives, the recycling high water mark tests the proportion of space in the file system used by archive copies. If the high water mark for an online disk archive is 2%, and its file system has 100 Gbytes of space, a total of 2 Gbytes of archive copies on the file system will satisfy the high water mark requirement and recycling can occur. The recycler also looks for the configured minimum gain and current data quantity, but applies those values as well to the entire file system. If the requirements are satisfied, the recycler marks current archive copies in tar files for rearchiving as it does for tape. Rearchiving occurs, and on the subsequent recycling run, the recycler deletes tar files containing only expired copies.  As soon as the expired copies are removed, disk space is freed for reuse.
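
The high water mark arithmetic for a disk archive is simple enough to state as a one-line check. The function name below is hypothetical; it only restates the 2% of 100 Gbytes example as a calculation.

```python
def disk_hwm_reached(archive_copy_gb, filesystem_gb, hwm_percent):
    """True when archive copies occupy at least hwm_percent of the whole file system."""
    return 100.0 * archive_copy_gb / filesystem_gb >= hwm_percent

reached = disk_hwm_reached(2, 100, 2)        # 2 Gbytes of copies, 100 Gbyte file system
not_reached = disk_hwm_reached(1.5, 100, 2)  # 1.5 Gbytes falls short of 2%
```

Note that the denominator is the total size of the file system, not the size of the archive directory, which is why other data sharing the file system does not change the threshold.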

 

For example, the online disk archive tar file f0 contains archive copy 1 of files named file1, file2 and file3. The archive copy for file2 in the archive f0 is expired, and the modified version of file2 has been archived in the tar file f1. The first run of the recycler will mark copy 1 of file1 and file3 for rearchiving. Nothing happens to the tar file on disk during this recycling run. Copy 1 of file1 and file3 are then rearchived into a tar file called f2, leaving f0 drained. During the next recycling run, f0 will be deleted from the disk, and the space it occupied can be immediately reused.
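
The f0, f1, f2 example can be walked through with a short Python simulation. It is a sketch only: recycler_run and archiver_run are hypothetical functions, and the high water mark, minimum gain and data quantity checks are left out to keep the two-pass behavior visible.

```python
def recycler_run(tarfiles, flagged):
    """One recycling pass over a disk archive.

    tarfiles maps tar name -> {file name: "current" or "expired"}.
    Deletes tar files that hold only expired copies, then flags the current
    copies in any tar file that also holds expired ones.
    Returns the names of the tar files deleted in this pass.
    """
    deleted = [t for t, copies in tarfiles.items()
               if all(state == "expired" for state in copies.values())]
    for t in deleted:
        del tarfiles[t]
    for t, copies in tarfiles.items():
        if "expired" in copies.values():
            flagged.update((t, f) for f, state in copies.items() if state == "current")
    return deleted

def archiver_run(tarfiles, flagged, new_tar):
    """Rearchive every flagged copy into new_tar; the old locations become expired."""
    if not flagged:
        return
    tarfiles[new_tar] = {}
    for t, f in sorted(flagged):
        tarfiles[t][f] = "expired"
        tarfiles[new_tar][f] = "current"
    flagged.clear()

disk = {
    "f0": {"file1": "current", "file2": "expired", "file3": "current"},
    "f1": {"file2": "current"},
}
flagged = set()
first = recycler_run(disk, flagged)    # flags file1 and file3 in f0, deletes nothing
archiver_run(disk, flagged, "f2")      # f0 is now drained
second = recycler_run(disk, flagged)   # f0 is deleted
```

After the second pass only f1 and f2 remain, which is why a disk archive needs two recycling runs before any space is freed.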

 

Configuring the Recycling Process

 

Tape and magneto-optical disk recycling can be configured by media library, in the /etc/opt/SUNWsamfs/recycler.cmd file, or by archive set, configured in the /etc/opt/SUNWsamfs/archiver.cmd file. In either case, the recycling process is initiated by issuing the sam-recycler or sam-nrecycler command.

 

*PERFORMANCE ISSUE* The high water mark for libraries should not be left at 95%. Consider that when a tape library is recycled, current archive copies on tapes in the library will have to be moved to other tapes. If only 5% of the library is unused, there may not be enough space to rearchive those copies. In that case you will end up engaging in a painfully slow process of exporting tapes, importing empty tapes, manually rearchiving files, exporting more tapes and reimporting the original exported tapes. For a tape library with 100 or more slots, the high water mark should probably be set at around 50% and the minimum gain (mingain in the table above) at around 80%. That high water mark is low enough that there will be adequate tape for rearchiving, but high enough that recycling doesn’t unnecessarily tie up tape drives. If you are recycling tapes by archive set copy, you should set the high water mark to around 20%, as you are working with a subset of the library. The number of VSNs that can be recycled in one recycling run (vsncount in the table above) can be set to two-thirds of the number of drives, so for a tape library with 10 drives you can set the VSN count to 6. If you recycle by archive set copy, the VSN counts for all archive set copies configured for recycling can add up to two-thirds of the number of drives in the library. More than that will have too much of an impact on the drives, which need to remain available for regular archiving and staging activities.
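
The two-thirds rule of thumb can be written out as a small Python calculation. Both function names are hypothetical, rounding down is an assumption, and the even split across archive set copies is just one possible way of dividing the budget.

```python
def suggested_vsncount(drives):
    """Two-thirds of the drive count, rounded down, but never less than 1."""
    return max(1, (2 * drives) // 3)

def split_vsncount(drives, archive_set_copies):
    """Divide the library-wide VSN budget across archive set copies (even split)."""
    budget = suggested_vsncount(drives)
    base, extra = divmod(budget, len(archive_set_copies))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(archive_set_copies)}

library_count = suggested_vsncount(10)
per_copy = split_vsncount(10, ["arset1.1", "arset1.2", "arset2.1", "arset2.2"])
```

For 10 drives the library-wide figure is 6. Split across four archive set copies, two copies get a VSN count of 2 and two get 1, so the total stays at 6.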

 

The default maximum data quantity (dataquantity in the table above) to be rearchived is absurdly low for modern tapes, and leaving this value at the default will guarantee that recycling almost never occurs. Most sites effectively disable this limit by setting it to a value greater than the capacity of a tape cartridge, for example, 9999 Gbytes.

 

For online disk archives, recycling is uncomplicated and undemanding, so it makes sense to remove expired files every time the recycler runs, even when the total disk utilization is low. One percent is a reasonable high water mark (recycle_hwm) for recycling an online disk archive, with 1% minimum gain (recycle_mingain) and a data quantity (recycle_dataquantity) of 9999G. None of the other recycling parameters need to be set for online disk archive recycling. For recycling on disks, Release 4.4 and later releases of the software let you substitute the recycle_minobs parameter, followed by a percentage of expired files, for the recycle_mingain parameter. When the percentage of expired files in a disk-archived tar file reaches the percentage specified after recycle_minobs, that tar file can be recycled.

 

A tape cannot be completely recycled without two recycling runs, so most sites need to run recycling two to three times a day. Some sites recycle as often as every two hours. If you have too many recycling runs, the archiver may not have time to finish rearchiving current archive copies before the recycler runs again. In that case you can end up with a backlog of files to be rearchived as VSNs are processed by the recycler faster than the archiver can rearchive the current copies. If you set up a cron job to run the recycler during a low usage time, you will have more latitude in choosing the number and timing of recycling runs.

 

The recycler.cmd File

 

The recycler.cmd file contains directives that allow you to configure the recycling parameters discussed above for each media library. Without such configuration, when you invoke the sam-recycler command, recycling will occur on all tape libraries using the default recycling parameter values.

 

These directives take the format:

 

library-family-set-name  parameter1  parameter2 …

 

Here library-family-set-name is the family set name of the robot as specified in the mcf file, and parameter1 parameter2 … is a space-delimited list of parameters that control recycling on the specified library. The parameters are listed in the table above.
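
As an illustration of the directive format, the following Python helper assembles a library directive from the parameters in the table. The helper itself is hypothetical, and the exact value syntax (for example, writing the data quantity as 9999G) is an assumption that should be checked against the recycler.cmd man page for your release.

```python
def library_directive(family_set, hwm=None, mingain=None, vsncount=None,
                      dataquantity=None, ignore=False):
    """Build one recycler.cmd library directive as a space-delimited string."""
    parts = [family_set]
    for flag, value in (("-hwm", hwm), ("-mingain", mingain),
                        ("-vsncount", vsncount), ("-dataquantity", dataquantity)):
        if value is not None:          # only emit parameters that were set
            parts += [flag, str(value)]
    if ignore:
        parts.append("-ignore")
    return " ".join(parts)

tuned_line = library_directive("L1000", hwm=50, mingain=80,
                               vsncount=6, dataquantity="9999G")
ignored_line = library_directive("L1000", ignore=True)
```

Here tuned_line is "L1000 -hwm 50 -mingain 80 -vsncount 6 -dataquantity 9999G" and ignored_line is "L1000 -ignore".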

 

If a library directive in the file recycler.cmd has the “-ignore” parameter appended to it, the recycler will generate a log file but will not actually perform any recycling. The “-ignore” parameter keeps the recycler from recycling a particular library; use this when first configuring the recycler for testing purposes or when you are recycling only by archive set copy.

 

The recycler.cmd file should be configured regardless of whether you perform recycling by tape library or by archive set copy because it allows you to set up a recycling log file. Recycling log files contain valuable testing and troubleshooting information about this complex process, and you should configure one if you intend to perform recycling. Only one log file can be configured in the recycler.cmd file. It will log all recycling activity, whether it is performed by archive set copy, or by tape library.

 

The default behavior of recycling will cause recycling by tape library, so if you are going to recycle only by archive set copy, use the “-ignore” parameter to prevent recycling by tape library. A recycler.cmd file for a site that recycles only by archive set copy might look like this:

 

logfile = /logs/recycler-log

L1000  -ignore

 

The preceding entries place the recycling log in a convenient directory, in this case called “/logs.” The value of the directive “logfile” is the absolute path to the log file. The directory must exist, but the log file itself will be created automatically. According to these entries, the tape library with the family set name “L1000” will not be recycled as a whole, so that recycling of tapes in the library can be controlled by archive set copy directives. Although the recycler is not a daemon-based process, configuration parameters are passed to it by sam-fsd at the time the recycler starts. After you edit the recycler.cmd file, you must therefore force the sam-fsd daemon to reread it with samd config, as you would for other configuration files.

 

 

Configuring Recycling in the archiver.cmd File

 

The recycler can recycle tapes by archive set copy rather than recycling by tape library. The recycler must recycle disk archives by archive set copy. Recycling by archive set copy must be configured in the archiver.cmd file, otherwise recycling will occur only by media library. If there is a conflict between the recycling configuration in recycler.cmd and archiver.cmd, the configuration in archiver.cmd has priority.

 

If you want to recycle tape by archive set copy, each VSN must be used by only one archive set copy. This is normally achieved by configuring the “-reserve set” parameter in archiver.cmd for all archive sets that will be recycled. VSNs containing archive set copies from multiple archive sets cannot be recycled by archive set copy.

 

The sam-nrecycler command

 

One major problem with recycling: “expired” archives are defined with respect to the current file system. If you recycle and relabel tapes, you may end up with worthless metadata dumps containing metadata for file data that are lost due to recycling. For sites that guarantee past data will be recoverable for a specified period of time, sam‑nrecycler automates the process of making sure that those archive copies are not destroyed by the recycling process. The command sam-nrecycler, available in Release 4.6 and later of the software, performs recycling, but bases its definition of an expired file on the current file system and also on past file systems as defined in specified metadata dumps. If you run sam-nrecycler specifying the metadata dump you did yesterday, the recycler will consider as current any files current on the present file system and any files current at the time the metadata dump was performed yesterday. If you run sam-nrecycler, VSNs containing archive copies current in past metadata dumps cannot be recycled, because those archive copies cannot be rearchived.

 

Obviously sam-nrecycler limits the proportion of VSNs that can be recycled. If enough past metadata dumps are included, recycling may halt completely, simply because so many VSNs are excluded from recycling. In addition, if you ever use sam-nrecycler, you can never run the sam-recycler command.
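
The effect of adding metadata dumps can be sketched in Python. This is a toy model based only on the description above, not the behavior of the real command: nrecycler_candidates is a hypothetical function, and each dump is represented as a set of (VSN, position) pairs for the archive copies it considers current.

```python
def nrecycler_candidates(all_vsns, dump_copies):
    """Return the VSNs that no metadata dump still depends on.

    all_vsns: names of the VSNs in the library.
    dump_copies: one set of (vsn, position) pairs per metadata dump.
    A VSN referenced by any dump is protected, because the copies it holds
    cannot be rearchived on behalf of a past file system state.
    """
    protected = {vsn for dump in dump_copies for vsn, _position in dump}
    return sorted(set(all_vsns) - protected)

vsns = ["T1", "T2", "T3"]
yesterday = {("T1", 10)}
last_week = {("T1", 10), ("T2", 4)}

no_dumps = nrecycler_candidates(vsns, [])
one_dump = nrecycler_candidates(vsns, [yesterday])
two_dumps = nrecycler_candidates(vsns, [yesterday, last_week])
```

With no dumps all three volumes are candidates, with one dump T1 is excluded, and with two dumps only T3 is left, which shows how the pool shrinks as more dumps are included.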

 

The file /etc/opt/SUNWsamfs/nrecycler.cmd is used to configure the sam-nrecycler process. It uses the same directives as does recycler.cmd, but additionally requires the path to a directory containing the metadata dumps that must be considered as part of the recycling process.

 
