volwatch - Monitors the Logical Storage Manager (LSM) for
failure events and performs hot sparing
/usr/sbin/volwatch [-m] [-s] [-o] [mail-addresses...]
-m  Runs volwatch with mail notification support to notify
    root (by default) or other specified users when a
    failure occurs. This option is enabled by default.

-s  Runs volwatch with hot-spare support.

-o  Specifies an argument to pass directly to volrecover if
    it is running and hot-spare support is enabled.
The volwatch command monitors LSM, waiting for exception
events to occur. When an exception event occurs, the volwatch
command uses mailx(1) to send mail to:

  -  The root account.

  -  The user accounts specified when you use the rcmgr
     command to set the VOLWATCH_USERS variable in the
     /etc/rc.config.common file.

  -  Any user accounts that you specify on the command line
     with the volwatch command.
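For example, to have failure mail sent to an additional
account, you might set the VOLWATCH_USERS variable with a
command such as the following, where admin1 is a
hypothetical account and the -c flag (directing the change
to /etc/rc.config.common) is an assumption; see rcmgr(8):

     # rcmgr -c set VOLWATCH_USERS "root admin1"

Alternatively, per the synopsis, mail addresses can be
given directly on the command line:

     # volwatch -m root admin1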
The volwatch command uses the volnotify command to wait
for events to occur. When an event occurs, there is a
15-second delay before the failure is analyzed and the message
is sent. This delay allows a group of related events
to be collected and reported in a single mail message. By
default, the volwatch command automatically starts when
the system boots.
You can enter the volwatch -s command to start the volwatch
command with hot-spare support. Hot-spare support:

  -  Detects LSM events resulting from the failure of a
     disk, plex, or RAID5 subdisk.

  -  Sends mail to the root account (and other specified
     accounts) notifying them of the failure and identifying
     the affected LSM objects.

  -  Determines which subdisks to relocate, finds space for
     those subdisks in the disk group, relocates the
     subdisks, and notifies the root account (and other
     specified accounts) of these actions and their success
     or failure.
When a partial disk failure occurs (that is, a
failure affecting only some subdisks on a disk),
redundant data on the failed portion of the disk is
relocated, and the existing volumes composed of the
unaffected portions of the disk remain accessible.
Note
Hot-sparing is performed only for redundant (mirrored or
RAID5) subdisks on a failed disk. Non-redundant subdisks
on a failed disk are not relocated, but you are notified
of the failure.
Only one volwatch daemon can be running on a system or
cluster node at any time.
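To verify that no other volwatch daemon is running before
starting one, a generic process listing suffices; for
example:

     # ps -e | grep volwatch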
Hot-sparing does not guarantee the same layout of data or
the same performance after relocation. You may want to
make some configuration changes after hot-sparing occurs.
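For example, to review the resulting layout before making
changes, you might display the disk group configuration
with volprint, where dg_name is a placeholder for the
affected disk group:

     # volprint -g dg_name -ht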
Mail Notification Support
The following is a sample mail notification when a failure
is detected:

Failures have been detected by the Logical Storage Manager:
failed disks:
medianame
...
failed plexes:
plexname
...
failed log plexes:
plexname
...
failing disks:
medianame
...
failed subdisks:
subdiskname
...
The Logical Storage Manager will attempt to find spare
disks, relocate failed subdisks and then recover the data
in the failed plexes.
The following describes the sections of the mail message:

  -  The medianame list under failed disks specifies disks
     that appear to have completely failed.

  -  The medianame list under failing disks indicates a
     partial disk failure or a disk that is in the process
     of failing. When a disk has failed completely, the same
     medianame appears under both failed disks and failing
     disks.

  -  The plexname list under failed plexes shows plexes that
     were detached due to I/O failures encountered while
     attempting to do I/O to the subdisks they contain.

  -  The plexname list under failed log plexes indicates
     RAID5 or dirty region log (DRL) plexes that have
     experienced failures.

  -  The subdiskname list under failed subdisks specifies
     subdisks in RAID5 volumes that were detached due to I/O
     errors.
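To corroborate the failures named in the message, you can
list disk status and object states. The exact output fields
vary by release, so treat the following as a sketch:

     # voldisk list
     # volprint -Aht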
Enabling Hot-Sparing
By default, hot-sparing is disabled. To enable hot-sparing,
enter the volwatch command with the -s option. For example:

     # volwatch -s
To use hot-spare support, you should configure a disk as a
spare, which identifies the disk as an available site for
relocating failed subdisks. Disks that are identified as
spares are not used for normal allocations unless you
explicitly specify otherwise. This ensures that there is a
pool of spare disk space available for relocating failed
subdisks and that this disk space is not consumed by normal
operations.
Spare disk space is the first space used to relocate
failed subdisks. However, if no spare disk space is
available or if the available spare disk space is not
suitable or sufficient, free disk space is used.
You must initialize a spare disk and place it in a disk
group as a spare before it can be used for replacement
purposes. If no disks are designated as spares when a
failure occurs, LSM automatically uses any available free
disk space in the disk group in which the failure occurs.
If there is not enough spare disk space, a combination of
spare disk space and free disk space is used.
When hot-sparing selects a disk for relocation, it preserves
the redundancy characteristics of the LSM object to
which the relocated subdisk belongs. For example, hot-sparing
ensures that subdisks from a failed plex are not
relocated to a disk containing a mirror of the failed
plex. If redundancy cannot be preserved using available
spare disks and/or free disk space, hot-sparing does not
take place. If relocation is not possible, mail is sent
indicating that no action was taken.
When hot-sparing takes place, the failed subdisk is
removed from the configuration database and LSM takes precautions
to ensure that the disk space used by the failed
subdisk is not recycled as free disk space.
Initializing and Removing Hot-Spare Disks
Although hot-sparing does not require you to designate
disks as spares, HP recommends that you initialize at
least one disk as a spare within each disk group; this
gives you control over which disks are used for relocation.
If no spare disks exist, LSM uses available free
disk space within the disk group. When free disk space is
used for relocation purposes, it is likely that there may
be performance degradation after the relocation.
Follow these guidelines when choosing a disk to configure
as a spare:

  -  The hot-spare feature works best if you specify at
     least one spare disk in each disk group containing
     mirrored or RAID5 volumes.

  -  If a given disk group spans multiple controllers and
     has more than one spare disk, set up the spare disks on
     different controllers (in case one of the controllers
     fails).

  -  For a mirrored volume, the disk group must have at
     least one disk that does not already contain one of the
     volume's mirrors. This disk should either be a spare
     disk with some available space or a regular disk with
     some free space.

  -  For a mirrored and striped volume, the disk group must
     have at least one disk that does not already contain
     one of the volume's mirrors or another subdisk in the
     striped plex. This disk should either be a spare disk
     with some available space or a regular disk with some
     free space.

  -  For a RAID5 volume, the disk group must have at least
     one disk that does not already contain the volume's
     RAID5 plex or one of its log plexes. This disk should
     either be a spare disk with some available space or a
     regular disk with some free space.

  -  If a mirrored volume has a DRL log subdisk as part of
     its data plex (that is, volprint does not list the plex
     length as LOGONLY), that plex cannot be relocated.
     Therefore, place log subdisks in plexes that contain no
     data (log plexes). By default, the volassist command
     creates log plexes, as shown in the example after this
     list.

  -  For mirroring the root disk, the rootdg disk group
     should contain an empty spare disk that satisfies the
     restrictions for mirroring the root disk.

  -  Although it is possible to build LSM objects on spare
     disks, it is preferable to use spare disks for
     hot-spare only.

  -  When relocating subdisks off a failed disk, LSM
     attempts to use a spare disk large enough to hold all
     data from the failed disk.
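As referenced in the guidelines above, the following is a
minimal sketch of placing a DRL log in a dedicated log
plex, assuming the volassist addlog operation and a
hypothetical volume vol01 in disk group dg_name:

     # volassist -g dg_name addlog vol01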
To initialize a disk that has no associated subdisks as a
spare, use the voldiskadd command and enter y at the
following prompt:

     Add disk as a spare disk for newdg? [y,n,q,?] (default: n) y
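For example, a session might begin with a command such as
the following, where dsk10 is a hypothetical disk device
name (see voldiskadm(8) for the menu-driven equivalent):

     # voldiskadd dsk10

The command then prompts for a disk group and, as shown
above, whether to add the disk as a spare.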
To initialize an existing LSM disk as a spare disk, enter:

     # voledit set spare=on medianame

For example, to initialize a disk called test03 as a spare
disk, enter:

     # voledit set spare=on test03

To remove a disk as a spare, enter:

     # voledit set spare=off medianame

For example, to make a disk called test03 available for
normal use, enter:

     # voledit set spare=off test03
Replacement Procedure
In the event of a disk failure, mail is sent, and, if
volwatch was started with hot-spare support (the -s
option), volwatch attempts to relocate any subdisks that
appear to have failed. This involves finding an appropriate
spare disk or free disk space in the same disk group as the
failed subdisk.
To determine which disk from among the eligible spare
disks to use, volwatch tries to use the disk that is closest
to the failed disk. Closeness depends on the controller,
target, and disk number of the failed
disk. For example, a disk on the same controller as the
failed disk is closer than a disk on a different controller;
a disk under the same target as the failed disk
is closer than one under a different target.
If no spare or free disk space is found, the following
mail message is sent explaining the disposition of volumes
on the failed disk:

Relocation was not successful for subdisks on disk dm_name
in volume v_name in disk group dg_name. No replacement was
made and the disk is still unusable.
The following volumes have storage on medianame:
volumename ...
These volumes are still usable, but the redundancy of
those volumes is reduced. Any RAID-5 volumes with storage
on the failed disk may become unusable in the face of further
failures.
If non-RAID5 volumes are made unusable due to the failure
of the disk, the following is included in the mail message:

The following volumes:
volumename ...
have data on medianame but have no other usable mirrors on
other disks. These volumes are now unusable and the data
on them is unavailable. These volumes must have their
data restored.
If RAID5 volumes are made unavailable due to the disk
failure, the following is included in the mail message:

The following RAID-5 volumes:
volumename ...
have storage on medianame and have experienced other failures.
These RAID-5 volumes are now unusable and data on
them is unavailable. These RAID-5 volumes must have their
data restored.
If spare disk space is found, LSM attempts to set up a
subdisk on the spare disk and use it to replace the failed
subdisk. If this is successful, the volrecover command
runs in the background to recover the data in volumes on
the failed disk.
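If you need to rerun the recovery manually, a sketch such
as the following could start it in the background, assuming
the -s (start volumes) and -b (background) options behave
as described in volrecover(8), with dg_name a placeholder
for the disk group:

     # volrecover -g dg_name -sb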
If the relocation fails, the following mail message is
sent:

Relocation was not successful for subdisks on disk dm_name
in volume v_name in disk group dg_name. No replacement was
made and the disk is still unusable.
error message
If any volumes (RAID5 or otherwise) are rendered unusable
due to the failure, the following is included in the mail
message:

The following volumes:
volumename ...
have data on dm_name but have no other usable mirrors on
other disks. These volumes are now unusable and the data
on them is unavailable. These volumes must have their data
restored.
If the relocation procedure completes successfully and
recovery is under way, the following mail message is sent:

Volume v_name Subdisk sd_name relocated to newsd_name, but
not yet recovered.
Once recovery has completed, a message is sent relaying
the outcome of the recovery procedure. If the recovery was
successful, the following is included in the mail message:

Recovery complete for volume v_name in disk group dg_name.

If the recovery was not successful, the following is
included in the mail message:

Failure recovering v_name in disk group dg_name.
See Also

mailx(1), rcmgr(8), voldiskadm(8), voledit(8), volintro(8),
volrecover(8), volrootmir(8)