OLAR_intro, olar_intro - Introduction to Online Addition
and Removal (OLAR) Management
Online addition and removal management is used to expand
capacity, upgrade components, and replace failed
components while the operating system, its services, and
applications continue to run. This functionality,
sometimes referred to as "hot-swap", increases system
uptime and availability during both scheduled and
unscheduled maintenance. Starting with Tru64 UNIX Version
5.1B, CPU OLAR is supported. Additional OLAR capabilities
are planned for subsequent releases of the operating
system.
OLAR management is integrated with the SysMan suite of
system management applications, which provides the ability
to manage all aspects of the system from a centralized
location.
You must be a privileged user to perform OLAR management
operations. Alternatively, you can grant selected users
or groups access to these operations using Division of
Privileges (DOP), as described below.
Note that only one administrator at a time can initiate
OLAR operations; other administrators are prevented from
initiating OLAR operations until the current operation
completes.
CPU OLAR Overview
Tru64 UNIX supports the ability to add, replace, and
remove individual CPU modules on supported AlphaServer
systems while the operating system and applications
continue to run. Newly inserted CPUs are automatically
recognized by the operating system, but will not start
scheduling and executing processes until the CPU module
is powered on and placed online through one of the
supported management applications described below.
Conversely, before a CPU can be physically removed from
the system, it must be placed offline and then powered
off. Processes queued for execution on a CPU that is to
be placed offline are simply migrated to the run queues
of other running (online) processors.
By default, a CPU that is placed offline remains offline
across reboot and system initialization until the CPU is
explicitly placed online. This behavior differs from the
default behavior of previous versions of Tru64 UNIX,
where a CPU that was placed offline would return to
service automatically after reboot or system restart.
Note that for backward compatibility, the psradm(8) and
offline(8) commands still provide the non-persistent
offline behavior; however, these commands are not
recommended for performing OLAR operations.
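For reference, a non-persistent offline and subsequent
online using psradm might look like the following
sketch. The -f (offline) and -n (online) flags, and the
use of psrinfo(1) to list processor states, are
assumptions here; verify them against the psradm(8) and
psrinfo(1) reference pages on your system.
# psrinfo
# psradm -f 2
# psradm -n 2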
On platforms supporting this functionality, any CPU can
participate in an OLAR operation, including the primary
CPU and I/O interrupt handling CPUs. These roles are
delegated to other running CPUs in the event that a
currently running primary or I/O interrupt handler needs
to be placed offline or removed.
Currently, the platforms that support CPU OLAR are the
AlphaServer GS160 and GS320 series systems. The GS80 does
not support the physical removal of CPU modules, due to
its cabinet packaging design.
Why Perform OLAR on CPUs
OLAR of CPUs may be performed for the following reasons:

  +  A system manager wants to provide additional
     computational capacity to the system without having
     to bring the system down. As an example, an
     AlphaServer GS320 with available CPU slots can have
     its CPU capacity expanded by adding CPU modules to
     the system while the operating system and
     applications continue to run.

  +  A system manager wants to upgrade specific system
     components to the latest model without having to
     bring the system down. As an example, a GS160 with
     earlier-model Alpha CPU modules can be upgraded to
     later-model CPUs with higher clock rates, while the
     operating system continues to run.

  +  A system component is indicating a high incidence of
     correctable errors, and the system manager wants to
     perform a proactive replacement of the failing
     component before it results in a hard failure. As an
     example, the Component Indictment facility
     (described below) has indicated excessive
     correctable errors in a CPU module and has therefore
     recommended its replacement. Once the CPU module has
     been placed offline and powered off, either through
     the Automatic Deallocation Facility (also described
     below) or through manual intervention, the CPU
     module can be replaced while the operating system
     continues to run.
Cautions Before Performing OLAR on CPUs
Before performing an OLAR operation, be aware of the
following cautions:

  +  When offlining or removing one or more CPUs,
     processes scheduled to run on the affected CPUs will
     be scheduled to execute on other running CPUs, thus
     redistributing the processing capacity among the
     remaining CPUs. In general, this results in a system
     performance degradation, proportional to the number
     of CPUs taken out of service and the current system
     load, for the duration of the OLAR operation.
     Multithreaded applications that are written to take
     advantage of known CPU concurrency can expect
     significant performance degradation for the duration
     of the OLAR operation.

  +  The OLAR management utilities do not presently
     operate with processor sets. Processor sets are
     groups of processors that are dedicated for use by
     selected processes (see processor_sets(4)). If a
     process has been specifically bound to run on a
     processor set (see runon(1) and
     assign_pid_to_pset(3)), and an OLAR operation is
     attempted on the last running CPU in the processor
     set, the OLAR utilities will not notify you that you
     are effectively shutting down the entire processor
     set. Offlining the last CPU in a processor set
     causes all processes bound to that processor set to
     suspend until the processor set again has at least
     one running CPU. Therefore, use caution when
     performing CPU OLAR operations on systems that have
     been configured with processor sets.

  +  If a process has been specifically bound to execute
     on a CPU (see runon(1), bind_to_cpu(3), and
     bind_to_cpu_id(3) for more information), and an OLAR
     operation is attempted on that CPU, the OLAR
     utilities will notify you that processes are bound
     to the CPU before any operation is performed. You
     may choose to continue or cancel the OLAR operation.
     If you choose to continue, processes bound to the
     CPU will suspend their execution until the process
     is unbound or the CPU is placed back online. Note
     that choosing to offline a CPU that has bound
     processes may have detrimental consequences for the
     application, depending upon its characteristics.

  +  If a process has been specifically bound to execute
     on a Resource Affinity Domain (RAD) (see runon(1)
     and rad_bind_pid(3) for more information), and an
     OLAR operation is attempted on the last running CPU
     in the RAD, the OLAR utilities will notify you that
     processes are bound to the RAD and that the last CPU
     in the RAD has been requested to be placed offline.
     If you choose to continue, processes bound to the
     RAD will suspend their execution until the process
     is unbound or at least one CPU in the RAD is placed
     online. Note that choosing to offline the last CPU
     in a RAD with bound processes may have detrimental
     consequences for the application, depending upon its
     characteristics.

  +  If you are using program profiling utilities such as
     dcpi, kprofile, or uprofile, which are aware of the
     system's CPU configuration, unpredictable results
     may occur when performing OLAR operations. It is
     therefore recommended that these profiling utilities
     be disabled prior to performing an OLAR operation.
     Ensure that all processes related to these
     utilities, including any associated daemons, have
     been stopped before performing OLAR operations on
     system CPUs.
The device drivers used by these profiling utilities
are usually configured into the kernel dynamically,
so the tools can be disabled before each
OLAR operation with the following commands:
# sysconfig -u pfm
# sysconfig -u pcount
The appropriate driver can be re-enabled with one
of the following:
# sysconfig -c pfm
# sysconfig -c pcount
Automatic deallocation of CPUs, enabled through the
Automatic Deallocation Facility, should be disabled
whenever the pfm or pcount device drivers are
configured into the kernel; conversely, these
drivers should not be configured while automatic
deallocation is enabled. Refer to the documentation
and reference pages for these utilities for
additional information.
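To confirm whether either driver is currently
configured before an OLAR operation, the subsystem
state can be queried, assuming the standard -s
state-query option of sysconfig(8):
# sysconfig -s pfm
# sysconfig -s pcount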
General Procedures for Online Addition and Removal of CPUs
Caution
Pay attention to the system safety notes as outlined in
the GS80/160/320 Service Manual.
Removing a CPU Module
   To perform an online removal of a CPU module, follow
   these steps using your preferred management
   application, as described in the section "Tools for
   Managing OLAR":

    1.  Place the CPU offline. The operating system
        stops scheduling and executing tasks on this
        CPU.

    2.  Using your preferred OLAR management
        application, make note of the quad building
        block (QBB) number where this CPU is inserted.
        This is the "hard" (or physical) QBB number,
        which does not change if the system is
        partitioned.

    3.  Power the CPU module off. The LED on the CPU
        module illuminates yellow, indicating that the
        CPU module is unpowered and safe to remove.

    4.  Physically remove the CPU module. Note that the
        operating system automatically recognizes that
        the CPU module has been physically removed;
        there is no need to perform a scan operation to
        update the hardware configuration.
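   For example, the offline and power-off steps can be
   performed with the hwmgr commands described in the
   section "Tools for Managing OLAR"; CPU2 is an example
   device name taken from that section:
   # hwmgr -offline -name CPU2
   # hwmgr -power off -name CPU2
   Once the LED illuminates yellow, the module can be
   physically removed.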
Adding a CPU Module
   To perform an online addition of a CPU module, follow
   these steps using your preferred management
   application, as described in the section "Tools for
   Managing OLAR":

    1.  Select an available CPU slot in one of the
        configured quad building blocks (QBBs). If there
        are available slots in several QBBs, it is
        typically best to distribute the CPUs equally
        among the configured QBBs.

    2.  Insert the CPU module into the CPU slot,
        ensuring that you align the color-coded decal on
        the CPU module with the color-coded decal on the
        CPU slot. The LED on the CPU module illuminates
        yellow, indicating that the CPU module is
        unpowered. Note that the CPU is automatically
        recognized by the operating system even though
        it is unpowered; there is no need to perform a
        scan operation for the operating system to
        identify the CPU module.

    3.  Power the CPU module on. The CPU module
        undergoes a short self-test (7-10 seconds),
        after which the LED illuminates green,
        indicating that the module is powered on and has
        passed its self-test.

    4.  Place the CPU online. Once the CPU is online,
        the operating system automatically begins to
        schedule and execute tasks on this CPU.
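   For example, the power-on and online steps can be
   performed with the following hwmgr commands, again
   using the example device name CPU2 from the section
   "Tools for Managing OLAR":
   # hwmgr -power on -name CPU2
   # hwmgr -online -name CPU2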
Tools for Managing OLAR
When it is necessary to perform an OLAR operation, use the
following tools which are provided as part of the SysMan
suite of system management utilities.
Manage CPUs
"Manage CPUs" is a task-oriented application that
provides the following functions:

  +  Change the state of a CPU to online or offline

  +  Power on or power off a CPU

  +  Determine the status of each inserted CPU
The "Manage CPUs" application can be run equivalently from
an X Windows display, a terminal with curses capability,
or locally on a PC (as described below), thus providing a
great deal of flexibility when performing OLAR operations.
Note
You must be a privileged user to run the "Manage CPUs"
application. Non-root users may also run the "Manage
CPUs" application if they are assigned the "HardwareManagement"
privilege. To assign a user the "HardwareManagement"
privilege, issue the following command to launch the
"Configure DOP" application:
# sysman dopconfig [-display <hostname>]
Please refer to the dop(8) reference page and the on-line
help in the 'dopconfig' application for further information.
Additionally, the Manage CPUs application provides
online help capabilities that describe the operation of
this application.
The "Manage CPUs" application can be invoked using one of
the following methods:
SysMan Menu

   At the command prompt in a terminal window, enter
   the following command:

   [Note that the "DISPLAY" shell environment variable
   must be set, or the "-display" command line option
   must be used, in order to launch the X Windows
   version of SysMan Menu. If there is no indication of
   which graphics display to use, or if invoking from
   a character cell terminal, then the curses version
   of SysMan Menu will be launched.]

   # sysman [-display <hostname>]

   Highlight the "Hardware" entry and press "Select".
   Then highlight the "Manage CPUs" entry and press
   "Select".

SysMan command line accelerator
To launch the Manage CPUs application directly via
the command prompt in a terminal window, enter the
following command:
# sysman hw_manage_cpus [-display hostname]
   [Note that the "DISPLAY" shell environment variable
   must be set, or the "-display" command line option
   must be used, in order to launch the X Windows
   version of Manage CPUs. If there is no indication of
   which graphics display to use, or if invoking from
   a character cell terminal, then the curses version
   of Manage CPUs will be launched.]

System Management Station
   To launch the Manage CPUs application from the
   System Management Station, do the following:

    1.  At the command prompt in a terminal window, on
        a system that supports graphical display, enter
        the following command:

        # sysman -station [-display hostname]

        When the System Management Station launches,
        two separate windows appear. One window is the
        Status Monitor view; the other is the Hardware
        view, which provides a graphical depiction of
        the hardware connected to your system.

    2.  Select the Hardware view window.

    3.  Select the CPU for an OLAR operation by
        left-clicking once with the mouse.

    4.  Select Tools from the menu bar, or right-click
        once with the mouse. A list of menu options
        will appear.

    5.  Select Daily Administration from the list.

    6.  Select the Manage CPUs application.

Manage CPUs from a PC or Web Browser
You can also perform OLAR management from your PC
desktop or from within a web browser. Specifically,
you can run Manage CPUs via the System Management
Station client installed on your desktop, or by
launching the System Management Station client from
within a browser pointed to the Tru64 UNIX System
Management home page. For a detailed description of
options and requirements, visit the Tru64 UNIX System
Management home page, available from any Tru64
UNIX system running V5.1A (or higher), at the following
URL:
http://hostname:2301/SysMan_Home_Page
where "hostname" is the name of a Tru64 UNIX Version
5.1B, (or higher) system.
hwmgr Command Line Interface (CLI)
In addition to its set of generic hardware management
capabilities, the hwmgr(8) command line interface incorporates
the same level of OLAR management functionality as
the Manage CPUs application. You must be root to run the
hwmgr command; this command does not currently operate
with DOP.
The following describes the OLAR-specific commands
supported by hwmgr. To obtain general help on the use of
hwmgr, issue the command:
# hwmgr -help
To obtain help on a specific option, issue the command:
# hwmgr -help "option"
where option is the name of the option you want help on.
To obtain the status and state information of all
hardware components the operating system is aware of,
issue the following command:

# hwmgr -status comp

                         STATUS   ACCESS   HEALTH     INDICT
 HWID:  HOSTNAME         SUMMARY  STATE    STATE      LEVEL   NAME
 -------------------------------------------------------------------
    3:  wild-one                  online   available          dmapi
   49:  wild-one                  online   available          dsk2
   50:  wild-one                  online   available          dsk3
   51:  wild-one                  online   available          dsk4
   52:  wild-one                  online   available          dsk5
   56:  wild-one                  online   available          Compaq AlphaServer GS160 6/731
   57:  wild-one                  online   available          CPU0
   58:  wild-one                  online   available          CPU2
   59:  wild-one                  online   available          CPU4
   60:  wild-one                  online   available          CPU6
or, to obtain status on an individual component, use
the hardware ID (HWID) of the component and issue the
command:

# hwmgr -status comp -id 58

                         STATUS   ACCESS   HEALTH     INDICT
 HWID:  HOSTNAME         SUMMARY  STATE    STATE      LEVEL   NAME
 -------------------------------------------------------------------
   58:  wild-one                  online   available          CPU2
To see the complete list of options for "-status",
issue the command:

# hwmgr -help status

To view a hierarchical listing of all hardware
components the operating system is aware of, issue the
command:

# hwmgr -view hier
HWID: hardware hierarchy (!)warning (X)critical
(-)inactive (see -status)
-------------------------------------------------------------------------
1: platform Compaq AlphaServer GS160 6/731
9: bus wfqbb0
10: connection wfqbb0slot0
11: bus wfiop0
12: connection wfiop0slot0
13: bus pci0
14: connection pci0slot1
o
o
o
57: cpu qbb-0 CPU0
58: cpu qbb-0 CPU2
This example shows that CPU0 and CPU2 are children
of bus name "wfqbb0", and that their physical location
is (hard) qbb-0. Note that hard QBB numbers
do not change as the system partitioning changes.
To quickly identify which QBB a CPU is associated
with, issue the command:
# hwmgr -view hier -id 58

 HWID:  hardware hierarchy
 -----------------------------------------------------
   58:  cpu qbb-0 CPU2

To offline a CPU that is currently in the online state,
issue the command:
# hwmgr -offline -id 58
or
# hwmgr -offline -name CPU2
Note that device names are case sensitive. In this
example, CPU2 must be upper case. To verify the new
status of CPU2, issue the command:
# hwmgr -status comp -id 58

                         STATUS    ACCESS   HEALTH     INDICT
 HWID:  HOSTNAME         SUMMARY   STATE    STATE      LEVEL   NAME
 -------------------------------------------------------------------
   58:  wild-one         critical  offline  available          CPU2
Note that the offline state will be saved across
future reboots of the operating system, including
power cycling the system. If you want the component
to return to the online state the next time the
operating system is booted, use the "-nosave"
switch.
# hwmgr -offline -nosave -id 58
or
# hwmgr -offline -nosave -name CPU2
Once again, to verify the status of CPU2, issue the
command:
# hwmgr -status comp -id 58

                         STATUS    ACCESS           HEALTH     INDICT
 HWID:  HOSTNAME         SUMMARY   STATE            STATE      LEVEL  NAME
 --------------------------------------------------------------------------
   58:  wild-one         critical  offline(nosave)  available         CPU2
To power off a CPU that is currently in the offline
state, issue the command:
# hwmgr -power off -id 58
or
# hwmgr -power off -name CPU2
Note that a component must be in the offline state
before power can be removed using hwmgr. Once power has
been removed from a component, it is safe to remove that
component from the system.

To power on a CPU that is currently powered off, issue
the command:

# hwmgr -power on -id 58

or

# hwmgr -power on -name CPU2

To place a CPU online so that the operating system can
start scheduling processes to run on that CPU, issue the
command:
# hwmgr -online -id 58
or
# hwmgr -online -name CPU2
Refer to the hwmgr(8) reference page for additional information
on the use of hwmgr.
Component Indictment Overview
Component indictment is a proactive notification from a
fault analysis utility, indicating that a component is
experiencing a high incidence of correctable errors and
should therefore be serviced or replaced. Component
indictment involves the process of analyzing specific
failure patterns from error log entries, either immediately
or over a given time interval, and recommending a
component's removal. The fault analysis utility signals
the running operating system that a given component is
suspect, causing the operating system to distribute this
information via an EVM indictment event such that interested
applications, including the System Management Station,
Insight Manager, and the Automatic Deallocation
Facility can update their state information, as well as
take appropriate action if so configured (see the discussion
on Automatic Deallocation Facility below).
It is possible for more than one component to be indicted
simultaneously if the exact source of error cannot be pinpointed.
In these cases, the most likely suspect will be
indicted with a `high` probability. The next likely suspect
will be indicted with a `medium` probability, and the
least likely suspect will be indicted with a `low` probability.
When this situation arises, the indictment events
can be tied together by examining the "report_handle"
variable within the indictment events. Indictment events
for the same error will contain the same "report_handle"
value.
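For example, to examine logged indictment events in
detail, including the "report_handle" variable, a
command like the following can be used. The -d option
shown here is assumed to be evmshow's detailed-output
option; verify it against the evmshow(1) reference page
on your system.
# evmget -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -d | more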
The indicted state of a component will persist across
reboot and system initialization if no action is taken to
remedy the suspect component, such as an online repair
operation. Once an indictment has occurred for a given
component, another indictment event will not be generated
for that component unless the utility has determined,
through additional analysis, that the original indictment
probability should be updated. In this case, the component
will be re-indicted with the new probability. Once
the indicted component has been serviced, it is necessary
to manually clear the indicted component state with the
following hwmgr command:
# hwmgr -unindict -id <hwid>

where <hwid> is the hardware ID (HWID) of the component.

Allowing the operator to manually clear the indicted
state ensures positive identification of when a replaced
component is operating properly.
All component indictment EVM events have an event prefix
of sys.unix.hw.state_change.indicted. You may view the
complete list of all possible component indictment events
that may be posted, including a description of each event,
by issuing the command:
# evmwatch -i -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name" -x | more
You may view the list of indictment events that have
occurred by issuing the command:
# evmget -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name"
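To monitor new indictment events as they are posted,
rather than retrieving previously logged events, the
same filter can be used with evmwatch (without the -i
option), as a variation on the commands above:
# evmwatch -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name"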
CPU modules and memory pages are currently supported for
component indictment.
Compaq Analyze, included as part of the Web-Based Enterprise
Services (WEBES) 4.0 product (or higher), is the
fault analysis utility that supports component indictment
on a Tru64 UNIX (V5.1A or higher) system. The WEBES product
is included as part of the Tru64 UNIX operating system
distribution, and must be installed after installation of
the base operating system. Please refer to the Compaq Analyze
documentation, distributed with the WEBES product,
for a list of AlphaServer platforms that support the component
indictment feature.
Automatic Deallocation Facility Overview
The Automatic Deallocation Facility provides the ability
to automatically take an indicted component out of
service, allowing the system to heal itself and thereby
improving its reliability and availability. The
Automatic Deallocation Facility currently supports the
ability to stop using CPUs and memory pages that have
been indicted.
The behavior of the Automatic Deallocation Facility can
be tailored by the user on both single and clustered
systems through the text-based OLAR Policy Configuration
files. When operating in a clustered environment,
automatic deallocation policy applies to all members of
a cluster by default; this policy is specified in the
cluster-wide file /etc/olar.config.common. However,
individual cluster-wide policy variables can be
overridden using the member-specific configuration file
/etc/olar.config.
The OLAR Policy Configuration files contain
configuration variables that control specific behaviors
of the Automatic Deallocation Facility, such as whether
automatic deallocation is enabled and during what times
of day it may occur. Additionally, you can specify a
user-supplied script or executable that acts as the
gating factor in deciding whether an automatic
deallocation operation should proceed.
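As an illustration only, a member-specific override file
might look like the following sketch. These variable
names are hypothetical and are not taken from
olar.config(4); consult that reference page for the
actual variables and syntax.

     # /etc/olar.config - member-specific overrides
     # (hypothetical variable names; see olar.config(4))
     AUTO_DEALLOC_ENABLED=1          # hypothetical: allow automatic deallocation
     AUTO_DEALLOC_START_TIME=22:00   # hypothetical: start of allowed window
     AUTO_DEALLOC_END_TIME=06:00     # hypothetical: end of allowed window
     # hypothetical: gating script consulted before each deallocation
     AUTO_DEALLOC_GATING_SCRIPT=/usr/local/sbin/olar_gate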
Automatic deallocation is supported for those platforms
that support the component indictment feature, as
described in the Component Indictment Overview section
above.
Refer to the olar.config(4) reference page for additional
information about the OLAR Policy Configuration files.
Commands: sysman(8), sysman_menu(8), sysman_station(8),
hwmgr(8), codconfig(8), dop(8)
Files: olar.config.common(4)
System Administration
Configuring and Managing Systems for Increased Availability
Guide