OLAR_intro(5)



NAME

       OLAR_intro,  olar_intro  - Introduction to Online Addition
       and Removal (OLAR) Management

DESCRIPTION

   Introduction to Online Addition and Removal (OLAR) Management
        Online addition and removal management is used to expand
        capacity, upgrade components, and replace failed
        components while the operating system, its services, and
        applications continue to run. This functionality,
        sometimes referred to as "hot-swap", provides the
        benefits of increased system uptime and availability
        during both scheduled and unscheduled maintenance.
        Starting with Tru64 UNIX Version 5.1B, CPU OLAR is
        supported. Additional OLAR capabilities are planned for
        subsequent releases of the operating system.

       OLAR  management  is  integrated  with the SysMan suite of
       system management applications, which provides the ability
       to  manage  all  aspects  of the system from a centralized
       location.

        You must be a privileged user to perform OLAR management
        operations. Alternatively, you can grant access to
        selected authorized users or groups using Division of
        Privileges (DOP), as described below.

        Note that only one administrator at a time can initiate
        OLAR operations; other administrators are prevented from
        initiating OLAR operations until the current operation
        completes.

   CPU OLAR Overview
        Tru64 UNIX supports the ability to add, replace, and
        remove individual CPU modules on supported AlphaServer
        systems while the operating system and applications
        continue to run. Newly inserted CPUs are automatically
        recognized by the operating system, but will not start
        scheduling and executing processes until the CPU module
        is powered on and placed online through any of the
        supported management applications described below.
        Conversely, before a CPU can be physically removed from
        the system, it must be placed offline and then powered
        off. Processes queued for execution on a CPU that is to
        be placed offline are simply migrated to the run-queues
        of other running (online) processors.

        By default, a CPU that is placed offline remains offline
        across reboot and system initialization, until the CPU is
        explicitly placed online. This behavior differs from the
        default behavior of previous versions of Tru64 UNIX,
        where a CPU that was placed offline would return to
        service automatically after a reboot or system restart.
        Note that for backward compatibility, the psradm(8) and
        offline(8) commands still provide the non-persistent
        offline behavior. While the psradm(8) and offline(8)
        commands are still provided, they are not recommended for
        performing OLAR operations.

       On platforms supporting this functionality,  any  CPU  can
       participate  in  an  OLAR operation, including the primary
       CPU and/or I/O interrupt handling CPUs. These  roles  will
       be  delegated  to  other  running CPUs in the event that a
       currently running primary or I/O interrupt  handler  needs
       to be placed offline or removed.

        Currently, the platforms that support CPU OLAR are the
        AlphaServer GS160 and GS320 series systems. The GS80 does
        not support the physical removal of CPU modules, due to
        its cabinet packaging design.

   Why Perform OLAR on CPUs
        OLAR of CPUs may be performed for the following reasons:

        o  A system manager wants to provide additional
           computational capacity without having to bring the
           system down. For example, an AlphaServer GS320 with
           available CPU slots can have its CPU capacity expanded
           by adding CPU modules while the operating system and
           applications continue to run.

        o  A system manager wants to upgrade specific system
           components to the latest model without having to bring
           the system down. For example, a GS160 with earlier
           model Alpha CPU modules can be upgraded to later model
           CPUs with higher clock rates, while the operating
           system continues to run.

        o  A system component is showing a high incidence of
           correctable errors, and the system manager wants to
           proactively replace the failing component before it
           results in a hard failure. For example, the Component
           Indictment facility (described below) has indicated
           excessive correctable errors in a CPU module and has
           therefore recommended its replacement. Once the CPU
           module has been placed offline and powered off, either
           through the Automatic Deallocation Facility (also
           described below) or through manual intervention, the
           CPU module can be replaced while the operating system
           continues to run.

   Cautions Before Performing OLAR on CPUs
        Before performing an OLAR operation, be aware of the
        following cautions:

        o  When offlining or removing one or more CPUs, processes
           scheduled to run on the affected CPUs will be
           scheduled to execute on other running CPUs,
           redistributing the processing capacity among the
           remaining CPUs. In general, this will degrade system
           performance for the duration of the OLAR operation, in
           proportion to the number of CPUs taken out of service
           and the current system load. Multi-threaded
           applications written to take advantage of known CPU
           concurrency can expect significant performance
           degradation during the OLAR operation.

        o  The OLAR management utilities do not presently operate
           with processor sets. Processor sets are groups of
           processors that are dedicated for use by selected
           processes (see processor_sets(4)). If a process has
           been specifically bound to run on a processor set (see
           runon(1), assign_pid_to_pset(3)), and an OLAR
           operation is attempted on the last running CPU in the
           processor set, the OLAR utilities will not notify you
           that you are effectively shutting down the entire
           processor set. Offlining the last CPU in a processor
           set will cause all processes bound to that processor
           set to suspend until the processor set has at least
           one running CPU. Therefore, use caution when
           performing CPU OLAR operations on systems that have
           been configured with processor sets.

        o  If a process has been specifically bound to execute on
           a CPU (see runon(1), bind_to_cpu(3), and
           bind_to_cpu_id(3) for more information), and an OLAR
           operation is attempted on that CPU, the OLAR utilities
           will notify you that processes are bound to the CPU
           before any operation is performed. You may choose to
           continue or cancel the OLAR operation. If you
           continue, processes bound to the CPU will suspend
           their execution until the process is unbound or the
           CPU is placed back online. Note that choosing to
           offline a CPU that has bound processes may have
           detrimental consequences for the application,
           depending upon its characteristics.

        o  If a process has been specifically bound to execute on
           a Resource Affinity Domain (RAD) (see runon(1) and
           rad_bind_pid(3) for more information), and an OLAR
           operation is attempted on the last running CPU in the
           RAD, the OLAR utilities will notify you that processes
           are bound to the RAD and that the last CPU in the RAD
           has been requested to be placed offline. If you
           continue, processes bound to the RAD will suspend
           their execution until the process is unbound or at
           least one CPU in the RAD is placed online. Note that
           choosing to offline the last CPU in a RAD with bound
           processes may have detrimental consequences for the
           application, depending upon its characteristics.

        o  If you are using program profiling utilities such as
           dcpi, kprofile, or uprofile, which are aware of the
           system's CPU configuration, unpredictable results may
           occur when performing OLAR operations. It is therefore
           recommended that these profiling utilities be disabled
           prior to performing an OLAR operation. Ensure that all
           processes, including any associated daemons, related
           to these utilities have been stopped before performing
           OLAR operations on system CPUs.

               The device drivers used by these profiling
               utilities are usually configured into the kernel
               dynamically, so the tools can be disabled before
               each OLAR operation with the following commands:

               # sysconfig -u pfm

               # sysconfig -u pcount

               The appropriate driver can be re-enabled with one
               of the following:

               # sysconfig -c pfm

               # sysconfig -c pcount

               The automatic deallocation of CPUs, enabled
               through the Automatic Deallocation Facility,
               should be disabled whenever the pfm or pcount
               device drivers are configured into the kernel, and
               vice versa. Refer to the documentation and
               reference pages for these utilities for additional
               information.
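               As a quick sanity check before starting an OLAR
               operation, you can look for leftover profiler
               processes. This is a sketch only; the process
               names in the pattern are illustrative and should
               be adjusted for your site:

```shell
# Look for profiler-related processes still running before an OLAR
# operation; the name list is illustrative, not exhaustive.
if ps -e | grep -E 'dcpi|kprofile|uprofile' | grep -v grep
then
    echo "profiler processes still running -- stop them first"
else
    echo "no profiler processes found"
fi
```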


   General Procedures for Online Addition and Removal of CPUs
                                Caution

       Pay attention to the system safety notes  as  outlined  in
       the GS80/160/320 Service Manual.

        Removing a CPU Module

               To perform an online removal of a CPU module,
               follow these steps using your preferred management
               application, described in the section "Tools for
               Managing OLAR":

               1.  Off-line the CPU. The operating system will
                   stop scheduling and executing tasks on this
                   CPU.

               2.  Using your preferred OLAR management
                   application, make note of the quad building
                   block (QBB) number where this CPU is inserted.
                   This is the "hard" (or physical) QBB number,
                   and does not change if the system is
                   partitioned.

               3.  Power the CPU module off. The LED on the CPU
                   module will illuminate yellow, indicating that
                   the CPU module is un-powered and safe to
                   remove.

               4.  Physically remove the CPU module. Note that
                   the operating system automatically recognizes
                   that the CPU module has been physically
                   removed. There is no need to perform a scan
                   operation to update the hardware
                   configuration.

        Adding a CPU Module

               To perform an online addition of a CPU module,
               follow these steps using your preferred management
               application, described in the section "Tools for
               Managing OLAR":

               1.  Select an available CPU slot in one of the
                   configured quad building blocks (QBBs). If
                   there are available slots in several QBBs, it
                   is typically best to distribute the CPUs
                   equally among the configured QBBs.

               2.  Insert the CPU module into the CPU slot.
                   Ensure that you align the color-coded decal on
                   the CPU module with the color-coded decal on
                   the CPU slot. The LED on the CPU module will
                   illuminate yellow, indicating that the CPU
                   module is un-powered. Note that the CPU will
                   be automatically recognized by the operating
                   system, even though it is un-powered. There is
                   no need to perform a scan operation for the
                   operating system to identify the CPU module.

               3.  Power the CPU module on. The CPU module will
                   undergo a short self-test (7-10 seconds),
                   after which the LED will illuminate green,
                   indicating the module is powered on and has
                   passed its self-test.

               4.  On-line the CPU. Once the CPU is on-line, the
                   operating system will automatically begin to
                   schedule and execute tasks on this CPU.
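               The offline/online steps of these procedures map
               onto the hwmgr command line interface described
               under "Tools for Managing OLAR" below. The sketch
               that follows prints the command sequences rather
               than executing them, so they can be reviewed
               first; the HWID 58 is an example value, to be
               looked up on your system with "hwmgr -view hier":

```shell
# Sketch of the hwmgr command sequences behind the procedures above.
# HWID 58 is an example value; look up the real id on your system
# with "hwmgr -view hier". Commands are printed, not executed.
olar_sequence() {
    case "$1" in
        remove) printf 'hwmgr -offline -id %s\nhwmgr -power off -id %s\n' "$2" "$2" ;;
        add)    printf 'hwmgr -power on -id %s\nhwmgr -online -id %s\n' "$2" "$2" ;;
    esac
}
olar_sequence remove 58
olar_sequence add 58
```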


   Tools for Managing OLAR
        When it is necessary to perform an OLAR operation, use
        the following tools, which are provided as part of the
        SysMan suite of system management utilities.

   Manage CPUs
        "Manage CPUs" is a task-oriented application that
        provides the following functions:

        o  Change the state of a CPU to online or offline

        o  Power on or power off a CPU

        o  Determine the status of each inserted CPU

        The "Manage CPUs" application can be run equivalently
        from an X Windows display, a terminal with curses
        capability, or locally on a PC (as described below), thus
        providing a great deal of flexibility when performing
        OLAR operations.

                                  Note

        You must be a privileged user to run the "Manage CPUs"
        application. Non-root users may also run the "Manage
        CPUs" application if they are assigned the
        "HardwareManagement" privilege. To assign a user the
        "HardwareManagement" privilege, issue the following
        command to launch the "Configure DOP" application:

        # sysman dopconfig [-display <hostname>]

        Refer to the dop(8) reference page and the on-line help
        in the 'dopconfig' application for further information.
        Additionally, the Manage CPUs application provides
        on-line help that describes its operation.

        The "Manage CPUs" application can be invoked using one of
        the following methods:

        SysMan Menu

               At the command prompt in a terminal window, enter
               the following command:

               # sysman [-display <hostname>]

               [Note that the "DISPLAY" shell environment
               variable must be set, or the "-display" command
               line option must be used, in order to launch the X
               Windows version of SysMan Menu. If there is no
               indication of which graphics display to use, or if
               invoking from a character-cell terminal, the
               curses version of SysMan Menu will be launched.]

               Highlight the "Hardware" entry and press
               "Select". Then highlight the "Manage CPUs" entry
               and press "Select".

        SysMan command line accelerator

               To launch the Manage CPUs application directly
               from the command prompt in a terminal window,
               enter the following command:

               # sysman hw_manage_cpus [-display hostname]

               [Note that the "DISPLAY" shell environment
               variable must be set, or the "-display" command
               line option must be used, in order to launch the X
               Windows version of Manage CPUs. If there is no
               indication of which graphics display to use, or if
               invoking from a character-cell terminal, the
               curses version of Manage CPUs will be launched.]

        System Management Station

               To launch the Manage CPUs application from the
               System Management Station, do the following:

               1.  At the command prompt in a terminal window on
                   a system that supports graphical display,
                   enter the following command:

                   # sysman -station [-display hostname]

               2.  When the System Management Station launches,
                   two separate windows will appear. One window
                   is the Status Monitor view, and the other is
                   the Hardware view, providing a graphical
                   depiction of the hardware connected to your
                   system. Select the Hardware view window.

               3.  Select the CPU for an OLAR operation by
                   left-clicking once with the mouse.

               4.  Select Tools from the menu bar, or right-click
                   once with the mouse. A list of menu options
                   will appear. Select Daily Administration from
                   the list.

               5.  Select the Manage CPUs application.

        Manage CPUs from a PC or Web Browser

               You can also perform OLAR management from your PC
               desktop or from within a web browser.
               Specifically, you can run Manage CPUs via the
               System Management Station client installed on your
               desktop, or by launching the System Management
               Station client from within a browser pointed to
               the Tru64 UNIX System Management home page. For a
               detailed description of options and requirements,
               visit the Tru64 UNIX System Management home page,
               available from any Tru64 UNIX system running V5.1A
               (or higher), at the following URL:

               http://hostname:2301/SysMan_Home_Page

               where "hostname" is the name of a Tru64 UNIX
               Version 5.1B (or higher) system.

   hwmgr Command Line Interface (CLI)
        In addition to its set of generic hardware management
        capabilities, the hwmgr(8) command line interface
        incorporates the same level of OLAR management
        functionality as the Manage CPUs application. You must be
        root to run the hwmgr command; this command does not
        currently operate with DOP.

        The following describes the OLAR-specific commands
        supported by hwmgr. To obtain general help on the use of
        hwmgr, issue the command:

        # hwmgr -help

        To obtain help on a specific option, issue the command:

        # hwmgr -help "option"

        where option is the name of the option you want help on.

        To obtain the status and state information of all
        hardware components the operating system is aware of,
        issue the following command:

        # hwmgr -status comp
                         STATUS   ACCESS   HEALTH     INDICT
         HWID: HOSTNAME  SUMMARY  STATE    STATE      LEVEL  NAME
        -------------------------------------------------------------
            3: wild-one           online   available         dmapi
           49: wild-one           online   available         dsk2
           50: wild-one           online   available         dsk3
           51: wild-one           online   available         dsk4
           52: wild-one           online   available         dsk5
           56: wild-one           online   available         Compaq AlphaServer GS160 6/731
           57: wild-one           online   available         CPU0
           58: wild-one           online   available         CPU2
           59: wild-one           online   available         CPU4
           60: wild-one           online   available         CPU6



               or, to obtain status on an individual component,
               use the hardware id (HWID) of the component and
               issue the command:

               # hwmgr -status comp -id 58

                                STATUS   ACCESS   HEALTH     INDICT
                HWID: HOSTNAME  SUMMARY  STATE    STATE      LEVEL  NAME
               -------------------------------------------------------------
                  58: wild-one           online   available         CPU2
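               If you need to script against these listings, for
               example to pick out just the CPU rows, a small awk
               filter works. This is a sketch; the sample text
               below stands in for real hwmgr output:

```shell
# Extract the HWID and name of each CPU from "hwmgr -status comp"
# output. The sample variable stands in for the real command output.
hwmgr_output='   57: wild-one  online  available  CPU0
   58: wild-one  online  available  CPU2
   59: wild-one  online  available  CPU4'
printf '%s\n' "$hwmgr_output" | awk '$NF ~ /^CPU[0-9]+$/ {print $1 " " $NF}'
```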



               To see the complete list of options for
               "-status", issue the command:

               # hwmgr -help status

               To view a hierarchical listing of all hardware
               components the operating system is aware of,
               issue the command:

               # hwmgr -view hier

                HWID: hardware hierarchy  (!)warning (X)critical
                      (-)inactive (see -status)
                -------------------------------------------------------------------------
                  1: platform Compaq AlphaServer GS160 6/731
                  9:   bus wfqbb0
                  10:     connection wfqbb0slot0
                  11:       bus wfiop0
                  12:         connection wfiop0slot0
                  13:           bus pci0
                  14:             connection pci0slot1

                   o
                   o
                   o

                  57:     cpu qbb-0 CPU0
                  58:     cpu qbb-0 CPU2


               This example shows that CPU0 and CPU2 are children
               of the bus named "wfqbb0", and that their physical
               location is (hard) qbb-0. Note that hard QBB
               numbers do not change as the system partitioning
               changes.

               To quickly identify which QBB a CPU is associated
               with, issue the command:

               # hwmgr -view hier -id 58

                HWID: hardware hierarchy
               -----------------------------------------------------
                  58: cpu qbb-0 CPU2

               To offline a CPU that is currently in the online
               state, issue the command:

               # hwmgr -offline -id 58

               or

               # hwmgr -offline -name CPU2

               Note that device names are case sensitive. In this
               example, CPU2 must be upper case. To verify the
               new status of CPU2, issue the command:

              # hwmgr -status comp -id 58

                                STATUS    ACCESS   HEALTH     INDICT
                HWID: HOSTNAME  SUMMARY   STATE    STATE      LEVEL  NAME
               --------------------------------------------------------------
                  58: wild-one  critical  offline  available         CPU2



               Note that the offline state will be saved across
               future reboots of the operating system, including
               power cycling of the system. If you want the
               component to return to the online state the next
               time the operating system is booted, use the
               "-nosave" switch:

              # hwmgr -offline -nosave -id 58

              or

              # hwmgr -offline -nosave -name CPU2

              Once again, to verify the status of CPU2, issue the
              command:

              # hwmgr -status comp -id 58

                                STATUS    ACCESS           HEALTH     INDICT
                HWID: HOSTNAME  SUMMARY   STATE            STATE      LEVEL  NAME
               ----------------------------------------------------------------------
                  58: wild-one  critical  offline(nosave)  available         CPU2


               To power off a CPU that is currently in the
               offline state, issue the command:

               # hwmgr -power off -id 58

               or

               # hwmgr -power off -name CPU2

               Note that a component must be in the offline state
               before power can be removed using hwmgr. Once
               power has been removed from a component, it is
               safe to remove that component from the system.

               To power on a CPU that is currently powered off,
               issue the command:


              # hwmgr -power on -id 58

              or

               # hwmgr -power on -name CPU2

               To place a CPU online so that the operating system
               can start scheduling processes to run on that CPU,
               issue the command:

              # hwmgr -online -id 58

              or

              # hwmgr -online -name CPU2


        Refer to the hwmgr(8) reference page for additional
        information on the use of hwmgr.

   Component Indictment Overview
        Component indictment is a proactive notification from a
        fault analysis utility, indicating that a component is
        experiencing a high incidence of correctable errors and
        therefore should be serviced or replaced. Component
        indictment involves analyzing specific failure patterns
        from error log entries, either immediately or over a
        given time interval, and recommending a component's
        removal. The fault analysis utility signals the running
        operating system that a given component is suspect,
        causing the operating system to distribute this
        information via an EVM indictment event so that
        interested applications, including the System Management
        Station, Insight Manager, and the Automatic Deallocation
        Facility, can update their state information, as well as
        take appropriate action if so configured (see the
        discussion of the Automatic Deallocation Facility below).

        It is possible for more than one component to be indicted
        simultaneously if the exact source of an error cannot be
        pinpointed. In these cases, the most likely suspect will
        be indicted with a `high` probability, the next likely
        suspect with a `medium` probability, and the least likely
        suspect with a `low` probability. When this situation
        arises, the indictment events can be tied together by
        examining the "report_handle" variable within the
        indictment events. Indictment events for the same error
        will contain the same "report_handle" value.
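        As a sketch of tying related events together, the
        pipeline below counts events per report_handle; the
        sample lines stand in for name/report_handle pairs pulled
        from the events (how you extract those pairs from the
        evmshow output is site-specific and assumed here):

```shell
# Group indictment events that share a report_handle. The sample
# lines stand in for "event-name report_handle" pairs extracted
# from sys.unix.hw.state_change.indicted events (format assumed).
events='sys.unix.hw.state_change.indicted.cpu 42
sys.unix.hw.state_change.indicted.cpu 42
sys.unix.hw.state_change.indicted.memory 77'
# Print each report_handle with its event count; a count greater
# than 1 means the indictments describe the same underlying error.
printf '%s\n' "$events" | awk '{count[$2]++} END {for (h in count) print h, count[h]}' | sort
```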

        The indicted state of a component will persist across
        reboot and system initialization if no action is taken to
        remedy the suspect component, such as an online repair
        operation. Once an indictment has occurred for a given
        component, another indictment event will not be generated
        for that component unless the utility has determined,
        through additional analysis, that the original indictment
        probability should be updated. In this case, the
        component will be re-indicted with the new probability.
        Once the indicted component has been serviced, it is
        necessary to manually clear the indicted component state
        with the following hwmgr command:

        # hwmgr -unindict -id <hwid>

        where <hwid> is the hardware id (HWID) of the component.

        Having the operator manually clear the indicted state
        ensures positive identification that a replaced component
        is operating properly.

        All component indictment EVM events have an event prefix
        of sys.unix.hw.state_change.indicted. You may view the
        complete list of all possible component indictment events
        that may be posted, including a description of each
        event, by issuing the command:

        # evmwatch -i -f '[name sys.unix.hw.state_change.indicted]' | \
          evmshow -t "@name" -x | more

        You may view the list of indictment events that have
        occurred by issuing the command:

        # evmget -f '[name sys.unix.hw.state_change.indicted]' | \
          evmshow -t "@name"

       CPU  modules  and memory pages are currently supported for
       component indictment.

        Compaq Analyze, included as part of the Web-Based
        Enterprise Services (WEBES) 4.0 product (or higher), is
        the fault analysis utility that supports component
        indictment on a Tru64 UNIX (V5.1A or higher) system. The
        WEBES product is included as part of the Tru64 UNIX
        operating system distribution, and must be installed
        after installation of the base operating system. Refer to
        the Compaq Analyze documentation, distributed with the
        WEBES product, for a list of AlphaServer platforms that
        support the component indictment feature.

   Automatic Deallocation Facility Overview
        The Automatic Deallocation Facility provides the ability
        to automatically take an indicted component out of
        service, allowing the system to heal itself and improving
        its reliability and availability. The Automatic
        Deallocation Facility currently supports the ability to
        stop using CPUs and memory pages that have been indicted.

        The behavior of the Automatic Deallocation Facility can
        be tailored on both single and clustered systems through
        the text-based OLAR Policy Configuration files. When
        operating in a clustered environment, automatic
        deallocation policy applies to all members in a cluster
        by default. This is specified through the cluster-wide
        file /etc/olar.config.common. However, individual
        cluster-wide policy variables can be overridden using the
        member-specific configuration file /etc/olar.config.

        The OLAR Policy Configuration files contain configuration
        variables that control specific behaviors of the
        Automatic Deallocation Facility, such as whether
        automatic deallocation is enabled and at what times of
        day it may occur. You can also specify a user-supplied
        script or executable that acts as the gating factor for
        whether an automatic deallocation operation should occur.
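        As an illustration only, a member-specific
        /etc/olar.config fragment might look something like the
        following. The variable names here are hypothetical; the
        actual names and syntax are defined in olar.config(4):

```
# Hypothetical /etc/olar.config fragment -- variable names are
# illustrative only; see olar.config(4) for the real ones.
AUTO_DEALLOC_ENABLED=yes        # allow automatic deallocation
AUTO_DEALLOC_START_TIME=01:00   # window during which automatic
AUTO_DEALLOC_END_TIME=05:00     #   deallocation may occur
AUTO_DEALLOC_GATING_SCRIPT=/usr/local/sbin/olar_gate.sh
```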

       Automatic deallocation is supported  for  those  platforms
       that   support   the   component  indictment  feature,  as
       described in the  Component  Indictment  Overview  section
       above.

       Refer  to the olar.config(4) reference page for additional
       information about the OLAR Policy Configuration files.






SEE ALSO

        Commands: sysman(8), sysman_menu(8), sysman_station(8),
        hwmgr(8), codconfig(8), dop(8)

        Files: olar.config(4), olar.config.common(4)

        System Administration

        Configuring and Managing Systems for Increased
        Availability Guide



                                                    OLAR_intro(5)