
prof_intro(1)


NAME

       prof_intro - Introduction to application profilers, profiling,
       optimization, and performance analysis

DESCRIPTION

       Tru64 UNIX supports four approaches to performance
       improvement:

       Automatic and profile-directed optimizations. For example:

           pixie -update a.out data/*
           cc -non_shared -O3 -spike -feedback a.out *.c

       Manual design and code optimizations. For example:

           hiprof -all -display program data/* | more
           hiprof -flat -all -display program data/* | more
           uprofile -heavy program data/* | more

       Minimizing system-resource usage. For example:

           third -display program data/* | more

       Verifying significance of test cases. For example:

           pixie -testcoverage program data/* | more


       One approach might be enough, but more might be beneficial
       if no single approach addresses all aspects of a program's
       performance. The following sections describe each approach
       and the tools provided by Tru64 UNIX to support them.

AUTOMATIC AND PROFILE-DIRECTED OPTIMIZATIONS

   Techniques
       Automatic and profile-directed optimizations are the simplest
       approaches to improving application performance.

       Some  degree  of automatic optimization can be achieved by
       using the compiler's and  linker's  optimization  options.
       These  can  help  in the generation of minimal instruction
       sequences that make best use of the CPU  architecture  and
       cache memory.

       However, the compiler and linker can improve their
       optimizations if they are given information on which
       instructions are executed most often when the program is run
       with its normal input data and environment. While the default
       optimizations give improved performance for most common
       situations, the optimizers can do even better if they can
       tune the program in favor of the heavily used instruction
       sequences as determined from a sample run.

       Tru64 UNIX helps you provide the optimizers with this
       information on processing hot-spots by allowing a profiler's
       results to be fed back into a recompilation. This customized,
       profile-directed optimization can be used in conjunction with
       automatic optimization.

   Tools and Examples
       The cc compiler command's automatic  optimization  options
       are  selected  with  -O, -fast, -inline, -spike, and other
       related options. See cc(1) for details and Chapter  10  of
       the  Programmer's  Guide  for more information on the many
       options and tradeoffs available.

       For example, this command selects a high degree of
       optimization in both the compiler and the linker:

           cc -non_shared -O3 -spike *.c

       The pixie profiler provides profile information  that  the
       cc  command's -feedback and -spike options can use to tune
       the generated instruction sequences to the demands  placed
       on the program by particular sets of input data.

       The steps, shown in the following example, consist of (1)
       preparing the program for profile-directed optimization,
       (2) creating an instrumented version of the program and
       running it to collect profiling statistics, and (3) feeding
       that information back to the compiler and linker to help
       them optimize the executable code:

           rm -f program
           cc -non_shared -feedback program -o program -O3 *.c
           pixie -update program
           cc -non_shared -feedback program -o program -O3 -spike *.c

       To apply profile-directed optimizations to shared libraries,
       generate profile data with an exerciser program, and store
       it in the shared library prior to recompiling with that
       feedback. For example:

           rm -f libexample.so
           cc -feedback libexample.so -o libexample.so -shared -O3 lib*.c
           cc -o exerciser exerciser.c -L. -lexample
           pixie -L. -incobj libexample.so -run exerciser
           prof -pixie -update libexample.so exerciser.Counts
           cc -spike -feedback libexample.so -o libexample.so -shared -O3 lib*.c

MANUAL DESIGN AND CODE OPTIMIZATIONS

   Techniques
       The effectiveness of the automatic optimizations described
       previously is limited by the efficiency of the algorithms
       that the program uses. A program's performance can be
       further improved by manually optimizing its algorithms and
       data structures. Such optimizations may include reducing
       complexity from N-squared to log-N, avoiding copying of
       data, and reducing the amount of data used. It may also
       extend to tuning the algorithm to the architecture of the
       particular machine it will be run on. For example, a program
       can process large arrays in small blocks such that each block
       remains in the data cache for all processing, instead of
       the whole array being read into the cache for each
       processing phase.
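
       The block-wise processing described above can be sketched in C
       as follows. This is only an illustration: the two phases (scale
       and shift), the function names, and the block size passed by the
       caller are assumptions, not part of Tru64 UNIX. A suitable block
       size depends on the data cache of the machine in question.

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical processing phases applied to every element. */
static void scale(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

static void shift(double *a, size_t n, double d)
{
    for (size_t i = 0; i < n; i++)
        a[i] += d;
}

/* Unblocked: each phase sweeps the whole array, so an array larger than
 * the data cache is read from memory once per phase. */
void process_whole(double *a, size_t n, double k, double d)
{
    scale(a, n, k);
    shift(a, n, d);
}

/* Blocked: run every phase on one block before moving to the next, so a
 * block that fits in the cache only needs to be fetched once. */
void process_blocked(double *a, size_t n, size_t block, double k, double d)
{
    assert(block > 0);
    for (size_t start = 0; start < n; start += block) {
        size_t len = (n - start < block) ? (n - start) : block;
        scale(a + start, len, k);
        shift(a + start, len, d);
    }
}
```

       Both functions apply the same operations to each element in the
       same order, so they produce identical results; only the memory
       access pattern differs. Whether the blocked form is actually
       faster is something a profile should confirm.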

       Tru64 UNIX supports manual optimization with its profiling
       tools, which identify the parts of the application that
       use the most CPU resources (CPU cycles, cache misses, and so
       on). By evaluating different profiles of a program, you can
       identify which parts of the program use the most CPU
       resources, and you can then redesign or recode the algorithms
       in those parts to use fewer resources. The profiles also make
       this exercise more cost-effective by helping you to focus on
       the most demanding code instead of the least demanding
       code.
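
       As one illustration of recoding a hot spot, the following C
       sketch replaces a quadratic duplicate check with a sort followed
       by a linear scan, which reduces the work from about N-squared
       to about N log N comparisons. The function names are
       hypothetical, and the sorted version deliberately copies the
       input so that the caller's data is left unchanged; sorting in
       place would avoid that copy at the cost of reordering the data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Quadratic version: compares every pair, about n*n/2 comparisons. */
int has_duplicate_quadratic(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (v[i] == v[j])
                return 1;
    return 0;
}

/* Three-way comparison for qsort that cannot overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort a private copy, then compare neighbours.
 * Returns 1 or 0, or -1 if the copy cannot be allocated. */
int has_duplicate_sorted(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n < 2)
        return 0;

    int *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_int);

    int found = 0;
    for (size_t i = 1; i < n && !found; i++)
        found = (tmp[i - 1] == tmp[i]);

    free(tmp);
    return found;
}
```

       For small inputs the quadratic version may well be faster, which
       is why such a change is best made where a profile shows the code
       to be demanding.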

   Tools and Examples
       A call-graph profile shows how much CPU time  is  used  by
       each  procedure,  and how much is used by all of the other
       procedures that it calls. This can show  which  phases  or
       subsystems  in a program spend most of the total CPU time,
       which can help in gaining a general understanding  of  the
       program's performance.

       The hiprof profiler instruments the program and records a
       call graph while the instrumented program executes. The
       hiprof profiler does not require that the program be
       compiled in any particular way, but the names of local (for
       example, static) procedures will be hidden if the cc
       command's default -g0 option was used, and procedures will be
       hidden if they are inlined. For example:

           cc -g1 -O2 -o program *.c
           hiprof -all -display program data/* | more

       By default, hiprof uses a low-frequency sampling technique.
       It can profile all of the code executed by the program,
       including all selected libraries, though its call graph
       excludes procedures in threads-related system libraries. It
       can also provide detailed profiles at the level of source
       lines or machine instructions.

       For  non-threaded programs, hiprof can alternatively count
       the number of machine cycles  used  or  page  faults  that
       occur  during  program  execution. In these modes, the CPU
       time or page-faults count reported  for  the  instrumented
       routines  includes  that  for  the uninstrumented routines
       that they call. This can summarize the  costs  and  reduce
       the  run-time  overhead,  but  note that the machine-cycle
       counter wraps if no instrumented procedure  is  called  at
       least every few seconds.
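
       The effect of a wrapping cycle counter can be illustrated with
       a short C sketch. It assumes a free-running 32-bit counter,
       which is an assumption made for illustration and is not stated
       in this reference page; the function names and any clock rate
       passed in are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two readings of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so the result is still correct
 * if the counter wrapped once between the readings. If it wrapped more
 * than once, whole wraps are silently lost. */
uint32_t cycles_between(uint32_t earlier, uint32_t later)
{
    return (uint32_t)(later - earlier);
}

/* Seconds until a 32-bit cycle counter wraps at the given clock rate. */
double wrap_period_seconds(double clock_hz)
{
    assert(clock_hz > 0.0);
    return 4294967296.0 / clock_hz;   /* 2^32 cycles */
}
```

       Under these assumptions, a 500 MHz clock gives a wrap period of
       roughly 8.6 seconds, so any interval between readings that is
       longer than that would be under-counted.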

       The cc compiler's -pg option uses the same sampling
       technique as hiprof. This technique is supported in a very
       similar way on different vendors' UNIX systems. For example:

           cc -g1 -O2 -pg -o program *.c
           ./program data/*
           gprof program gmon.out | more

       However, hiprof may be preferred because the -pg option
       has some disadvantages:

       The program needs to be specially compiled with the -pg
       option.

       Only a few of the archive libraries that are provided with
       the operating system were compiled to generate a call-graph
       profile.

       Only the executable is profiled. Shared libraries are not.

       The optional dxprof command provides a  graphical  display
       of various call-graph profiles.


       A good performance-improvement strategy may start with a
       procedure-level profile of the whole program (perhaps with
       a call graph too, to give the big picture), but it will
       often progress to detailed profiling of individual source
       lines and instructions.

       The uprofile profiler uses a sampling technique to generate
       a profile of the CPU time or events such as cache misses
       associated with each procedure, source line, or instruction.
       The sampling frequency depends on the processor type and the
       statistic being sampled, but for CPU time it is on the order
       of a millisecond. The profiler achieves this without
       modifying the target program at all by using hardware
       counters that are built into the Alpha CPU. Running the
       uprofile command with no arguments yields a list of all the
       kinds of events that a particular machine can profile,
       depending on the nature of its architecture. The default is
       to profile machine cycles, resulting in a CPU-time profile.
       The following example shows how to display a profile of the
       source lines that experienced the top 90% of data cache
       misses on an EV56 Alpha:

           cc -g1 -O2 -o program *.c
           uprofile -h -q 90cum% dcacheldmisses program data/* | more

       This  technique  has  the  advantage  of very low run-time
       overhead. Also, the detailed information it can provide on
       the  costs  of executing individual instructions or source
       lines is essential in identifying exactly which  operation
       in a procedure is slowing down the program.

       The disadvantages of uprofile are that only executables
       can be profiled, the results can be skewed unless all
       processors have the same cycle speed, only one program can be
       profiled with the hardware counters at one time, threads
       cannot be profiled individually, and the Alpha EV6
       architecture's execution of instructions out of sequence can
       significantly reduce the accuracy of fine-grained profiles.


       If hiprof's -flat option is used, its default sampling
       technique can provide the same fine-grained profiles (CPU
       time only) and low intrusiveness as uprofile. Also, it is
       accurate even with mixed processor cycle speeds, and it
       can profile all of a program's shared libraries as well as
       its individual threads. For example:

           hiprof -flat -h -all program data/* | more

       The cc compiler's -p option uses the same low-frequency
       sampling technique as hiprof. It is common to many UNIX
       systems, and (on Tru64 UNIX) it is able to profile all the
       shared libraries used by a program. The program needs to
       be relinked with the -p option, but it does not need to be
       recompiled from source, so long as the original compilation
       used an acceptable debug level, such as the -g1 compiler
       option. For example, to profile individual instructions of
       a program:

           cc -p -o program *.o
           setenv PROFFLAGS '-all -stride 1'
           ./program data/*
           prof -all -asm -quit 5% program mon.out | more

       The pixie tool can also profile source lines and
       instructions (including shared libraries), but note that when
       it displays counts of "Cycles", it is actually reporting
       counts of instructions executed, not machine cycles. For
       example:

           cc -g1 -O2 -o program *.c
           pixie -all -lines -quit 20 program data/* | more

       The  optional  dxprof command provides a graphical display
       of profiles collected by either pixie or the cc  command's
       -p option.

MINIMIZING SYSTEM RESOURCE USAGE

   Techniques
       The preceding techniques can improve an application's use
       of just the CPU. Further performance improvements can be
       made by improving the efficiency with which the application
       uses the other components of the computer system: heap
       memory, disk files, network connections, and so on.

       As with CPU profiling, the first phase of a resource usage
       improvement process is to monitor how much memory, data
       I/O and disk space, elapsed time, and so on, is used. Then
       the throughput of the computer can be increased or tuned
       in ways that help the program, or the program's design can
       be tuned to make better use of the computer resources that
       are available. For example:

       Reduce the size of the data files that the program reads
       and writes.

       Use memory-mapped files instead of regular I/O.

       Allocate memory incrementally on demand instead of
       allocating at start-up the maximum that could be required.

       Fix heap leaks, and do not leave allocated memory unused.

       See the System Configuration and Tuning manual for a broader
       discussion of analyzing and tuning a Tru64 UNIX system.
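
       Allocating memory incrementally on demand might look like the
       following C sketch of a growable array. The structure, the
       function names, the initial capacity of 16, and the doubling
       policy are all assumptions chosen for illustration; a different
       growth policy may suit a particular program better.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of samples. Nothing is allocated until
 * the first value is stored. */
struct samples {
    double *data;
    size_t  used;
    size_t  capacity;
};

void samples_init(struct samples *s)
{
    assert(s != NULL);
    s->data = NULL;
    s->used = 0;
    s->capacity = 0;
}

/* Append one value, growing the storage only when it is full.
 * Returns 0 on success, or -1 if memory cannot be obtained, in which
 * case the existing contents are left intact. */
int samples_push(struct samples *s, double value)
{
    assert(s != NULL);
    if (s->used == s->capacity) {
        size_t new_cap = s->capacity ? s->capacity * 2 : 16;
        double *p = realloc(s->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        s->data = p;
        s->capacity = new_cap;
    }
    s->data[s->used++] = value;
    return 0;
}

/* Release the storage so that it is not reported as a leak. */
void samples_free(struct samples *s)
{
    assert(s != NULL);
    free(s->data);
    samples_init(s);
}
```

       A program that usually stores a few dozen samples then uses a
       few hundred bytes of heap, rather than the space for the largest
       data set it could ever be given.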






   Tools and Examples
       The Tru64 UNIX base system commands ps u, swapon  -s,  and
       vmstat 3 can show the currently active processes' usage of
       system resources such as CPU time,  physical  and  virtual
       memory, swap space, page faults, and so on.

       The optional pview command provides a graphical display of
       similar information for the  processes  that  comprise  an
       application.

       The  time  commands  provided by the Tru64 UNIX system and
       command shells provide an easy way to  measure  the  total
       elapsed  time  and  CPU time for a program and its descendants.


       The collect tool is an optional, low-overhead system
       performance monitor.

       Many  other  related  commands are described in the System
       Configuration and Tuning manual.


       The third command reports heap memory leaks in a program,
       by instrumenting it with the Third Degree memory-usage
       checker, running it, and displaying a log of leaks
       detected at program exit. For example:

           third -display program data/* | more
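
       The kind of defect that a heap leak report points to can be
       illustrated with the following C sketch. The functions are
       hypothetical and are not taken from any Tru64 UNIX source; the
       first version loses two heap blocks on every call, and the
       second shows one way to correct it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated lowercase copy of s, or NULL on failure.
 * The caller owns the returned block and must free it. */
static char *lower_copy(const char *s)
{
    size_t n = strlen(s);
    char *p = malloc(n + 1);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i <= n; i++)   /* copies the terminator too */
        p[i] = (char)tolower((unsigned char)s[i]);
    return p;
}

/* Leaky version: neither copy is ever freed, so each call leaks. */
int same_word_leaky(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    char *la = lower_copy(a);
    char *lb = lower_copy(b);
    if (la == NULL || lb == NULL)
        return 0;
    return strcmp(la, lb) == 0;
}

/* Fixed version: both copies are released on every path.
 * free(NULL) is a no-op, so no extra checks are needed. */
int same_word(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    char *la = lower_copy(a);
    char *lb = lower_copy(b);
    int same = (la != NULL && lb != NULL && strcmp(la, lb) == 0);
    free(la);
    free(lb);
    return same;
}
```

       Both versions return the same answers, which is why such leaks
       tend to go unnoticed until a memory-usage checker or a growing
       heap draws attention to them.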

       If you are interested only in leaks occurring during the
       normal operation of the program, not during startup or
       shutdown, you can specify additional places to check for
       previously unreported leaks. For example, the pre-shutdown
       leak report will give this information:

           third -display -after startup -before shutdown program data/* | more

       Third Degree can also detect various kinds of bugs that
       may be affecting the correctness or performance of a
       program. See the Programmer's Guide for further details on
       debugging and leak-detection.

       The optional dxheap command provides a  graphical  display
       of Third Degree's heap and bug reports.

       The optional mview command provides a graphical analysis
       of heap usage over time. This view of a program's heap can
       clearly show the presence (if not the cause) of significant
       leaks or other undesirable trends such as wasted memory.

VERIFYING SIGNIFICANCE OF TEST CASES

   Techniques
       Most of the preceding profiling techniques are effective
       only if you profile and optimize or tune the parts of the
       program that are executed in the scenarios whose
       performance is important. Careful selection of the data used
       for the profiled test runs is often sufficient, but you may
       want a quantitative analysis of which code was and was not
       executed in a given set of tests.
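
       The idea of recording which code a set of tests reaches can be
       illustrated with hand-written counters, as in the following C
       sketch. This is not how the profilers are implemented; they
       instrument the program automatically, whereas here the
       counters, the enum, and the function names are hypothetical and
       inserted by hand.

```c
#include <assert.h>
#include <stddef.h>

/* One counter per branch of interest. */
enum branch { BR_NEGATIVE, BR_ZERO, BR_POSITIVE, BR_COUNT };

static unsigned long hits[BR_COUNT];

/* Function under test, with each branch counted as it executes. */
int sign_of(int n)
{
    if (n < 0) {
        hits[BR_NEGATIVE]++;
        return -1;
    }
    if (n == 0) {
        hits[BR_ZERO]++;
        return 0;
    }
    hits[BR_POSITIVE]++;
    return 1;
}

/* How many times a given branch has executed so far. */
unsigned long branch_hits(int b)
{
    assert(b >= 0 && b < BR_COUNT);
    return hits[b];
}

/* Number of branches that no test input has reached so far. */
size_t unexecuted_branches(void)
{
    size_t missing = 0;
    for (size_t i = 0; i < BR_COUNT; i++)
        if (hits[i] == 0)
            missing++;
    return missing;
}
```

       A test set containing only zero and positive inputs leaves the
       negative branch with a count of zero, which signals that any
       profile gathered from those tests says nothing about that code.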








   Tools and Examples
       The pixie command's -t[estcoverage] option reports lines
       of code that were not executed in a given test run. For
       example:

           pixie -t program data/* | more

       Conversely, pixie's -p[rocedure], -h[eavy], and -a[sm]
       options show which procedures, source lines, and
       instructions were executed.

       If multiple test runs are needed to build up a typical
       scenario, the prof command can be run separately on a set
       of profile data files:

           pixie -pids program
           ./program.pixie data1/*
           ./program.pixie data2/*
           prof -pixie -t program program.Counts.*
