AUTO_P(5)							     AUTO_P(5)
      AUTO_P - Automatic	Parallelization
      This man page discusses automatic parallelization and how to achieve it
     with the Silicon Graphics MIPSpro Automatic Parallelization Option. The
     following topics are covered:
     Automatic Parallelization and the MIPSpro Compilers
     Using the MIPSpro Automatic Parallelization Option
Automatic Parallelization and the MIPSpro Compilers
     Parallelization is the process of analyzing sequential programs for
     parallelism so that they may be restructured to run efficiently on
     multiprocessor systems. The goal is to minimize the overall computation
     time by distributing the computational work load among the	available
     processors. Parallelization can be	automatic or manual.
     During automatic parallelization, the MIPSpro Automatic Parallelization
     Option, hereafter called the auto-parallelizer, analyzes and structures
     the program with little or no intervention by the developer. The
     auto-parallelizer can automatically generate code that splits the
     processing of loops among multiple processors. The alternative is manual
     parallelization by	which the developer performs the parallelization using
     pragmas and other programming techniques. Manual parallelization is
     discussed in the mp(3f) and mp(3c)	man pages.
     Automatic parallelization begins with the determination of	data
     dependence	of variables and arrays	in loops. Data dependence can prevent
     loops from	being safely run in parallel because the final outcome of the
     computation may vary depending on the order the various processors	access
     the variables and arrays. Data dependence and other obstacles to
     parallelization are discussed in more detail in the next section.
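     As a minimal sketch of what data dependence means, the C functions below
     run the same loop bodies in forward and in reverse iteration order. The
     function names are hypothetical and are not part of any MIPSpro
     interface. The first pair carries a value from one iteration to the
     next, so the order changes the result; the second pair does not.

```c
#include <stddef.h>

/* Loop-carried dependence: iteration i reads a[i-1], which iteration i-1
   writes, so the result depends on the order of the iterations. */
void add_previous_forward(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* Same loop body, iterations visited from n-1 down to 1. */
void add_previous_backward(double *a, size_t n)
{
    for (size_t i = n; i-- > 1; )
        a[i] = a[i] + a[i - 1];
}

/* No dependence: each iteration touches only its own element, so any
   order (or any split across processors) gives the same result. */
void add_constant_forward(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + 7.0;
}

/* Same loop body, iterations visited from n-1 down to 0. */
void add_constant_backward(double *a, size_t n)
{
    for (size_t i = n; i-- > 0; )
        a[i] = a[i] + 7.0;
}
```

     Because the two orders disagree for the first loop body, splitting its
     iterations among processors is not safe without further work.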
     Once data dependences are resolved, a number of automatic parallelization
     strategies can be employed, including the following:
	  Loop interchange of nested loops
	  Scalar expansion
	  Loop distribution
	  Automatic synchronization of DOACROSS	loops
	  Intraprocedural array	privatization
     The 7.2 release of	the MIPSpro compilers marks a major revision of	the
     auto-parallelizer.	The new	release	incorporates automatic parallelization
     into the other optimizations performed by the MIPSpro compilers. Previous
     versions relied on	preprocessors to provide source-to-source conversions
     prior to compilation. This	change provides	several	benefits to
     developers:
     Automatic parallelization is integrated with optimizations	for single
     processors
     A set of options and pragmas consistent with the rest of the MIPSpro
     compilers
     Support for C++
     Better run-time and compile-time performance
The MIPSpro Automatic Parallelization Option
     Developers exploit parallelism in programs to provide better performance
     on multiprocessor systems. You do not need a multiprocessor system to use
     the automatic parallelizer. Although there is a slight performance loss
     when a single-processor system runs multiprocessed	code, you can use the
     auto-parallelizer on any Silicon Graphics system to create	and debug a
     program.
     The automatic parallelizer	is an optional software	product	that is	used
     as	an extension to	the following compilers:
	  MIPSpro Fortran 77
	  MIPSpro Fortran 90
	  MIPSpro C
	  MIPSpro C++
     It	is controlled by flags inserted	in the command lines that invoke the
     supported compilers.
Using the MIPSpro Automatic Parallelizer
     This section describes how to use the auto-parallelizer when you compile
     and run programs with the MIPSpro compilers.
   Using the MIPSpro Compilers to Parallelize Programs
     You invoke	the auto-parallelizer by using the -pfa	or -pca	flags on the
     command lines for the MIPSpro compilers. The syntax for compiling
     programs with the auto-parallelizer is as follows:
     For Fortran 77 and	Fortran	90 use -pfa:
     %f77 options -pfa [{ list | keep }] [ -mplist ] filename
     %f90 options -pfa [{ list | keep }] [ -mplist ] filename
     For C and C++ use -pca:
     %cc options -pca [{ list |	keep }]	[ -mplist ] filename
     %CC options -pca [{ list |	keep }]	[ -mplist ] filename
     where options are MIPSpro compiler command-line options. For details on
     the other options, see the documentation for your MIPSpro compiler.
     -pfa and -pca
	  Invoke the auto-parallelizer and enable any multiprocessing
	  directives.
     list
	  Produce an annotated listing of the parts of the program that	can
	  (and cannot) run in parallel on multiple processors. The listing
	  file has the suffix .l.
     keep
          Generate the listing file (.l), the transformed equivalent program
          (.m), and an output file for use with WorkShop Pro MPF (.anl).
     -mplist
	  Generate a transformed equivalent program in a .w2f.f	file for
	  Fortran 77 or	a .w2c.c file for C.
     filename
	  The name of the file containing the source code.
     To use the automatic parallelizer with Fortran programs, add the -pfa
     flag to both the compile line and the link line. For C or C++, add the
     -pca flag. If you link separately, you must also add -mp to the link
     line.
     Previous versions of the Power compilers had a large set of flags to
     control optimization. The 7.2 version uses the same set of options as the
     rest of the MIPSpro compilers. For example, in the older Power compilers
     the option -pfa,-r=0 turned off roundoff-changing transformations in the
     pfa preprocessor; in the new compiler, -OPT:roundoff=0 turns off
     roundoff-changing transformations in all phases of the compiler.
     The -pfa list option generates a .l file. The .l file lists the loops in
     your code, indicating which were parallelized and which were not. For
     each loop that was not parallelized, it explains why not. The -pfa keep
     option generates a .l file, a .m file, and a .anl file that is used by
     the WorkShop Pro MPF tool. The .m file is similar to the .w2f.f or .w2c.c
     file, except that it is annotated with information used by the WorkShop
     Pro MPF tool.
     The -mplist option, in addition to compiling your program, generates
     a .w2f.f file (for Fortran 77) or a .w2c.c file (for C) that represents
     the program after the automatic parallelization phase. These files
     should be readable and in most cases should be valid code suitable for
     recompilation. You can use the -mplist option to see what portions of
     your code were parallelized.
     For Fortran 90 and	C++, automatic parallelization happens after the
     source program has	been converted into an internal	representation.	It is
     not possible to regenerate	Fortran	90 or C++ after	parallelization.
     Examples:
     Example Analyzing a .l File
     %cat foo.f
     subroutine	sub(arr,n)
	   real*8 arr(n)
	   do i=1,n
	     arr(i) = arr(i) + arr(i-1)
	   end do
	   do i=1,n
	     arr(i) = arr(i) + 7.0
	     call foo(a)
	   end do
	   do i=1,n
	     arr(i) = arr(i) + 7.0
	   end do
	   end
     %f77 -O3 -n32 -mips4 -pfa list foo.f -c
     Here is the associated .l file:
     Parallelization Log for Subprogram sub_
     3: Not Parallel
              Array dependence from arr on line 4 to arr on line 4.
     6: Not Parallel
              Call foo on line 8.
     10: PARALLEL (Auto) __mpdo_sub_1
     Example Analyzing a .w2f.f	File
     %cat test.f
     subroutine	trivial(a)
       real a(10000)
       do i=1,10000
	 a(i) =	0.0
       end do
     end
     %f77 -O3 -n32 -mips4 -c -pfa -mplist test.f
     We get both an object file, test.o, and a test.w2f.f file that contains
     the following code:
     SUBROUTINE	trivial(a)
       IMPLICIT	NONE
       REAL*4 a(10000_8)
       INTEGER*4 i
     C$DOACROSS	local(i), shared(a)
       DO i = 1, 10000,	1
	 a(i) =	0.0
       END DO
       RETURN
     END ! trivial
Running Your Program
     Invoke your program as if it were a sequential program. The same binary
     can execute using different numbers of processors. By default, the
     run-time system selects how many processors to use based on the number of
     processors in the machine. You can use the environment variable
     NUM_THREADS to change the default to an explicit number of processors.
     In addition, you can have the number of processors vary dynamically from
     loop to loop, based on system load, by setting the environment variable
     MP_SUGNUMTHD. Refer to the mp(3f) and mp(3c) man pages for more details.
     Simply passing code through the auto-parallelizer does not always produce
     all the increased performance available. The following sections discuss
     strategies for making effective use of the product when the
     auto-parallelizer is not able to fully parallelize an application.
   Analyzing the Automatic Parallelizer's Results
     Running a program through the auto-parallelizer often results in
     excellent parallel speedups, but some programs cannot be parallelized
     well automatically. By understanding the listing files, you can
     sometimes identify small problems that prevent a loop from running
     safely in parallel. With a relatively small amount of work, you can
     often remove these data dependences and dramatically improve the
     program's performance.
     Hint:  When trying to find loops to run in parallel, focus your efforts
     on the areas of the code that use the bulk of the run time. Parallelizing
     a routine that uses only one percent of the program's run time cannot
     significantly improve overall performance. To determine where your code
     spends its time, take an execution profile of the program using the
     SpeedShop performance tools.
     The auto-parallelizer provides several mechanisms to analyze what it did.
     For Fortran 77 and C programs, the -mplist option shows the code after
     parallelization. Manual parallelism directives are inserted on loops that
     have been automatically parallelized. For details about these directives,
     refer to Chapters 5-7, "Fortran Enhancements for Multiprocessors," of the
     MIPSpro Fortran 77 Programmer's Guide, or Chapter 11, "Multiprocessing
     C/C++ Compiler Directives," of the C Language Reference Manual.
     The output code in the .w2f.f or .w2c.c file should be readable and
     understandable. You can use it to gain insight into what the
     auto-parallelizer did, and then use that insight to make changes to the
     original source program.
     Note that the auto-parallelizer is not a source-to-source preprocessor,
     but is instead an internal phase of the MIPSpro compilers. With a
     preprocessor system, a post-parallelization file would always be
     generated and fed into the regular compiler. This is not the case with
     the auto-parallelizer. Therefore, compiling a .w2f.f or .w2c.c file
     through a MIPSpro compiler does not generate code identical to that
     produced by compiling the original source through the MIPSpro
     auto-parallelizer. Often, however, the two are almost the same.
     The auto-parallelizer also provides a listing mechanism through the list
     and keep arguments to -pfa or -pca. Either argument causes the compiler
     to generate a .l file. The .l file lists the original loops in the
     program along with messages telling whether or not each loop was
     parallelized. For loops that were not parallelized, an explanation is
     given.
     Parallelization Failures With the Automatic Parallelizer
     This section discusses mistakes you can avoid and actions you can take to
     enhance the performance of the auto-parallelizer. The auto-parallelizer
     is not always able to parallelize programs effectively. This can be true
     for a number of reasons, some of which you can address. There are three
     broad categories of parallelization failure:
     The auto-parallelizer does	not detect that	a loop is safe to parallelize
     The auto-parallelizer chooses the wrong nested loop to make parallel
     The auto-parallelizer parallelizes	a loop that would run more efficiently
     sequentially
   Failure to Recognize Safe Loops
     We want the auto-parallelizer to recognize every loop that is safe to
     parallelize. A loop is not safe if there is data dependence, so the
     automatic parallelizer analyzes each loop in a sequential program to try
     to prove it is safe. If it cannot prove a loop is safe, it does not
     parallelize the loop. A loop that contains any of the constructs
     described in this section may not be proved safe. However, in many
     instances the loop can be proved safe after minor changes. You should
     review your program's .l file to see if any of these constructs occur in
     your code.
     Usually the failure to recognize a loop as safe is related to one or more
     of the following practices:
     Function Calls in Loops
     GO	TO Statements in Loops
     Complicated Array Subscripts
     Conditionally Assigned Temporary Variables in Loops
     Unanalyzable Pointer Usage	in C/C++
   Function Calls in Loops
     By	default, the auto-parallelizer does not	parallelize a loop that
     contains a	function call because the function in one iteration may	modify
     or	depend on data in other	iterations of the loop.	However, a couple of
     tools can help with this problem.
     Interprocedural analysis, specified by the	-IPA command-line option, can
     provide the auto-parallelizer with	enough additional information to
     parallelize some loops that contain function calls. For more information
     on	interprocedural	analysis, see the MIPSpro Compiling and	Performance
     Tuning Guide.
     The C*$* ASSERT CONCURRENT CALL Fortran assertion, discussed below,
     allows you to tell the auto-parallelizer to ignore function calls when
     analyzing the specified loops.
   GO TO Statements in Loops
     The use of	GO TO statements in loops can cause two	problems:
     Early exits from loops.
	  It is	not possible to	parallelize loops with early exits, either
	  automatically	or manually.
     Unstructured control flows.
	  The auto-parallelizer	attempts to convert unstructured control flows
	  in loops into	structured constructs. If the auto-parallelizer	cannot
	  restructure these control flows, your	only alternatives are manual
	  parallelization or restructuring the code.
   Complicated Array Subscripts
     There are several cases where array subscripts are too complicated to
     permit parallelization.
     Indirect Array References
	  The auto-parallelizer	is not able to analyze indirect	array
	  references. Consider the following Fortran example.
	  do i=	1,n
	    a(b(i)) ...
	  end do
          This loop cannot be run safely in parallel if the indirect reference
          b(i) has the same value in two different iterations. If every
          element of array b is unique, the loop can safely be made parallel.
          In such cases, use either manual methods or the C*$* ASSERT
          PERMUTATION Fortran assertion, discussed below, to achieve
          parallelism.
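          Before asserting that an index array is a permutation, it can be
          worth verifying the claim at run time in a debug build. The C
          function below is a minimal sketch of such a check, not part of
          any MIPSpro library; the name is_permutation and the use of
          zero-based indices are assumptions. It is stricter than the text
          requires, because it also rejects indices outside 0 through n-1.

```c
#include <stdlib.h>

/* Returns 1 if b[0..n-1] holds each value in 0..n-1 exactly once
   (a permutation), 0 if an index is repeated or out of range, and -1
   if scratch memory cannot be allocated. */
int is_permutation(const int *b, int n)
{
    if (n <= 0)
        return 1;              /* an empty index array has no repeats */
    unsigned char *seen = calloc((size_t)n, 1);
    if (seen == NULL)
        return -1;
    int ok = 1;
    for (int i = 0; i < n && ok; i++) {
        if (b[i] < 0 || b[i] >= n || seen[b[i]])
            ok = 0;            /* out of range or repeated index */
        else
            seen[b[i]] = 1;
    }
    free(seen);
    return ok;
}
```

          If the check returns 1, no two iterations of a loop over a[b[i]]
          can write the same element.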
     Unanalyzable Subscripts
          The auto-parallelizer cannot parallelize loops containing arrays
          with unanalyzable subscripts. In the following case, the
          auto-parallelizer is not able to analyze the division (/) in the
          array subscript and cannot reorder the loop.
          do i = l,u,2
            a(i/2) = ...
          end do
     Hidden Knowledge
          In the following example there may be hidden knowledge about the
          relationship between the variables m and n.
          do i = 1,n
            a(i) = a(i+m)
          end do
          The loop can be run in parallel if m > n, because the elements
          read, a(1+m) through a(n+m), then do not overlap the elements
          written, a(1) through a(n). However, because the auto-parallelizer
          does not know the values of the variables, it cannot make the loop
          parallel.
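          One way to make such hidden knowledge explicit is to test it in
          the code. The C function below is a sketch under assumptions: the
          name shift_if_disjoint is hypothetical and the indices are
          zero-based. It tests m >= n, which is already enough to keep the
          elements read apart from the elements written; the condition
          m > n given in the text implies it.

```c
/* Performs a[i] = a[i + m] for i = 0 .. n-1 only when the elements read,
   a[m .. m+n-1], cannot overlap the elements written, a[0 .. n-1].
   Returns 1 if the copy was done, 0 if the ranges might overlap.
   The array a must hold at least n + m elements. */
int shift_if_disjoint(double *a, int n, int m)
{
    if (n < 0 || m < n)
        return 0;   /* reads and writes overlap: order would matter */
    for (int i = 0; i < n; i++)
        a[i] = a[i + m];
    return 1;
}
```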
   Conditionally Assigned Temporary Variables in Loops
     When parallelizing	a loop,	the auto-parallelizer often localizes
     (privatizes) temporary scalar and array variables.	Consider the following
     example.
     do	i = 1,n
       do j = 1,n
	 tmp(j)	= ...
       end do
       do j = 1,n
	 a(j,i)	= a(j,i) + tmp(j)
       end do
     end do
     The array tmp is used for local scratch space. To successfully
     parallelize the outer (i) loop, each processor must be given a distinct,
     private tmp array.	In this	example, the auto-parallelizer is able to
     localize tmp and parallelize the loop.  The auto-parallelizer runs	into
     trouble when a conditionally assigned temporary variable might be used
     outside of	the loop, as in	the following example.
     subroutine	s1(a,b)
       common t
       ...
       do i = 1,n
	 if (b(i)) then
	   t = ...
	   a(i)	= a(i) + t
	 end if
       end do
       call s2()
     If	the loop were to be run	in parallel, a problem would arise if the
     value of t	were used inside subroutine s2(). Which	processor's private
     copy of t should s2() use?	If t were not conditionally assigned, the
     answer would be the processor that	executed iteration n. But t is
     conditionally assigned and	the auto-parallelizer cannot determine which
     copy to use.
     The loop is inherently parallel if the conditionally assigned variable t
     is localized. If the value of t is not used outside the loop, you should
     replace t with a local variable. Unless t is a local variable, the
     auto-parallelizer must assume that s2() might use it.
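     The same rewrite can be sketched in C. In the functions below, the
     file-scope variable t_shared stands in for the COMMON variable t, and
     the expression 2.0 * a[i] is a hypothetical placeholder for the
     computation that the Fortran fragment elides. Both function names are
     illustrative only.

```c
/* Hypothetical file-scope temporary, standing in for the COMMON variable t. */
double t_shared;

/* Original shape: the temporary lives outside the function, so a compiler
   must assume some later call could read it. */
void update_shared(double *a, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (b[i]) {
            t_shared = 2.0 * a[i];   /* placeholder computation */
            a[i] = a[i] + t_shared;
        }
    }
}

/* Rewritten shape: the temporary is local to the loop body, so each
   iteration has its own copy and nothing outside the loop can see it. */
void update_local(double *a, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (b[i]) {
            double t = 2.0 * a[i];   /* placeholder computation */
            a[i] = a[i] + t;
        }
    }
}
```

     Both versions leave the same values in a; only the second makes it
     evident that the temporary does not outlive an iteration.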
   Unanalyzable	Pointer	Usage in C/C++
     The C and C++ languages have features that	make them more difficult than
     Fortran to	automatically parallelize. Many	of these features are related
     to	the use	of pointers. The following practices involving pointers
     interfere with the	auto-parallelizer's effectiveness:
     Arbitrary Pointer Dereferences
	  The auto-parallelizer	does not analyze arbitrary pointer
	  dereferences.	The only pointers it analyzes are array	references and
	  pointer dereferences that can	be converted into array	references.
	  The auto-parallelizer	can subdivide the trees	formed by
	  dereferencing	arbitrary pointers and run the parts in	parallel.
	  However, it cannot determine if the tree is really a directed	graph
	  with an unsafe multiple reference. Therefore the parallelization is
	  not done.
     Arrays of Arrays
          Multidimensional arrays are sometimes implemented as arrays of
          arrays. Consider this example:
          double **p;
          for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
              p[i][j] = ...
          If p is a true multidimensional array, the outer loop can be run
          safely in parallel. If two of the array pointers, p[2] and p[3] for
          example, reference the same array, the loop must not be run in
          parallel. Although this duplicate reference is unlikely, the
          auto-parallelizer cannot prove it doesn't exist. You can avoid this
          problem by always using true arrays. To parallelize the code
          fragment above, rewrite it as follows:
	  double p[n][n];
	  for (int i = 0; i < n; i++)
	    for	(int j = 0; j <	n; j++)
	      p[i][j] =	...
          Note:  Although ANSI C does not allow variable-sized
          multidimensional arrays, there is a proposal to allow them in the
          next standard. The MIPSpro 7.2 auto-parallelizer already implements
          this proposal.
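          Variable-length arrays later became part of the C99 standard. As
          a sketch in that dialect, a true n-by-n array can be passed to a
          function as shown below. The function name fill_table and the
          value stored in each element are hypothetical placeholders.

```c
/* p is a true n-by-n array passed as a C99 variable-length array
   parameter: row i occupies p[i][0 .. n-1], and distinct rows can never
   share storage. */
void fill_table(int n, double p[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            p[i][j] = (double)(i * n + j);   /* placeholder value */
}
```

          A caller would declare double t[n][n] and call fill_table(n, t).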
     Loops Bounded by Pointer Comparisons
          The auto-parallelizer reorders only those loops in which the number
          of iterations can be exactly determined. In Fortran programs this
          is rarely a problem, but in C and C++ subtle issues relating to
          overflow and unsigned arithmetic can come into play. One
          consequence is that loops should not be bounded by pointer
          comparisons such as
          int *pl, *pu;
          for (int *p = pl; p != pu; p++)
          This loop cannot be made parallel, and compiling it results in a
          .l file entry stating that the bound cannot be standardized. To
          avoid this result, restructure the loop to be of the form
          int lb, ub;
          for (int i = lb; i <= ub; i++)
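          A pointer-bounded loop can usually be restructured by computing
          the trip count once, before the loop. The C function below is a
          minimal sketch; the name sum_range is hypothetical, and it assumes
          that pl and pu point into the same array with pl <= pu.

```c
#include <stddef.h>

/* Sums the elements in [pl, pu). Both pointers must point into (or one
   past the end of) the same array, with pl <= pu. */
long sum_range(const int *pl, const int *pu)
{
    ptrdiff_t count = pu - pl;   /* trip count, known before the loop runs */
    long total = 0;
    for (ptrdiff_t i = 0; i < count; i++)
        total += pl[i];
    return total;
}
```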
     Aliased Parameter Information
          Perhaps the most frequent impediment to parallelizing C and C++ is
          aliased information. Although Fortran guarantees that multiple
          parameters to a subroutine are not aliased to each other, C and C++
          do not. Consider the following example:
          void sub(double *a, double *b, int n) {
            for (int i = 0; i < n; i++)
              a[i] = b[i];
          }
          This loop can be parallelized only if arrays a and b do not overlap.
          With the option -OPT:alias=restrict, you can assure the
          auto-parallelizer that the arrays do not overlap. This assurance
          permits the auto-parallelizer to proceed with the parallelization.
          See the MIPSpro Compiling and Performance Tuning Guide for details
          about this option.
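          The C99 standard offers a related, per-pointer way to make the
          same promise: the restrict qualifier. The sketch below uses that
          standard keyword, which is a different mechanism from the
          -OPT:alias=restrict option; whether the MIPSpro 7.2 compilers
          accept the keyword is not stated in this man page, and the
          function name copy_array is hypothetical.

```c
/* The restrict qualifiers are the programmer's promise that, inside this
   function, a and b do not refer to overlapping storage. Calling it with
   overlapping arrays is undefined behavior. */
void copy_array(double *restrict a, const double *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}
```

          The command-line option applies to a whole compilation, while the
          qualifier limits the promise to the pointers that carry it.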
     Incorrectly Parallelized Nested Loops
          The auto-parallelizer parallelizes a loop by distributing its
          iterations among the available processors.
          Because the resulting performance is usually better, the
          auto-parallelizer tries to parallelize the outermost loop.
          If it cannot do so, probably for one of the reasons mentioned in the
          previous section, it tries to interchange the outermost loop with an
          inner one that it can parallelize.
	  Example Nested Loops
	  do i = 1,n
	    do j = 1,n
	      ...
	    end	do
	  end do
          Even when most of your program is parallelized, it is possible that
          the wrong loop is parallelized. Given a nest of loops, the
          auto-parallelizer parallelizes only one of the loops in the nest.
          In general, it is better to parallelize outer loops rather than
          inner ones.
          The auto-parallelizer tries either to parallelize the outer loop
          or to interchange the parallel loop so that it is outermost, but
          sometimes this is not possible. For any of the reasons mentioned in
          the previous section, the auto-parallelizer might be able to
          parallelize an inner loop but not the outer one. Even if this
          results in most of your code being parallelized, it might be
          advantageous to modify your code so that the outer loop is
          parallelized.
          It is better to parallelize loops that do not have very small trip
          counts. Consider the following example.
          do i = 1,m
            do j = 1,n
          The auto-parallelizer may decide to parallelize the i loop, but if m
          is very small, it would be better to interchange the j loop to be
          outermost and then parallelize it. The auto-parallelizer might not
          have any way to know that m is small. In such cases, you can either
          use the C*$* ASSERT DO PREFER assertions discussed in the next
          section to tell the auto-parallelizer that it is better to
          parallelize the j loop, or use manual parallelism directives.
	  Because of memory hierarchies, performance can be improved if	the
	  same processors access the same data in all parallel loop nests.
	  Consider the following two examples.
          Example   Inefficient Loop
          do i = 1,n
            ...a(i)...
          end do
          do i = n,1,-1
            ...a(i)...
          end do
          Assume that there are p processors. In the first loop, the first
          processor accesses the first n/p elements of a, the second
          processor accesses the next n/p, and so on. In the second loop,
          the first processor accesses the last n/p elements of a. Assuming
          n is not too large, those elements will be in the cache of a
          different processor. Accessing data that is in some other
          processor's cache can be very expensive. This example might run much
          more efficiently if we reverse the direction of one of the loops.
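          The ownership argument can be checked with a few lines of C. The
          sketch below assumes the simple block distribution described in
          the text, with contiguous blocks of ceil(n/p) iterations per
          processor; the scheduling actually used by the run-time system may
          differ. All function names are hypothetical, and indices are
          zero-based.

```c
/* Processor that runs iteration k (0-based) when n iterations are split
   into p contiguous blocks of ceil(n/p) iterations each. */
int owner_of_iteration(int k, int n, int p)
{
    int block = (n + p - 1) / p;
    return k / block;
}

/* Forward loop: iteration k touches element k. */
int owner_forward(int element, int n, int p)
{
    return owner_of_iteration(element, n, p);
}

/* Reversed loop: iteration k touches element n - 1 - k. */
int owner_reversed(int element, int n, int p)
{
    return owner_of_iteration(n - 1 - element, n, p);
}
```

          With n = 8 and p = 4, element 0 belongs to processor 0 in the
          forward loop but to processor 3 in the reversed loop, which is the
          cache mismatch the paragraph describes.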
	  Example   Efficient Loop
	  do i = 1,n
	    do j = 1,n
	      a(i,j) = b(j,i) +	...
	    end	do
	  end do
	  do i = 1,n
	    do j = 1,n
	      b(i,j) = a(j,i) +	...
	    end	do
	  end do
          In this second example, the auto-parallelizer might choose to
          parallelize the outer loop in both nests. This means that in the
          first loop the first processor accesses the first n/p rows of a
          and the first n/p columns of b, while in the second loop the first
          processor accesses the first n/p columns of a and the first n/p
          rows of b. This example will run much more efficiently if we
          parallelize the i loop in one nest and the j loop in the other. You
          can add the prefer assertions described in the next section to
          solve this problem.
Unnecessarily Parallelized Loops
     The auto-parallelizer may parallelize loops that would run better
     sequentially. While this is usually not a disaster, it can cause
     unnecessary overhead, because running a loop in parallel has a fixed
     cost. If, for example, a loop has a small number of iterations, it is
     faster to execute the loop sequentially. When bounds are unknown (and
     even sometimes when they are known), the auto-parallelizer parallelizes
     loops conditionally. In other words, code is generated for both a
     parallel and a sequential version of the loop. The parallel version is
     executed only when the auto-parallelizer estimates that there is
     sufficient work to make parallel execution worthwhile. This estimate
     depends on the iteration count, the code inside the loop body, the
     number of processors available, and the auto-parallelizer's estimate of
     the overhead cost to invoke a parallel loop. You can control the
     compiler's estimate of the invocation overhead using the option
     -LNO:parallel_overhead=n. The default value for n varies on different
     systems, but typical values are in the low thousands.
     Generating two versions of the loop avoids going parallel in small
     trip count cases, but versioning does incur the overhead of the dynamic
     check. You can use the DO PREFER assertions to ensure that a loop
     goes parallel or sequential without incurring a run-time test.
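     The idea of loop versioning can be sketched in C as follows. This is an
     illustration only, not the code the compiler generates: the cost model
     in worth_parallel is a hypothetical simplification, every name is an
     assumption, and the "parallel" version merely processes the chunks one
     after another where real multiprocessed code would hand each chunk to a
     processor.

```c
/* Hypothetical cost model: iterations times work per iteration must
   exceed the overhead of starting a parallel loop. */
int worth_parallel(long iterations, long work_per_iteration, long overhead)
{
    return iterations * work_per_iteration > overhead;
}

/* Records which version ran last: 0 = sequential, 1 = "parallel". */
int last_version_used;

/* Stand-in for the parallel version: the iteration space is cut into
   nproc contiguous chunks. Here the chunks run one after another; in
   real multiprocessed code each chunk would run on its own processor. */
static void scale_chunked(double *a, long n, double s, int nproc)
{
    long chunk = (n + nproc - 1) / nproc;
    for (int p = 0; p < nproc; p++) {
        long lo = p * chunk;
        long hi = lo + chunk < n ? lo + chunk : n;
        for (long i = lo; i < hi; i++)
            a[i] *= s;
    }
}

static void scale_sequential(double *a, long n, double s)
{
    for (long i = 0; i < n; i++)
        a[i] *= s;
}

/* Versioned loop: a run-time test picks one of the two versions. */
void scale(double *a, long n, double s, int nproc, long overhead)
{
    if (nproc > 1 && worth_parallel(n, 1, overhead)) {
        last_version_used = 1;
        scale_chunked(a, n, s, nproc);
    } else {
        last_version_used = 0;
        scale_sequential(a, n, s);
    }
}
```

     The test in scale is the dynamic check whose cost the paragraph above
     mentions; a prefer assertion removes the need for it.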
     Nested parallelism	is not supported. Consider the following case:
     subroutine	caller
       do i
	 call sub
       end do
     subroutine	sub
       ...
       do i
	 ..
       end do
     end
     Suppose that the first loop is parallelized. It is	not possible to
     execute the loop inside sub in parallel whenever sub is called by caller.
     Thus the auto-parallelizer	must generate a	test for every parallel	loop
     that checks whether the loop is being invoked from	another	parallel loop
     or	region.	While this check is not	very expensive,	in some	cases it can
     add to overhead. If the user knows	that sub is always called from caller,
     the user can use the prefer directives to force the loop in sub to	go
     sequential.
Assisting the Silicon Graphics Automatic Parallelizer
     This section discusses actions you can take to enhance the performance of
     the auto-parallelizer.
   Assisting the Automatic Parallelizer
     There are circumstances that interfere with the auto-parallelizer's
     ability to	optimize programs. As shown in Parallelization Failures	With
     the Automatic Parallelizer, problems are sometimes	caused by coding
     practices.	Other times, the auto-parallelizer does	not have enough
     information to make good parallelization decisions. You can pursue	three
     strategies	to attack these	problems and achieve better results with the
     auto-parallelizer.
     The first approach	is to modify your code to avoid	coding practices that
     the auto-parallelizer cannot analyze well.
     The second	strategy is to assist the auto-parallelizer with the manual
     parallelization directives	described in the MIPSpro Compiling and
     Performance Tuning	Guide. The auto-parallelizer is	designed to recognize
     and coexist with manual parallelism. You can use manual directives	with
     some loop nests, while leaving others to the auto-parallelizer. This
     approach has both positive	and negative aspects.
     On	the positive side, the manual parallelism directives are well defined
     and deterministic.	If you use a manual directive, the specified loop will
     run in parallel.
     Note:  This last statement assumes that the trip count is greater than
     one and that the specified loop is not nested in another parallel loop.
     On	the negative side, you must carefully analyze the code to determine
     that parallelism is safe. Also, you must mark all variables that need to
     be	localized.
     The third alternative is to use the automatic parallelization directives
     and assertions to give the	auto-parallelizer more information about your
     code. The automatic directives and	assertions are described in Directives
     and Assertions for	Automatic Parallelization. Like	the manual directives,
     they have positive	and negative features:
     On	the positive side, automatic directives	and assertions are easier to
     use and they allow	you to express the information you know	without	your
     having to be certain that all the conditions for parallelization are met.
     On	the negative side, they	are hints and thus do not impose parallelism.
     In	addition, as with the manual directives, you must ensure that you are
     using them	legally. Because they require less information than the	manual
     directives, automatic directives and assertions can have subtle meanings.
   Directives and Assertions for Automatic Parallelization
     Directives enable, disable, or modify features of the auto-parallelizer.
     Assertions assist the auto-parallelizer by providing it with additional
     information about the source program. The automatic directives and
     assertions do not impose parallelism; they give hints and assertions to
     the auto-parallelizer to assist it in parallelizing the right loops. To
     invoke a directive or assertion, include it in the input file. Listed
     below are the Fortran directives and assertions for the
     auto-parallelizer.
     C*$* NO CONCURRENTIZE
	  Do not parallelize either a subroutine or file.
     C*$* CONCURRENTIZE
	  Not used. (See below.)
     C*$* ASSERT DO (CONCURRENT)
	  Ignore perceived dependences between two references to the same
	  array	when parallelizing.
     C*$* ASSERT DO (SERIAL)
	  Do not parallelize the following loop.
     C*$* ASSERT CONCURRENT CALL
	  Ignore subroutine calls when parallelizing.
     C*$* ASSERT PERMUTATION (array_name)
	  Array	array_name is a	permutation array.
     C*$* ASSERT DO PREFER (CONCURRENT)
	  Parallelize the following loop if it is safe.
     C*$* ASSERT DO PREFER (SERIAL)
	  Do not parallelize the following loop.
	  Note:	 The general compiler option -LNO:ignore_pragmas causes	the
	  auto-parallelizer to ignore all of these directives and assertions.
     C*$* NO CONCURRENTIZE
	  The C*$* NO CONCURRENTIZE directive prevents parallelization.	Its
	  effect depends on where it is	placed.
          When placed inside a subroutine, the directive prevents the
          parallelization of the subroutine. In the following example, SUB1()
          is not parallelized.
          Example:
		 SUBROUTINE SUB1
	  C*$* NO CONCURRENTIZE
		   ...
		 END
          When placed outside of a subroutine, C*$* NO CONCURRENTIZE prevents
          the parallelization of all the subroutines in the file. The
          subroutines SUB2() and SUB3() are not parallelized in the next
          example.
          Example:
		 SUBROUTINE SUB2
		   ...
		 END
	  C*$* NO CONCURRENTIZE
		 SUBROUTINE SUB3
		   ...
		 END
	  The C*$* NO CONCURRENTIZE directive is valid only when the -pfa or
	  -pca command-line option is used.
     C*$* CONCURRENTIZE
	  The C*$* CONCURRENTIZE directive exists only to maintain backwards
	  compatibility, and its use is	discouraged. Using the -pfa or -pca
	  option replaces using	this directive.
     C*$* ASSERT DO (CONCURRENT)
	  C*$* ASSERT DO (CONCURRENT) says that	when analyzing the loop
	  immediately following	this assertion,	the auto-parallelizer should
	  ignore any perceived dependences between two references to the same
	  array. The following example is a correct use	of the assertion when
	  M > N.
          Example:
          C*$* ASSERT DO (CONCURRENT)
                 DO I = 1, N
                   A(I) = A(I+M)
                 END DO
          This assertion is usually used to help the auto-parallelizer with
          loops that have indirect array references. Keep the following in
          mind when using it:
          If multiple loops in a nest can be parallelized, C*$* ASSERT DO
          (CONCURRENT) causes the auto-parallelizer to prefer the loop
          immediately following the assertion.
          The assertion does not affect how the auto-parallelizer analyzes
          CALL statements or dependences between two potentially aliased
          pointers.
	  Note:	 If there are real dependences between array references, C*$*
	  ASSERT DO (CONCURRENT) may cause the auto-parallelizer to generate
	  incorrect code.
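          As a sketch of the indirect-reference case, the hypothetical
          subroutine below updates A through an index array IDX. The names
          SCATTR, IDX, A, and B are illustrative and do not come from this
          page. The assertion is correct here only if the caller guarantees
          that IDX(1) through IDX(N) are all different; if two entries are
          equal, the loop carries a real dependence.
                 SUBROUTINE SCATTR(A, B, IDX, N)
                 INTEGER N, I
                 INTEGER IDX(N)
                 REAL A(*), B(N)
          C      Assumption: IDX(1..N) holds N distinct values.
          C*$* ASSERT DO (CONCURRENT)
                 DO I = 1, N
                   A(IDX(I)) = A(IDX(I)) + B(I)
                 END DO
                 END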
     C*$* ASSERT DO (SERIAL)
          C*$* ASSERT DO (SERIAL) instructs the auto-parallelizer not to
          parallelize the loop immediately following the assertion.
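          A minimal sketch of its use follows. The subroutine name INIT4 and
          its arguments are hypothetical. The assumption is that a loop with
          only four iterations does too little work to repay the cost of
          running in parallel, so the developer keeps it serial.
                 SUBROUTINE INIT4(A, B)
                 REAL A(4), B(4)
                 INTEGER I
          C      Assumption: too little work here to be worth parallelizing.
          C*$* ASSERT DO (SERIAL)
                 DO I = 1, 4
                   A(I) = B(I)
                 END DO
                 END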
     C*$* ASSERT CONCURRENT CALL
          The C*$* ASSERT CONCURRENT CALL assertion tells the
          auto-parallelizer to ignore subroutine calls contained in a loop
          when deciding whether that loop can be run in parallel. The
          assertion applies to the loop that immediately follows it and to
          all loops nested inside that loop. In the following example, the
          auto-parallelizer ignores the call to subroutine FRED() when it
          analyzes the loop.
	  C*$* ASSERT CONCURRENT CALL
		 DO I =	1, N
		   CALL	FRED
		   ...
		 END DO
		 SUBROUTINE FRED
		   ...
		 END
	  To prevent incorrect parallelization,	you must make sure the
	  following conditions are met when using C*$* ASSERT CONCURRENT CALL:
	  A subroutine cannot read from	a location inside the loop that	is
	  written to during another iteration. This rule does not apply	to a
	  location that	is a local variable declared inside the	subroutine.
	  A subroutine cannot write to a location inside the loop that is read
	  from during another iteration. This rule does	not apply to a
	  location that	is a local variable declared inside the	subroutine.
          The following code shows an illegal use of the assertion. Subroutine
          FRED() writes to variable T, which WILMA() also reads during other
          iterations.
	  C*$* ASSERT CONCURRENT CALL
		 DO I =	1,M
		   CALL	FRED(B,	I, T)
		   CALL	WILMA(A, I, T)
		 END DO
		 SUBROUTINE FRED(B, I, T)
		   REAL	B(*)
		   T = B(I)
		 END
		 SUBROUTINE WILMA(A, I,	T)
		   REAL	A(*)
		   A(I)	= T
		 END
          By localizing the variable T, you could manually parallelize the
          above example safely. However, the auto-parallelizer does not know
          to localize T, and because of the assertion it parallelizes the
          loop illegally.
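          One way to localize T by hand is sketched here with the manual
          C$DOACROSS directive covered in mp(3f). The subroutine name DRIVER
          is hypothetical, and the clause list is an assumption to be checked
          against that page: I and T get a private copy on each processor,
          while A, B, and M are shared. With T local, the value stored by
          FRED() is read by WILMA() in the same iteration only.
                 SUBROUTINE DRIVER(A, B, M)
                 INTEGER M, I
                 REAL A(*), B(*), T
          C      Assumption: LOCAL gives each processor its own copy of T.
          C$DOACROSS LOCAL(I, T), SHARE(A, B, M)
                 DO I = 1, M
                   CALL FRED(B, I, T)
                   CALL WILMA(A, I, T)
                 END DO
                 END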
     C*$* ASSERT PERMUTATION (array_name)
	  C*$* ASSERT PERMUTATION tells	the auto-parallelizer that array_name
	  is a permutation array: every	element	of the array has a distinct
	  value. Array B is asserted to	be a permutation array in this
	  example.
	  Example:
	  C*$* ASSERT PERMUTATION (B)
		 DO I =	1, N
		   A(B(I)) = ...
		 END DO
	  As shown in the previous example, you	can use	this assertion to
	  parallelize loops that use arrays for	indirect addressing. Without
	  this assertion, the auto-parallelizer	is not able to determine that
	  the array elements used as indexes are distinct.
	  Note:	 The assertion does not	require	the permutation	array to be
	  dense.
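          To illustrate the note, the hypothetical subroutine below uses an
          index array whose values are distinct but skip 1, 3, and 6. Reading
          "dense" as "covering every index in a range" is an interpretation,
          and the names SPREAD and C are illustrative only.
                 SUBROUTINE SPREAD(A, C)
                 REAL A(*), C(4)
                 INTEGER B(4), I
          C      Distinct values, but 1, 3, and 6 are never used as indexes.
                 DATA B /7, 2, 5, 4/
          C*$* ASSERT PERMUTATION (B)
                 DO I = 1, 4
                   A(B(I)) = C(I)
                 END DO
                 END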
     C*$* ASSERT DO PREFER (CONCURRENT)
          C*$* ASSERT DO PREFER (CONCURRENT) says that the auto-parallelizer
          should parallelize the loop immediately following the assertion, if
          it is safe to do so. The following code encourages the
          auto-parallelizer to run the I loop in parallel.
          C*$* ASSERT DO PREFER (CONCURRENT)
		 DO I =	1, M
		   DO J	= 1, N
		     A(I,J) = B(I,J)
		   END DO
		   ...
		 END DO
          When the assertion is used in a loop nest, the following rules apply:
	  If the loop specified	by this	assertion is safe to parallelize, the
	  auto-parallelizer chooses it to parallelize, even if other loops in
	  the nest are safe.
	  If the specified loop	is not safe, the auto-parallelizer chooses
	  another loop that is safe, usually the outermost.
	  This assertion can be	applied	to more	than one loop in a nest. In
	  this case, the auto-parallelizer uses	its heuristics to choose one
	  of the specified loops.
	  Note:	C*$* ASSERT DO PREFER (CONCURRENT) is always safe to use. The
	  auto-parallelizer will not illegally parallelize a loop because of
	  this assertion.
     C*$* ASSERT DO PREFER (SERIAL)
          The C*$* ASSERT DO PREFER (SERIAL) assertion asks the
          auto-parallelizer not to parallelize the loop that immediately
          follows. In the following case, the assertion requests that the J
          loop be run serially.
		 DO I =	1, M
          C*$* ASSERT DO PREFER (SERIAL)
		   DO J	= 1, N
		     A(I,J) = B(I,J)
		   END DO
		   ...
		 END DO
          C*$* ASSERT DO PREFER (SERIAL) applies only to the loop directly
          after the assertion.