Step-by-step guide to R8000 optimization

The following are recommended steps for porting FORTRAN programs from the Challenge Irix 5.2 environment to Power Challenge and Irix 6.0 and optimizing them there. Considerably more detail is available on-line in the MIPSpro books under Iris Insight or in the man pages for f77 and pfa.

We describe here a small subset of the switches that have the broadest benefit based on our benchmarking experience.

I. Uniprocessor Performance Optimization

Before attempting multiprocessing, port and fully optimize for 1 processor. The largest effects are usually seen in climbing from step 1 to 3.
(1)
Remake with -O2 -mips2, linking with -mips2, and retest

You are still using the same compiler as on the Challenge; this step is a sanity check. Performance may be no better than on the Challenge.
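As a sketch, the step-1 compile and link lines might look like the following; main.f and myprog are hypothetical names standing in for your own sources and executable.

```shell
# Step 1 (sketch): rebuild with the Irix 5.x-compatible options and retest.
f77 -O2 -mips2 -c main.f
f77 -mips2 -o myprog main.o
```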

(2)
Remake with -O2 -mips4, linking with -p -mips4, then retest and profile

You are now using the new MIPSpro compiler, generating new R8000 instructions (such as multiply-add) and 64-bit addressing. Porting problems may arise. Answers may change slightly (with better accuracy) because the multiply-add instruction does not do rounding between the multiply and add.
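A sketch of the step-2 build with profiling enabled at link time; the file and program names are hypothetical, and the exact prof invocation is described in the prof man page.

```shell
# Step 2 (sketch): MIPSpro compiler, R8000 instruction set, profiling link.
f77 -O2 -mips4 -c main.f
f77 -p -mips4 -o myprog main.o
./myprog          # a profiled run writes pc-sampling data
prof myprog       # per-subroutine times to guide steps 3-6
```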

For steps 3-6, try the recommended switch and reprofile the run. Compare the profile time for each subroutine to determine which subroutines benefit from the optimizations, and back off the switch wherever it doesn't help. Each switch usually improves performance, but may not on short loops or other special cases. Each switch may introduce changes in arithmetic rounding or accuracy as noted.

To focus the tuning effort where it will have the greatest effect, it is advisable on large codes to perform this experimentation only on the most time-consuming subroutines.

(3)
Increase optimization to -O3 -mips4 -p, and add -lfastm at the end of the link command line to link with the fast math library.

You are now invoking software pipelining (SWP) for the R8000 architecture. (This has nothing to do with the -O3 optimizations of Irix 5.x and before.) Vector-style loops with about 10 or more iterations generally will run much faster. Loops with many lines of code generally are not pipelined. Loops with function calls (including many intrinsics) are not pipelined.

Answers may change slightly at this stage due to additional rounding differences. The KAP loop analyzer is on by default with -O3 and may perform optimizations that change the order of summations, for example. To see the performance and answers generated without KAP optimizations, recompile with the switch -WK,-so=0,-o=0,-r=0 added. However, if you later intend to use -pfa to parallelize this subroutine, you will have to remove this switch, since KAP also performs the parallel optimizations.
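The step-3 build might be sketched as follows (file names hypothetical); the last line shows the same compile with the KAP loop optimizations turned off, for comparing answers.

```shell
# Step 3 (sketch): software pipelining plus the fast math library.
f77 -O3 -mips4 -p -c main.f
f77 -O3 -mips4 -p -o myprog main.o -lfastm

# Same compile without KAP loop optimizations, to compare answers:
f77 -O3 -mips4 -p -WK,-so=0,-o=0,-r=0 -c main.f
```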

Because of SWP, it is more important for the R8000 to have inner loops with large iteration counts, as compared to the R4400. This may affect the strategy for determining the optimal nesting of loops. When tuning for both single processor performance and multiprocessing, there may be tradeoffs between placing the largest iteration count on the inside for best SWP or on the outside to provide the greatest MP opportunity.
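A hypothetical illustration of the nesting tradeoff: with n much larger than m, placing the j loop innermost gives SWP a long, stride-1 inner loop, while interchanging the loops would instead give each processor more work per outer iteration under multiprocessing.

```fortran
c     Sketch only: a is scaled column by column. The inner j loop
c     runs over the first subscript, so it is stride-1 and, with
c     large n, a good software-pipelining candidate.
      subroutine scale(a, s, n, m)
      integer n, m, i, j
      real a(n, m), s(m)
      do 10 i = 1, m
         do 10 j = 1, n
            a(j, i) = a(j, i) * s(i)
 10   continue
      end
```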

(4)
Allow the compiler even more latitude with rounding by adding the switch -OPT:roundoff=3 (2 and 1 are less aggressive; 1 is the default).
(5)
If the program spends significant time performing divides, with or without square roots, allow the use of the reciprocal and reciprocal-square-root hardware instructions, which are fast but inexact in the last bit, by adding the switch -OPT:IEEE_arithmetic=3 (2 and 1 are less aggressive; 1 is the default).
(6)
Allow the use of a fast square-root method by adding the switch -OPT:fast_sqrt=ON, but not if the program may take the square root of a zero value, which would then produce a NaN.
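Steps 4-6 can be layered onto the step-3 compile line one switch at a time, reprofiling after each; a sketch, with a hypothetical file name:

```shell
# Step 4: more latitude with rounding.
f77 -O3 -mips4 -OPT:roundoff=3 -c main.f
# Step 5: also allow reciprocal and reciprocal-square-root instructions.
f77 -O3 -mips4 -OPT:roundoff=3 -OPT:IEEE_arithmetic=3 -c main.f
# Step 6: also allow the fast square root (beware of sqrt of zero).
f77 -O3 -mips4 -OPT:roundoff=3 -OPT:IEEE_arithmetic=3 -OPT:fast_sqrt=ON -c main.f
```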

If you wish to examine the effects of SWP on time-critical loops and possibly make source changes to improve performance, then:

(7)
Generate an assembly-language listing (*.s) by adding the switch -S, and examine the messages about SWP and loop cycle counts. You can see just the SWP messages by grep-ing for "swps" in this file. You may be able to decrease cycle counts by splitting loops that don't pipeline into smaller loops that do, or by adding the Cray directive "cdir$ ivdep" when SWP complains about possible recurrences but there are really no data dependences. It may also help to change the number of loop unrollings that SWP performs; the default is 2, so the switches -SWP:unroll_times_max=4 and -SWP:unroll_times_max=1 will show whether more or less unrolling, respectively, is beneficial.
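The recurrence case can be sketched in source: SWP must assume that two trips through the loop below may touch the same element of a, but if the programmer knows the values in ind are distinct, the directive asserts that there is no real dependence. A hypothetical fragment:

```fortran
c     Hypothetical gather-update loop. Without the directive, SWP
c     reports a possible recurrence through a(ind(i)); the directive
c     asserts the index values never repeat, so the loop can pipeline.
cdir$ ivdep
      do 20 i = 1, n
         a(ind(i)) = a(ind(i)) + b(i)
 20   continue
```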

Finally, most of the techniques that lead to efficient cache behavior, such as stride-1 inner loops and local re-use of data, are also applicable to the R8000, although cache effects are less pronounced because (1) the cache is 4 MB instead of 1 MB, (2) it is 4-way set-associative instead of direct-mapped, and (3) there is only one level of cache for floating-point data instead of two.
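A small illustration of stride-1 access and local re-use, assuming column-major f77 arrays (a hypothetical fragment):

```fortran
c     Sketch: the running sum stays in a register, and b and c are
c     each read exactly once, in stride-1 order, so the cache can
c     stream the data without re-fetching any line.
      s = 0.0
      do 40 i = 1, n
         s = s + b(i) * c(i)
 40   continue
```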

II. Multiprocessor Performance Optimization

For a more detailed explanation of SGI multiprocessing tools under Irix 6.0, see the man pages "pfa" and "mp", as well as the on-line Iris Insight documents "POWER Fortran Accelerator User's Guide" and "MIPSpro F77 Programmer's Guide."

As on Challenge systems, parallelization is often performed through the -pfa switch, which invokes the Power Fortran Accelerator; through the manual insertion of DOACROSS directives, recognized with the -mp or -pfa switch; or through a combination of both.
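A minimal sketch of the manual form, assuming a loop whose iterations are independent; the variable names are hypothetical, and the source must be compiled with -mp or -pfa so the directive is recognized.

```fortran
c     Hypothetical independent loop parallelized by hand. The loop
c     index is private to each processor; the arrays and bound are
c     shared.
c$doacross local(i), share(a, b, c, n)
      do 30 i = 1, n
         a(i) = b(i) + c(i)
 30   continue
```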

When experimenting with automatic PFA parallelization, we recommend that you focus -pfa optimization on the most time-consuming subroutines, rather than the entire program at once. Listings generated by the "-pfa list" option show details of the loop optimizations that have been performed.
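For example, a single hot subroutine might be compiled as follows; hotspot.f is a hypothetical file name, and the listing-file naming conventions are described in the pfa man page.

```shell
# Parallelize just one time-consuming subroutine and generate the
# listing that describes the loop optimizations performed.
f77 -pfa list -O3 -mips4 -c hotspot.f
```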

If a program spends significant time doing reduction operations such as summation or dot product, you may get better PFA parallelization by adding the switch -WK,-ro=3 to allow roundoff changes.
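Sketched as a compile line (the file name is hypothetical):

```shell
# Allow PFA to reorder reductions such as summations and dot products.
f77 -pfa -O3 -mips4 -WK,-ro=3 -c dotprod.f
```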

You may find that the Power Challenge exhibits slightly less parallel speedup than the Challenge for the same parallelization strategy. This is because the R8000 has significantly sped up the calculations in the parallel region, but the overhead of communication between processors, being a memory operation, is the same as on the Challenge. This effect will be greater on fine-grained parallel regions and should be negligible on coarse-grained parallel regions.


Rosario Caltabiano, Silicon Graphics Eastern Technology Center