We describe here a small subset of the switches that have the broadest benefit based on our benchmarking experience.
You are using the same compiler as on the Challenge, so this step serves as a sanity check; performance may be no better than on the Challenge.
You are now using the new MIPSpro compiler, generating new R8000 instructions (such as multiply-add) and 64-bit addressing. Porting problems may arise. Answers may change slightly (with better accuracy) because the multiply-add instruction does not do rounding between the multiply and add.
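To make the rounding difference concrete, here is a sketch of the kind of loop affected. The routine name and arguments are illustrative, not taken from any particular code:

```fortran
      SUBROUTINE DOTP(N, X, Y, S)
      INTEGER N, I
      REAL*8 X(N), Y(N), S
      S = 0.0D0
      DO 10 I = 1, N
C        On the R8000 the compiler can turn X(I)*Y(I) + S into a
C        single multiply-add instruction.  The product is not
C        rounded before the add (only the final result is), so S
C        may differ in the last bits from a separate
C        multiply-then-add sequence -- usually slightly more
C        accurately.
         S = S + X(I)*Y(I)
 10   CONTINUE
      RETURN
      END
```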
To focus the tuning effort where it will have the greatest effect, it is advisable on large codes to perform this experimentation only on the most time-consuming subroutines.
You are now invoking software pipelining (SWP) for the R8000 architecture. (This has nothing to do with the -O3 optimizations of IRIX 5.x and earlier.) Vector-style loops with about 10 or more iterations generally run much faster. Loops with many lines of code generally are not pipelined, and loops containing function calls (including many intrinsics) are not pipelined.
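The contrast can be sketched as follows; F here stands for a hypothetical user function, and the arrays are assumed to be declared elsewhere:

```fortran
C     A vector-style loop like this is a good SWP candidate
C     when N is roughly 10 or more:
      DO 10 I = 1, N
         A(I) = B(I)*C(I) + D(I)
 10   CONTINUE

C     A loop containing a call -- here to a hypothetical user
C     function F -- is not software pipelined:
      DO 20 I = 1, N
         A(I) = F(B(I))
 20   CONTINUE
```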
Answers may change slightly at this stage due to more rounding differences. The KAP loop analyzer is on by default with -O3 and may perform optimizations that change the order of summations, for example. To see the performance and answers that are generated without KAP optimizations, you can turn off KAP optimizations by recompiling with the switch -WK,-so=0,-o=0,-r=0 added. However, if you later intend to use -pfa to parallelize this subroutine, you will have to remove this switch since KAP also performs the parallel optimizations.
Because of SWP, it is more important for the R8000 to have inner loops with large iteration counts, as compared to the R4400. This may affect the strategy for determining the optimal nesting of loops. When tuning for both single processor performance and multiprocessing, there may be tradeoffs between placing the largest iteration count on the inside for best SWP or on the outside to provide the greatest MP opportunity.
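As an illustration of the tradeoff, assume M is large and N is small in the following (hypothetical) nest:

```fortran
C     Putting the large-count J loop innermost favors SWP, and
C     with A(J,I) it also gives a stride-1 inner loop:
      DO 20 I = 1, N
         DO 10 J = 1, M
            A(J,I) = A(J,I) + B(J)*C(I)
 10      CONTINUE
 20   CONTINUE
```

For multiprocessing it is the outer loop that is parallelized, so a larger outer iteration count may instead be preferred; with N small as assumed here, this nest pipelines well but offers little MP opportunity.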
If you wish to examine the effects of SWP on time-critical loops and possibly make source changes to improve performance, then:
Finally, most of the techniques that lead to efficient cache behavior, such as stride-1 inner loops and local re-use of data, are also applicable to the R8000, although cache effects are less pronounced because (1) the cache is 4 MB instead of 1 MB, (2) it is 4-way set-associative instead of direct-mapped, and (3) there is only one level of cache for floating-point data instead of two.
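A reminder of what stride-1 means in Fortran, using a generic array update as the example:

```fortran
C     Fortran stores arrays column-major, so the first subscript
C     should vary fastest in the inner loop.  This nest walks
C     A and B with stride 1:
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = A(I,J) + B(I,J)
 10      CONTINUE
 20   CONTINUE
C     Interchanging the loops (I outer, J inner) would instead
C     touch memory with stride N and make poorer use of the cache.
```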
As with Challenge systems, parallelization is often performed through the use of the -pfa switch to invoke the Power Fortran Accelerator, or through the manual insertion of DOACROSS directives along with the use of the -mp or -pfa switch to recognize them, or a combination of both.
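A minimal sketch of a manually parallelized loop, assuming the LOCAL and SHARE clause forms of the SGI DOACROSS directive; variable names are illustrative:

```fortran
C$DOACROSS LOCAL(I), SHARE(A,B,C,N)
      DO 10 I = 1, N
         A(I) = B(I) + C(I)
 10   CONTINUE
```

The directive is a comment to an ordinary compilation and takes effect only when the file is compiled with -mp or -pfa.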
When experimenting with automatic PFA parallelization, we recommend that you focus -pfa optimization on the most time-consuming subroutines, rather than the entire program at once. Listings generated by the "-pfa list" option show details of the loop optimizations that have been performed.
If a program spends significant time doing reduction operations such as summation or dot product, you may get better PFA parallelization by adding the switch -WK,-ro=3 to allow roundoff changes.
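The kind of loop in question is a simple reduction such as:

```fortran
      S = 0.0D0
      DO 10 I = 1, N
C        With roundoff changes allowed (-WK,-ro=3), each
C        processor can accumulate a partial sum that is combined
C        at the end.  The additions then occur in a different
C        order, so S may differ slightly in the last bits.
         S = S + A(I)
 10   CONTINUE
```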
You may find that the Power Challenge exhibits slightly less parallel speedup than the Challenge for the same parallelization strategy. This is because the R8000 has significantly sped up the calculations in the parallel region, but the overhead of communication between processors, being a memory operation, is the same as on the Challenge. This effect will be greater on fine-grained parallel regions and should be negligible on coarse-grained parallel regions.