Optimizing for the R8000 CPU

The R8000 CPU can execute four floating point operations per clock cycle, in addition to integer and load instructions. At a clock frequency of 75 MHz, this gives a peak performance of 300 Mflop/s. To come close to this speed, it is necessary to have at least a superficial understanding of the circumstances under which the best performance is achieved.

The benchmarks below illustrate that speeds in the range 95 - 245 Mflop/s are obtainable in practice in simple loops.
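The peak figure and the measured range quoted above are related by simple arithmetic. The fragment below is only an illustrative sketch; the function names `peak_mflops` and `fraction_of_peak` are made up for this page, and the numbers are the ones given in the text.

```c
/* Peak rate in Mflop/s: floating point operations per clock cycle
   times the clock frequency in MHz. */
double peak_mflops(double flops_per_cycle, double clock_mhz)
{
    return flops_per_cycle * clock_mhz;
}

/* Fraction of the peak rate reached by a measured rate
   (both given in Mflop/s). */
double fraction_of_peak(double measured_mflops, double peak)
{
    return measured_mflops / peak;
}
```

With the numbers in the text, `peak_mflops(4.0, 75.0)` gives 300, and the measured range of 95 to 245 corresponds to roughly 32% to 82% of peak.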

A step-by-step introduction to R8000 optimization, provided by SGI, is a good starting point. Further information can be found in a detailed discussion of the compiler options.

A benchmark program, with 14 different loops, illustrates the effect of the options recommended in the step-by-step introduction mentioned above.
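The benchmark program itself is not reproduced here. As a rough sketch of how a single loop of this kind might be timed, the fragment below measures a daxpy-style loop with the standard C `clock()` function. The loop, the function names `daxpy` and `daxpy_mflops`, and the choice of array contents are assumptions made for illustration; they are not taken from the 14-loop benchmark.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical example loop (not one of the 14 benchmark loops):
   y[i] += a * x[i], i.e. 2 floating point operations per element. */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Run the loop `reps` times over arrays of length n and return the
   measured rate in Mflop/s. Returns -1.0 if the arguments are invalid,
   memory could not be allocated, or the run was too short for
   clock() to resolve. */
double daxpy_mflops(int n, int reps)
{
    double *x, *y, seconds, flops;
    volatile double sink;  /* reads a result so the loop is not discarded */
    clock_t t0, t1;
    int i, r;

    if (n <= 0 || reps <= 0)
        return -1.0;

    x = malloc((size_t)n * sizeof *x);
    y = malloc((size_t)n * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1.0;
    }

    for (i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = 0.0;
    }

    t0 = clock();
    for (r = 0; r < reps; r++)
        daxpy(n, 1.0e-3, x, y);
    t1 = clock();

    sink = y[n / 2];
    (void)sink;

    seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    flops = 2.0 * (double)n * (double)reps;

    free(x);
    free(y);

    if (seconds <= 0.0)
        return -1.0;
    return flops / seconds / 1.0e6;
}
```

Note that `clock()` reports CPU time with limited resolution, so `reps` should be large enough for the run to last a noticeable fraction of a second. Choosing `n` so that both arrays fit in the cache, or deliberately do not, gives a feel for how much the result depends on cache efficiency.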

A summary of the results:

options                                                       Mflop/s
---------------------------------------------------------------------
-O2 -mips2                                                       26.4
-O2                                                              31.7
-O3 -lfastm                                                      78.6
-O3 -OPT:round=3 -lfastm                                        104.2
-O3 -OPT:round=3:IEEE_ar=3 -lfastm                              131.2
-O3 -OPT:round=3:IEEE_ar=3:fast_sq=ON -lfastm                   146.1
-O3 -OPT:round=3:IEEE_ar=3:fast_sq=ON -GCM:array_sp=ON -lfastm  146.4

A few words of advice, based on the benchmark results

Pay particular attention to software pipelining and cache efficiency. Software pipelining interleaves instructions from consecutive iterations of a loop. The cache is a buffer memory of a few MB, with particularly fast transfer rates between memory and CPU.
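To make the idea of interleaving consecutive iterations concrete, here is a hand-written sketch in C. It is only an illustration of the principle: the compiler does this kind of scheduling itself at the instruction level, and the code below is not what it emits. The function names `scale_plain` and `scale_pipelined` are made up for this page, and the arrays `x` and `y` are assumed not to overlap.

```c
/* Plain loop: each iteration does load, multiply, store in sequence. */
void scale_plain(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* The same computation with the three stages of consecutive
   iterations interleaved by hand. In the steady state, each pass
   through the loop body stores the result of iteration i-2,
   multiplies for iteration i-1, and loads for iteration i. */
void scale_pipelined(int n, double a, const double *x, double *y)
{
    double loaded, product;
    int i;

    if (n <= 0)
        return;
    if (n == 1) {
        y[0] = a * x[0];
        return;
    }

    /* Prologue: fill the pipeline. */
    loaded = x[0];          /* load for iteration 0     */
    product = a * loaded;   /* multiply for iteration 0 */
    loaded = x[1];          /* load for iteration 1     */

    /* Steady state: three iterations are in flight at once. */
    for (i = 2; i < n; i++) {
        y[i - 2] = product;     /* store for iteration i-2    */
        product = a * loaded;   /* multiply for iteration i-1 */
        loaded = x[i];          /* load for iteration i       */
    }

    /* Epilogue: drain the pipeline. */
    y[n - 2] = product;
    y[n - 1] = a * loaded;
}
```

Both functions produce the same results. The point of the second form is that the store, the multiply and the load inside the loop body belong to different iterations and do not depend on each other, so they can be issued in the same clock cycle.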

In simple words, the messages are


Last updated 26-mar-95 / aake@astro.ku.dk