Optimizing for the R8000 CPU

The R8000 CPU can execute four floating point operations per clock cycle, in addition to integer and load instructions. At a clock frequency of 75 MHz, it thus has a peak performance of 300 Mflop/s. Coming close to this speed requires at least a superficial understanding of the circumstances under which the best performance is achieved.

The benchmarks below illustrate that speeds in the range 95 to 245 Mflop/s are obtainable in practice in simple loops.
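
As a rough illustration of how such figures are obtained, the sketch below times a simple loop and converts the operation count to Mflop/s. It is not one of the 14 benchmark loops: the loop (`axpy`), the function names, and the use of the C library `clock()` timer are assumptions made for this example. Only the peak figures (four operations per cycle, 75 MHz) come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Simple loop (hypothetical example): y[i] = y[i] + a*x[i].
   One multiply and one add, so 2 floating point operations per element. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Convert an operation count and an elapsed time to Mflop/s. */
double mflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds / 1.0e6;
}

/* Fraction of the R8000 peak quoted in the text:
   4 operations per cycle * 75 MHz = 300 Mflop/s. */
double fraction_of_peak(double rate_mflops)
{
    const double flops_per_cycle = 4.0;
    const double clock_mhz = 75.0;
    return rate_mflops / (flops_per_cycle * clock_mhz);
}

/* Time `repeats` sweeps of axpy over n elements and return Mflop/s.
   Returns 0.0 if the run was too short for the timer to resolve. */
double measure_axpy_mflops(size_t n, int repeats, const double *x, double *y)
{
    assert(n > 0 && repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        axpy(n, 1.0e-9, x, y);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0; /* timer too coarse: increase repeats */
    return mflops(2.0 * (double)n * (double)repeats, seconds);
}
```

A measured rate of 150 Mflop/s, for example, gives `fraction_of_peak(150.0)` equal to 0.5, that is, half of the theoretical peak.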

A step-by-step introduction to R8000 optimization, provided by SGI, is a good starting point. Further information is available in a detailed discussion of the compiler options.

A benchmark program, with 14 different loops, illustrates the effect of the options recommended in the step-by-step introduction mentioned above.

A summary of the results (for details, click on a particular line):

options                                                       Mflop/s
---------------------------------------------------------------------
-O2 -mips2                                                       26.4
-O2                                                              31.7
-O3 -lfastm                                                      78.6
-O3 -OPT:round=3 -lfastm                                        104.2
-O3 -OPT:round=3:IEEE_ar=3 -lfastm                              131.2
-O3 -OPT:round=3:IEEE_ar=3:fast_sq=ON -lfastm                   146.1
-O3 -OPT:round=3:IEEE_ar=3:fast_sq=ON -GCM:array_sp=ON -lfastm  146.4

A few words of advice, based on the benchmark results

Pay particular attention to software pipelining and cache efficiency. Software pipelining interleaves instructions from consecutive iterations of a loop. The cache is a buffer memory of a few MB with particularly fast transfer rates to the CPU.
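
To make the idea of interleaving concrete, the sketch below writes out by hand what software pipelining does conceptually. The compiler performs this transformation itself at the instruction level, so there is normally no reason to write loops this way. The function names and the particular loop are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: each iteration loads its operands,
   computes, and stores before the next iteration begins. */
void scale_add_plain(size_t n, double a, const double *x,
                     const double *y, double *z)
{
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}

/* Hand-pipelined version of the same loop (illustration only). */
void scale_add_pipelined(size_t n, double a, const double *x,
                         const double *y, double *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    if (n == 0)
        return;

    /* Prologue: start iteration 0 by loading its operands. */
    double xi = x[0];
    double yi = y[0];

    /* Steady state: the loads for iteration i+1 are issued
       alongside the arithmetic and store of iteration i. */
    for (size_t i = 0; i + 1 < n; i++) {
        double x_next = x[i + 1];
        double y_next = y[i + 1];
        z[i] = a * xi + yi;
        xi = x_next;
        yi = y_next;
    }

    /* Epilogue: finish the last iteration. */
    z[n - 1] = a * xi + yi;
}
```

Both functions produce the same result. The second merely exposes the prologue, steady state, and epilogue structure that a software-pipelined loop has.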

In simple terms, the main messages are:

Simple loops, without subroutine calls or conditional statements, run much faster after software pipelining.
Loops covering at most a few hundred thousand consecutive elements achieve the highest speeds.
Do as much work as possible on elements that are brought into the cache.
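
The last two points can be sketched as follows. The first function sweeps a large array twice. The second does the same work chunk by chunk, so that each chunk is still in the cache when it is used the second time. The chunk length (`CHUNK`, about 2 MB of double precision data) and the function names are assumptions for this example, not values taken from the benchmarks. The best chunk length depends on the actual cache size.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed chunk length: a few hundred thousand elements,
   chosen to fit in a cache of a few MB. Tune for the machine. */
#define CHUNK 262144

/* Two full sweeps: for large n, the start of x has left the
   cache again by the time the second loop reads it. */
double scale_then_sum_two_sweeps(size_t n, double a, double *x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;
}

/* Same work done chunk by chunk: both inner loops operate on
   elements that are still in the cache. The inner loops stay
   simple (no calls, no conditionals), so they remain candidates
   for software pipelining. */
double scale_then_sum_chunked(size_t n, double a, double *x)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = (n - start < CHUNK) ? n : start + CHUNK;
        for (size_t i = start; i < end; i++)
            x[i] *= a;
        for (size_t i = start; i < end; i++)
            sum += x[i] * x[i];
    }
    return sum;
}
```

Both versions add the terms in the same order, so they return the same sum. Only the memory access pattern differs.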

Last updated 26-mar-95 / aake@astro.ku.dk