Optimizing for the R8000 CPU
The R8000 CPU can execute four floating-point operations
per clock cycle, in addition to integer and load
instructions. At a clock frequency of 75 MHz, it thus has a
peak performance of 300 Mflop/s. To come close to this
speed, it is necessary to have at least a superficial
understanding of the circumstances under which the best
performance is achieved.
The benchmarks below illustrate that speeds in the range
95-245 Mflop/s are obtainable in practice in simple loops.
A step-by-step introduction to R8000 optimization, provided
by SGI, is a good starting point. Further information is in
this detailed discussion of options.
A benchmark program, with 14 different
loops, illustrates the effect of the options recommended in
the step-by-step introduction mentioned above.
A summary of the results (for details, click on a particular
line):
options                                                       Mflop/s
---------------------------------------------------------------------
-O2 -mips2                                                       26.4
-O2                                                              31.7
-O3 -lfastm                                                      78.6
-O3 -OPT:round=3 -lfastm                                        104.2
-O3 -OPT:round=3:IEEE_ar=3 -lfastm                              131.2
-O3 -OPT:round=3:IEEE_ar=3:fast_sq=ON -lfastm                   146.1
-O3 -OPT:round=3:IEEE_ar=3:fast_sq=ON -GCM:array_sp=ON -lfastm  146.4
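To put these rates in perspective, each one can be compared with the peak of 300 million floating-point operations per second (four per cycle at 75 MHz) stated in the introduction. The short C sketch below does that arithmetic. The function name `percent_of_peak` and the macro names are made up for this illustration; only the two hardware figures come from the text.

```c
#include <assert.h>

/* Figures from the text: 4 floating-point operations per cycle at 75 MHz. */
#define R8000_FLOPS_PER_CYCLE 4.0
#define R8000_CLOCK_MHZ 75.0

/* Hypothetical helper: percentage of the theoretical peak that a
   measured rate (in millions of operations per second) reaches. */
double percent_of_peak(double measured_mflops)
{
    double peak_mflops = R8000_FLOPS_PER_CYCLE * R8000_CLOCK_MHZ; /* 300 */
    assert(measured_mflops >= 0.0);
    return 100.0 * measured_mflops / peak_mflops;
}
```

For the first and last rows of the table, `percent_of_peak(26.4)` gives about 8.8 and `percent_of_peak(146.4)` about 48.8, so the full set of options takes this benchmark from under a tenth of peak to nearly half of it.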
A few words of advice, based on the benchmark results
Pay particular attention to the importance of software
pipelining and cache efficiency. Software pipelining
interleaves instructions from consecutive iterations of
a loop. The cache is a buffer memory of a few MB, with
particularly fast transfer rates between memory and CPU.
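What interleaving consecutive iterations means can be sketched at the source level. The compiler does this on machine instructions, not on C statements, so the following is only a conceptual illustration with made-up function names: the loads for iteration i+1 are issued before the arithmetic of iteration i is stored, and both versions compute the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: each iteration loads, computes, and stores in sequence. */
void scale_add_plain(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-written imitation of a software-pipelined loop (illustrative only). */
void scale_add_pipelined(size_t n, double a, const double *x, double *y)
{
    if (n == 0)
        return;
    assert(x != NULL && y != NULL);

    /* Prologue: start the first iteration by loading its operands. */
    double xi = x[0];
    double yi = y[0];

    /* Steady state: load operands for iteration i+1 while finishing i. */
    for (size_t i = 0; i + 1 < n; i++) {
        double x_next = x[i + 1];
        double y_next = y[i + 1];
        y[i] = a * xi + yi;
        xi = x_next;
        yi = y_next;
    }

    /* Epilogue: finish the last iteration, whose loads are already done. */
    y[n - 1] = a * xi + yi;
}
```

Hand-interleaving like this is not something to do in practice. The point is only to show the prologue, steady-state, and epilogue structure that overlapping iterations produces.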
In simple words, the messages are:
- Simple loops, without subroutine calls or
  conditional statements, run much faster after
  software pipelining.
- Loops covering at most a few hundred thousand
  consecutive elements obtain the largest speeds.
- Do as much work as possible on elements that are
  brought into the cache.
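One way to act on the last two points is to split a long array into blocks and finish all the work on one block before moving to the next. The C sketch below assumes this approach. The function names and the block length `BLOCK_LEN` are hypothetical, and the value chosen (65536 doubles, i.e. 512 KB) is a guess that fits a cache of a few MB, not a tuned number.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block length, for illustration only: 65536 doubles = 512 KB. */
#define BLOCK_LEN 65536

/* Two full sweeps: for large n, the start of x may have left the cache
   by the time the second sweep reaches it. */
void two_sweeps(size_t n, double *x, double scale, double shift)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= scale;
    for (size_t i = 0; i < n; i++)
        x[i] += shift;
}

/* Blocked version: both simple loops run over one block of consecutive
   elements before the next block is touched. */
void blocked_sweeps(size_t n, double *x, double scale, double shift)
{
    assert(x != NULL || n == 0);
    for (size_t start = 0; start < n; start += BLOCK_LEN) {
        /* Last block may be shorter than BLOCK_LEN. */
        size_t end = (n - start < BLOCK_LEN) ? n : start + BLOCK_LEN;
        for (size_t i = start; i < end; i++)
            x[i] *= scale;
        for (size_t i = start; i < end; i++)
            x[i] += shift;
    }
}
```

Both functions leave the same values in `x`. When the two operations can be combined into a single loop body (`x[i] = x[i] * scale + shift`), fusing them is simpler still; blocking is the more general option when they cannot.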
Last updated 26-mar-95 / aake@astro.ku.dk