However, most code that has not been written with careful attention to floating point behavior does not require precise IEEE 754 conformance. Therefore, the MIPSpro compilers provide a number of options which trade off IEEE 754 conformance against better performance of generated code. These options allow transformations of calculations specified by the source code that may not produce precisely the same floating point result, although they involve a mathematically equivalent calculation.
The principal such option is a general control over floating point accuracy and overflow/underflow exception behavior:
As an example, consider the Fortran code fragment:
    INTEGER i, n
    REAL sum, divisor, a(n)
    sum = 0.0
    DO i = 1,n
      sum = sum + a(i)/divisor
    END DO
At roundoff=0, the generated code must do the n loop iterations in order, with a divide and an add in each.
At roundoff=1, the divide can be treated like a(i)*(1.0/divisor). On the MIPS R8000, the reciprocal can be done with a recip instruction. But more importantly, the reciprocal can be calculated once before the loop is entered, reducing the loop body to a much faster multiply and add per iteration, which can be a single madd instruction on the R8000.
At roundoff=2, the loop may be reordered. The original loop takes at least 4 cycles per iteration on the R8000 (the latency of the add or madd instruction). Reordering allows the calculation of several partial sums in parallel, adding them together after loop exit. With software pipelining, a throughput of nearly 2 iterations per cycle is possible on the R8000, a factor of 8 improvement.
Consider another example:
    INTEGER i, n
    COMPLEX c(n)
    REAL r
    DO i = 1,n
      r = 0.1 * i
      c(i) = CABS ( CMPLX(r,r) )
    END DO
Mathematically, r can be calculated by initializing it to 0.0 before entering the loop and adding 0.1 on each iteration. But doing so causes significant cumulative errors because the representation of 0.1 is not exact. The complex absolute value is mathematically equal to SQRT(r*r + r*r). However, calculating it this way will cause an overflow if 2*r*r is greater than the maximum REAL value, even though a representable result can be calculated for a much wider range of values of r (at greater cost). Both of these transformations are forbidden for roundoff=2, but enabled for roundoff=3.
There are several other options which allow finer control of floating point behavior than is provided by -OPT:roundoff.
    void dbl ( int *i, float *f )
    {
      *i = *i + *i;
      *f = *f + *f;
    }
The compiler will assume that i and f point to different memory, and will produce an overlapped schedule for the two calculations.
    int i;
    void dbl ( float *f )
    {
      i = i + i;
      *f = *f + *f;
    }
The compiler will assume that f cannot point to i, and will produce an overlapped schedule for the two calculations. This option also implies the alias=typed assumption. Note that this is the default assumption for the pointers implicit in Fortran dummy arguments according to the ANSI standard.
    void dbl ( int *i, int *j )
    {
      *i = *i + *i;
      *j = *j + *j;
    }
The compiler will assume that i and j point to different memory, and will produce an overlapped schedule for the two calculations. Although this is a very dangerous option to use in general, it may produce significantly better code when used for specific well-controlled cases where it is known to be valid.
Traditional global optimizers avoid moving instructions in cases which might cause them to be executed along control flow paths where they would not have been in the original program. However, GCM will perform such motion, called "speculative code motion" because the instructions moved are executed based on speculation that they will actually prove useful. By default, GCM is very conservative in its speculations. Most of the options in the -GCM: group control the sorts of speculation which are to be allowed.
Valid speculative code motion must normally avoid moving operations which may cause runtime traps. As a result, turning off certain traps at runtime enables more motion. See the target environment option -TENV:X=n for general control over the exception environment.
This option permits speculative accesses a small distance beyond the end of an array. The compiler attempts to pad arrays to guarantee that such accesses will not cause memory faults, but it cannot always do so, so this option must be used with care.
The second form involves moving a reference like *(p+n), for a small integer n, to a block which already contains a reference to *p. The assumption involved is that if p is a valid address, p+n will be too.
Consider an example:
    ...
    if ( p->next != NULL ) {
      sum += p->next->val;
    } else {
      sum += p->final_val;
    }
    ...
If this option is set, the load of p->next->val can be moved above the if even though it is through a potentially NULL pointer, as can the load of p->final_val, which is offset by a small amount from the p->next reference.
Many important loop preparation transformations involve re-association of floating point values. See the discussion of floating point optimization above, especially the -OPT:roundoff option.
SWP must normally be careful during the initial and final iterations of a loop to not perform extra operations which might cause runtime traps. It must be similarly careful if early exits from a loop (i.e. before the initially calculated trip count is reached) are possible. Turning off certain traps at runtime can give it more flexibility, producing better schedules and/or simpler wind-up/wind-down code. See the target environment option -TENV:X=n for general control over the exception environment.
    DO i = 1,n
      a(i) = a(i-1) + 5.0
    END DO
Without back-substitution, each iteration must wait for the previous iteration's add to complete, yielding a best case II of 4 cycles per iteration on the R8000. Back-substitution can transform the loop to something equivalent to:
    DO i = 1,n
      a(i) = a(i-8) + 40.0
    END DO
With appropriate initialization, this version can achieve an effective II of nearly 0.5 cycles per iteration.
Loop bodies are also normally unrolled in preparation for SWP. This option also limits that unrolling, since loops will not be unrolled to more than n instructions in the unrolled body. Unrolling is further constrained by the unroll_times_max option described below. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)
    DO i = 1,n
      IF ( a(i) .LT. b(i) ) THEN
        c(i) = a(i)
      ELSE
        c(i) = b(i)
      END IF
    END DO
The loop body can be compiled for MIPS IV as:
    ldc1    $f0,a(i)
    ldc1    $f1,b(i)
    c.lt.s  cc,$f0,$f1
    movf.s  $f0,$f1,cc
    sdc1    $f0,c(i)
Note that there are no conditional branches in the code. This option is ON by default for MIPS IV targets only.
    DO i = 1,n
      sum = sum + a(i)
    END DO
Without interleaving, each iteration must wait for the previous iteration's add to complete, yielding a best case II of 4 cycles per iteration on the R8000. Interleaving can transform the loop to something equivalent to:
    DO i = 1,n,8
      sum1 = sum1 + a(i)
      sum2 = sum2 + a(i+1)
      sum3 = sum3 + a(i+2)
      sum4 = sum4 + a(i+3)
      sum5 = sum5 + a(i+4)
      sum6 = sum6 + a(i+5)
      sum7 = sum7 + a(i+6)
      sum8 = sum8 + a(i+7)
    END DO
    sum = sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8
This version can achieve an effective II of nearly 0.5 cycles per iteration. These transformations generally require -OPT:roundoff=2 or better.
The linker (ld) does not understand the -TARG group, so the short form must be used if linking is to be invoked as part of the compilation.
cc -64 -TARG:proc=r8000 ...
will produce MIPS III code conforming to the 64-bit ABI (and therefore executable on any MIPS III or later processor), optimized to run on the R8000.
The first such options are concerned with the shared code model.
Non-default levels should be used with great care. Disabling traps eliminates a useful debugging tool, since the problems which cause them will be detected later (often much later) in the execution of the program. In addition, many memory traps can't be avoided outright, but must be dismissed by the operating system after they occur. As a result, level 4 or 5 speculation can actually slow a program down significantly if it causes frequent traps.
Disabling traps in one module will require disabling them for the entire program. Programs which make use of level 2 or above should not attempt explicit manipulation of the hardware trap enable flags.
These options specify a maximum alignment (in bits) to be forced in allocating data. The MIPSpro compilers default to -align32 for MIPS I Fortran, to -align64 for MIPS II-IV Fortran, and to ABI alignment (up to 128 bits for long double) for C.
*(type_a *) &b
The declared type of b implies a certain minimum alignment, as does the cast to type_a; the compiler will use the maximum of the two. If n=3, the compiler also analyzes alignment, but does not trust casts. In the above example, it will assume the alignment of the declared type of b.
The current default (for the beta release) is model 3. This is expected to change to model 1 or 2 for the full release.