Discussion of Options

Floating point numbers (Fortran's REAL*n, DOUBLE PRECISION, and COMPLEX*n, and C's float, double, and long double) are inexact representations of ideal real numbers. The operations performed on them are also necessarily inexact. However, the MIPS processors conform to the IEEE 754 floating point standard, producing results as precise as possible given the constraints of the IEEE 754 representations, and the MIPSpro compilers generally preserve this conformance. (Note, however, that 128-bit floating point, i.e. Fortran's REAL*16 and C's long double, is not IEEE-compliant.)

However, most code that has not been written with careful attention to floating point behavior does not require precise IEEE 754 conformance. Therefore, the MIPSpro compilers provide a number of options which trade off IEEE 754 conformance against better performance of generated code. These options allow transformations of calculations specified by the source code that may not produce precisely the same floating point result, although they involve a mathematically equivalent calculation.

The principal such option is a general control over floating point accuracy and overflow/underflow exception behavior:

-OPT:roundoff=n
The roundoff option specifies the extent to which optimizations are allowed to affect floating point results, in terms of both accuracy and overflow/underflow behavior. The value n lies in the range 0..3, with the following meanings:
roundoff=0
Do no transformations which could affect floating point results. This is the default for optimization levels -O0 to -O2.
roundoff=1
Allow transformations with limited effects on floating point results. For roundoff, limited means that only the last bit or two of the mantissa will be affected. For overflow (underflow), it means that intermediate results of the transformed calculation may overflow within a factor of two of where the original expression might have overflowed (underflowed). Note that limited effects may be less limited when compounded by multiple transformations. For example, this option allows use of the MIPS 4 recip (reciprocal) and rsqrt (reciprocal square root) instructions, which are slightly less accurate than the divide and square root instructions but significantly faster.
roundoff=2
Allow transformations with more extensive effects on floating point results. Allow associative rearrangement, even across loop iterations, and distribution of multiplication over addition/subtraction. Disallow only transformations known to cause cumulative roundoff errors or overflow/underflow for operands in a large range of valid floating point values. Re-association can have a substantial effect on the performance of software pipelined loops by breaking recurrences. This is therefore the default for optimization level -O3.
roundoff=3
Allow any mathematically valid transformation of floating point expressions. This allows floating point induction variables in loops, even when they are known to cause cumulative roundoff errors, and fast algorithms for complex absolute value and divide, which overflow (underflow) for operands beyond the square root of the representable extremes.

As an example, consider the Fortran code fragment:

        INTEGER i, n
        REAL sum, divisor, a(n)
        sum = 0.0
        DO i = 1,n
            sum = sum + a(i)/divisor
        END DO

At roundoff=0, the generated code must do the n loop iterations in order, with a divide and an add in each.

At roundoff=1, the divide can be treated like a(i)*(1.0/divisor). On the MIPS R8000, the reciprocal can be done with a recip instruction. But more importantly, the reciprocal can be calculated once before the loop is entered, reducing the loop body to a much faster multiply and add per iteration, which can be a single madd instruction on the R8000.
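
In effect, the transformation turns the loop into something like the following source-level sketch (rinv is an illustrative name for a compiler temporary, not anything visible in the source):

C       Sketch only: rinv stands for a compiler-generated temporary.
        REAL rinv
        rinv = 1.0 / divisor
        DO i = 1,n
            sum = sum + a(i)*rinv
        END DO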

At roundoff=2, the loop may be reordered. The original loop takes at least 4 cycles per iteration on the R8000 (the latency of the add or madd instruction). Reordering allows the calculation of several partial sums in parallel, adding them together after loop exit. With software pipelining, a throughput of nearly 2 iterations per cycle is possible on the R8000, a factor of 8 improvement.
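
For illustration, the reordered loop is equivalent to keeping several partial sums (four in this hand-written sketch) and combining them after loop exit:

C       Hand-written sketch; clean-up for n not a multiple of 4 omitted.
        s1 = 0.0
        s2 = 0.0
        s3 = 0.0
        s4 = 0.0
        DO i = 1, n-3, 4
            s1 = s1 + a(i)/divisor
            s2 = s2 + a(i+1)/divisor
            s3 = s3 + a(i+2)/divisor
            s4 = s4 + a(i+3)/divisor
        END DO
        sum = s1 + s2 + s3 + s4

The four adds are mutually independent, so they can be scheduled to overlap; combined with the reciprocal transformation above, this is what lets software pipelining approach 2 iterations per cycle.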

Consider another example:


        INTEGER i,n
        COMPLEX c(n)
        REAL r
        DO i = 1,n
            r = 0.1 * i
            c(i) = CABS ( CMPLX(r,r) )
        END DO

Mathematically, r can be calculated by initializing it to 0.0 before entering the loop and adding 0.1 on each iteration. But doing so causes significant cumulative errors because the representation of 0.1 is not exact. The complex absolute value is mathematically equal to SQRT(r*r + r*r). However, calculating it this way will cause an overflow if 2*r*r is greater than the maximum REAL value, even though a representable result can be calculated for a much wider range of values of r (at greater cost). Both of these transformations are forbidden for roundoff=2, but enabled for roundoff=3.
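
For illustration, the safer (and costlier) calculation can scale by the larger magnitude before squaring. The following Fortran sketch shows the technique; safabs is a hypothetical name, not the actual library routine:

C       Illustrative only (not the actual library algorithm): computes
C       SQRT(re**2 + im**2) without intermediate overflow by scaling
C       the smaller magnitude by the larger before squaring.
        REAL FUNCTION safabs ( re, im )
        REAL re, im, a, b, q
        a = ABS(re)
        b = ABS(im)
        IF ( a .LT. b ) THEN
            q = a
            a = b
            b = q
        END IF
        IF ( a .EQ. 0.0 ) THEN
            safabs = 0.0
        ELSE
            q = b / a
            safabs = a * SQRT ( 1.0 + q*q )
        END IF
        RETURN
        END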

There are several other options which allow finer control of floating point behavior than is provided by -OPT:roundoff.

-OPT:fast_complex[=(ON|OFF)]
Enable/disable the fast algorithms for complex absolute value and division, normally enabled by roundoff=3.
-OPT:fast_sqrt[=(ON|OFF)]
Enable/disable the calculation of square root as x*rsqrt(x) for MIPS 4 and above, normally enabled by roundoff>0.
-OPT:fold_reassociate[=(ON|OFF)]
Enable/disable transformations which reassociate or distribute floating point expressions, normally enabled by roundoff>1.
-OPT:IEEE_comparisons[=ON]
Force comparisons to yield results conforming to the IEEE 754 standard for NaN and Inf operands, normally disabled. Setting this option will disable certain optimizations like assuming that a comparison x==x is always TRUE (it is FALSE if x is a NaN). It also disables optimizations which reverse the sense of a comparison, e.g. turning "x < y" into "! (x >= y)", since both "x < y" and "x >= y" may be FALSE if one of the operands is a NaN.
-TARG:madd[=(ON|OFF)]
The MIPS 4 architecture supports fused multiply-add instructions, which add the product of two operands to a third, with a single roundoff step at the end. Because the product is not separately rounded, this can produce slightly different (but more accurate) results than a separate multiply and add pair of instructions. This is normally enabled for -mips4.
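
For example, a madd computes round(a*b + c), with a single rounding, while the discrete multiply and add compute round(round(a*b) + c); whenever the product a*b is not exactly representable, the two results may differ in the last bit of the mantissa.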

Controlling Miscellaneous Optimizations

The -OPT group allows control over a variety of other optimization choices.
-OPT:space
The MIPSpro compilers normally make optimization decisions based strictly on the expected execution time effects. If code size is more important, use this option. One of its effects is to cause most subprogram exits to go through a single exit path, with a single copy of register restores, result loads, etc.
-OPT:alias=name
The compilers must normally be very conservative in optimization of memory references involving pointers (especially in C), since aliases (i.e. different ways of accessing the same memory) may be very hard to detect. This option may be used to specify that the program being compiled avoids aliasing in various ways. The choices are:
alias=any
The compiler will assume that any pair of memory references may be aliased unless it can prove otherwise (the default).
alias=typed
The compiler will assume that any pair of memory references which reference distinct types in fact reference distinct data. For example, consider the code:

            void dbl ( int *i, float *f ) {
                *i = *i + *i;
                *f = *f + *f;
            }

The compiler will assume that i and f point to different memory, and will produce an overlapped schedule for the two calculations.

alias=unnamed
The compiler will assume that pointers never point to named objects. For example, consider the code:

            int i;
            void dbl ( float *f ) {
                i = i + i;
                *f = *f + *f;
            }

The compiler will assume that f cannot point to i, and will produce an overlapped schedule for the two calculations. This option also implies the alias=typed assumption. Note that this is the default assumption for the pointers implicit in Fortran dummy arguments according to the ANSI standard.

alias=restrict
The compiler assumes a very restrictive model of aliasing, where no two pointers ever point to the same memory area. For example, consider the code:

            void dbl ( int *i, int *j ) {
                *i = *i + *i;
                *j = *j + *j;
            }

The compiler will assume that i and j point to different memory, and will produce an overlapped schedule for the two calculations. Although this is a very dangerous option to use in general, it may produce significantly better code when used for specific well-controlled cases where it is known to be valid.

The following options control loop unrolling in the MIPSpro optimizer, i.e. making multiple copies of a loop body to minimize the loop overhead or to expose more instruction parallelism. Unrolling is subject to a number of limits in the optimizer, intended to balance the runtime benefits against code expansion. These options allow the user to adjust those limits when the defaults are not the best choice; a sketch of the unrolling transformation follows the options below. Note that loops expected to be software pipelined are subject to similar options in the -SWP group.
-OPT:unroll_times_max=n
The optimizer will normally unroll loops at most 2 times (-mips4) or 4 times (-mips3), unless it can unroll them completely. This option modifies the default limit.
-OPT:unroll_size=n
The optimizer will normally unroll loops only to the extent that the resulting unrolled loop body contains at most 320 instructions. This option modifies the default limit.
-OPT:unroll_bblimit=n
The optimizer will normally unroll loops containing up to 10 basic blocks. This option modifies the default limit. (A basic block is a sequence of code between branches and labels, i.e. with a single entry at the top and a single exit at the bottom.)
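
As an illustration of the transformation (hand-written, not actual compiler output), unrolling a simple copy loop 4 times yields:

C       Hand-written sketch; clean-up code for n not a multiple of 4 omitted.
        DO i = 1, n-3, 4
            b(i)   = a(i)
            b(i+1) = a(i+1)
            b(i+2) = a(i+2)
            b(i+3) = a(i+3)
        END DO

The increment, test, and branch overhead is now paid once per four element copies, and the four independent copies give the instruction scheduler more parallelism to exploit.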

Controlling Global Code Motion

An important optimization performed by the MIPSpro compilers is called Global Code Motion (GCM). It is intended to improve the overall execution time of programs by redistributing the instructions among the basic blocks along an execution path to improve instruction parallelism and make better use of the machine resources.

Traditional global optimizers avoid moving instructions in cases which might cause them to be executed along control flow paths where they would not have been in the original program. However, GCM will perform such motion, called "speculative code motion" because the instructions moved are executed based on speculation that they will actually prove useful. By default, GCM is very conservative in its speculations. Most of the options in the -GCM: group control the sorts of speculation which are to be allowed.

Valid speculative code motion must normally avoid moving operations which may cause runtime traps. As a result, turning off certain traps at runtime enables more motion. See the target environment option -TENV:X=n for general control over the exception environment.

-GCM:aggressive_speculation[=(ON|OFF)]
GCM normally tries not to move instructions to basic blocks which are already using most of the instruction execution resources available, since doing so will likely extend the execution time of the block. This option minimizes that bias, which often helps floating point intensive code.
-GCM:array_speculation[=(ON|OFF)]
A form of speculation which is often very effective is called "bottom loading." It involves moving instructions from the top of a loop body to both the block before the loop (for the first iteration) and to the end of the loop body (which executes them at the end of one iteration so that they will be ready early for the next iteration). Doing this, however, means that the instructions will be executed one or more extra times in the last iteration(s) of the loop. If the instructions moved are loading elements of an array, this may cause extra accesses beyond the end of the array.

This option allows such accesses to extend a small distance beyond the end of the array. The compiler attempts to pad arrays to guarantee that such accesses won't cause memory faults, but it may not always be able to do so, so this option must be used with care.
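
As a source-level sketch of bottom loading applied to a summation loop (the transformation actually occurs on the generated instructions, and t is an illustrative name for a register):

C       Sketch only: t stands for a register holding the next element.
        t = a(1)
        DO i = 1, n
            sum = sum + t
            t = a(i+1)
        END DO

The load of the next element issues at the bottom of each iteration, so its latency overlaps the add; on the final iteration, however, it fetches a(n+1), one element beyond the end of the array.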

-GCM:pointer_speculation[=(ON|OFF)]
This option allows speculative motion of loads of two kinds, both involving pointer usage. The first is to allow motion of loads through pointers which may be NULL. In order to prevent this from causing memory faults, the compiler causes page zero of the address space to be mapped at runtime. Since this may prevent some truly invalid references from causing faults, this option should be avoided during debugging.

The second form involves moving a reference like *(p+n), for a small integer n, to a block which already contains a reference to *p. The assumption involved is that if p is a valid address, p+n will be too.

Consider an example:

        ...
        if ( p->next != NULL ) {
            sum += p->next->val;
        } else {
            sum += p->final_val;
        }
        ...

If this option is set, the load of p->next->val can be moved before the if (it is through a potentially NULL pointer), as can the load of p->final_val (it is offset by a small amount from the p->next reference).

-GCM:static_load_speculation[=(ON|OFF)]
Allow the speculative motion of loads from static data areas. Since such areas are known to be allocated, such motion cannot cause new faults except page faults or cache misses. As a result, it is generally safe, although it may hurt performance if it causes unnecessary page faults or cache misses.
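
For example, in the following illustrative fragment (the names are hypothetical), where x is statically allocated:

C       Illustrative fragment only.
        LOGICAL cond
        REAL x, y
        SAVE x
        IF ( cond ) y = x + 1.0

the load of x may be hoisted above the IF, since x's storage is known to exist whether or not cond is true.
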
-GCM:prepass[=(ON|OFF)]
-GCM:postpass[=(ON|OFF)]
GCM is implemented in two passes, one before instruction scheduling and register allocation, and one after, currently enabled at optimization level 2 (-O2) and above. These options allow either or both of the passes to be enabled or disabled. The two passes are complementary in effect, often improving code more in combination than the sum of their individual improvements, so enabling just one is not recommended.

Controlling Software Pipelining

Software pipelining (SWP) is an important optimization for the inner loops of programs; it can produce dramatic improvements by rearranging a loop to overlap calculations from multiple iterations. This is an iterative process: the compiler searches for an effective schedule, then for a workable allocation of registers, and retries if either step fails. Some of the options in the -SWP group control that process. Others control how the loop body is prepared for the attempt, e.g. by unrolling.

Many important loop preparation transformations involve re-association of floating point values. See the discussion of floating point optimization above, especially the -OPT:roundoff option.

SWP must normally be careful during the initial and final iterations of a loop to not perform extra operations which might cause runtime traps. It must be similarly careful if early exits from a loop (i.e. before the initially calculated trip count is reached) are possible. Turning off certain traps at runtime can give it more flexibility, producing better schedules and/or simpler wind-up/wind-down code. See the target environment option -TENV:X=n for general control over the exception environment.

-SWP:=(ON|OFF)
Enable/disable SWP (normally enabled at -O3).
-SWP:back_substitution[=(ON|OFF)]
The iteration interval (II) of a pipelined loop, i.e. the frequency at which new iterations are started, is constrained by circular data dependencies across iterations, called recurrences. This option, ON by default, allows transformations which make recurrences less severe by substituting the expression which defines a variable for the variable. For example, consider the code:

        DO i=1,n
            a(i) = a(i-1) + 5.0
        END DO

Without back-substitution, each iteration must wait for the previous iteration's add to complete, yielding a best case II of 4 cycles per iteration on the R8000. Back-substitution can transform the loop to something equivalent to:

        DO i=1,n
            a(i) = a(i-8) + 40.0
        END DO

With appropriate initialization, this version can achieve an effective II of nearly 0.5 cycles.
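
The "appropriate initialization" amounts to computing the first few elements with the original recurrence before entering the transformed loop; a hand-written sketch of the whole (the actual transformation is internal to the compiler):

C       Sketch: the wind-up loop computes a(1)..a(7) conventionally.
        DO i = 1, MIN(n,7)
            a(i) = a(i-1) + 5.0
        END DO
        DO i = 8, n
            a(i) = a(i-8) + 40.0
        END DO

After the wind-up loop, the main loop carries eight independent recurrence chains, so adds from eight iterations can be in flight at once.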

-SWP:backtracks=n
SWP often backtracks and tries again when it fails to find a workable schedule. This option controls the limit on how many times it will do so. Increasing the limit will improve its chances of success; decreasing it may improve the compilation time required.
-SWP:body_ins_count_max=n
SWP will not be attempted for loop bodies containing more than n instructions (default 100, 0 for no limit). Larger loop bodies are less likely to be successfully pipelined, and will take more compilation time in the attempt, so this is another tradeoff of (potential) code improvement vs. compile time.

Loop bodies are also normally unrolled in preparation for SWP. This also limits the unrolling, since loops will not be unrolled to more than n instructions in the unrolled body. Unrolling is also constrained by the unroll_times_max option described below. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)

-SWP:fix_recurrences[=(ON|OFF)]
This option controls both of the transformations controlled by back_substitution and interleave_reductions. See their descriptions.
-SWP:if_conversion[=(ON|OFF)]
SWP generally works much better on loop bodies without internal branches caused by conditional execution. This option causes conditional branches to be removed when possible by using conditional move instructions (MIPS IV) and equivalents. For example, consider the code:

        DO i=1,n
            IF ( a(i) .LT. b(i) ) THEN
                c(i) = a(i)
            ELSE
                c(i) = b(i)
            END IF
        END DO

The loop body can be compiled for MIPS IV as:

        lwc1    $f0,a(i)
        lwc1    $f1,b(i)
        c.lt.s  cc,$f0,$f1
        movf.s  $f0,$f1,cc
        swc1    $f0,c(i)

Note that there are no conditional branches in the code. This option is ON by default for MIPS IV targets only.

-SWP:interleave_reductions[=(ON|OFF)]
This option, ON by default, has the same motivation as back-substitution. It allows transformations which make recurrences arising from reductions less severe by interleaving multiple threads of the reduction and then piecing them together at the end of the loop. For example, consider the code to sum an array:

        DO i=1,n
            sum = sum + a(i)
        END DO

Without interleaving, each iteration must wait for the previous iteration's add to complete, yielding a best case II of 4 cycles per iteration on the R8000. Interleaving can transform the loop to something equivalent to:

        DO i=1,n,8
            sum1 = sum1 + a(i)
            sum2 = sum2 + a(i+1)
            sum3 = sum3 + a(i+2)
            sum4 = sum4 + a(i+3)
            sum5 = sum5 + a(i+4)
            sum6 = sum6 + a(i+5)
            sum7 = sum7 + a(i+6)
            sum8 = sum8 + a(i+7)
        END DO
        sum = sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8

This version can achieve an effective II of nearly 0.5 cycles. These transformations generally require -OPT:roundoff=2 or better.

-SWP:trip_count_min=n
SWP will not be attempted for loops with trip counts known to be smaller than n (default 5); where the trip count is not known at compile time, the limit is applied via a runtime test. Sometimes a large loop body can be profitably pipelined even for a smaller trip count, and this option allows the limit to be lowered in such cases.
-SWP:unroll_times_max=n
This option controls the maximum number of times inner loop bodies will be unrolled before attempting pipelining. The default is 2 for MIPS IV, and 1 for MIPS I-III. Unrolling is also constrained by the body_ins_count_max option described above. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)

Controlling the Target Architecture

There are a number of options which control the target architecture for which code is generated, most of them in the -TARG: option group.
-32 / -64
This option determines the base ABI to be assumed, either the 32-bit ABI for MIPS I/II targets, or the new 64-bit ABI for MIPS III/IV targets.
-mips[1234] or -TARG:isa=mips[1234]
This option identifies the instruction set architecture (ISA) to be used for generated code. It defaults to MIPS I for 32-bit compilations, and to MIPS III for 64-bit compilations, but a higher ISA may be specified for either at the cost of producing an executable which will not run on the earlier architectures.

The linker (ld) does not understand the -TARG group, so the short form must be used if linking is to be invoked as part of the compilation.

-TARG:madd[=(ON|OFF)]
This option enables/disables use of the multiply/add instructions for MIPS IV targets. These instructions multiply two floating point operands and then add (or subtract) a third with a single roundoff of the final result. They are therefore slightly more accurate than the usual discrete operations, and may cause results not to match baselines from other targets. This option may be used to determine whether observed differences are due to madds. ON by default for MIPS IV targets; ignored otherwise.
-TARG:processor=(r3000|r4000|r8000)
This option identifies the probable execution target of the code to be generated; the code will be scheduled for optimal execution on that target, regardless of the ABI and/or ISA selected. Thus, for example, the command:

        cc -64 -TARG:proc=r8000 ...

will produce MIPS III code conforming to the 64-bit ABI (and therefore executable on any MIPS III or above processor) which is optimized to run on the R8000.

Controlling the Target Environment

Generated code is affected by a number of assumptions about the target software environment. The options in this group tell the compiler what assumptions it can make, and sometimes what assumptions it should enforce.

The first such options are concerned with the shared code model.

-TENV:large_GOT[=(ON|OFF)]
-TENV:small_GOT[=(ON|OFF)]
Shared code and dynamic shared objects (DSOs) require the use of a global offset table (GOT) containing addresses of static data and subprograms at runtime. A dedicated register ($gp) points to the GOT at runtime, and the code can load these addresses from the GOT without being dependent on its actual virtual address. If the GOT is less than 64KB in size, those loads can all be single instructions; otherwise they require adding a constructed offset to $gp. These options choose one of those cases (default small_GOT).

As mentioned in the GCM and SWP discussions above, being able to execute instructions speculatively can make a significant difference in the quality of generated code. What instructions can be executed speculatively depends on the exception state at runtime:
-TENV:X=n
Specify the level (0 to 5, default 1) of enabled traps that will be assumed (and enforced) for purposes of performing speculative code motion. At level 0, no speculation will be done. At level 1, only safe speculative motion may be done, assuming that the IEEE 754 underflow and inexact traps are disabled. At level 2, all IEEE 754 floating point traps are disabled except divide by zero. At level 3, divide by zero traps are disabled. At level 4, memory traps may be disabled or dismissed by the operating system. At level 5, any exceptions may be disabled or ignored.

Non-default levels should be used with great care. Disabling traps eliminates a useful debugging tool, since the problems which cause them will be detected later (often much later) in the execution of the program. In addition, many memory traps can't be avoided outright, but must be dismissed by the operating system after they occur. As a result, level 4 or 5 speculation can actually slow a program down significantly if it causes frequent traps.

Disabling traps in one module will require disabling them for the entire program. Programs which make use of level 2 or above should not attempt explicit manipulation of the hardware trap enable flags.

The last set of environment options are concerned with the alignment of data:
-align8 / -align16 / -align32 / -align64
The MIPS architectures perform memory references much more efficiently if the data referenced is naturally aligned, i.e. if 4-byte objects are at 4-byte-aligned addresses, etc. By default, the compilers allocate well-aligned data, and that is a requirement of the ABI for C. However, code ported from other architectures without alignment constraints may require less restricted alignment. The ANSI Fortran standard essentially requires maximum alignment of 4 bytes (32 bits), although it is unusual for code to actually depend on this.

These options specify a maximum alignment (in bits) to be forced in allocating data. The MIPSpro compilers default to -align32 for MIPS I Fortran, to -align64 for MIPS II-IV Fortran, and to ABI alignment (up to 128 bits for long double) for C.

-TENV:align_aggregates=n
The ABI specifies that aggregates (i.e. structs and arrays) be aligned according to the strictest requirements of their components (i.e. fields or elements). Thus, an array of short ints (or a struct containing only short ints or chars) will normally be 2-byte aligned. However, some code (non-ANSI-conforming) may reference such objects assuming greater alignment, and some code (e.g. struct assignments) may be more efficient if the objects are better aligned. This option specifies that any aggregate of size at least n will be at least n-byte aligned. It does not affect the alignment of aggregates which are themselves components of larger aggregates.
-TENV:misalignment=n
This option determines the model the compiler will use in deciding whether or not to assume that the memory references it generates are well-aligned. If n=1, it assumes well-aligned references unless it can determine otherwise (except in circumstances like struct copies where the ABI alignment requirements wouldn't normally imply alignment). If n=2, it always analyzes alignment based on the actual expression used, and trusts casts in the source code to reflect actual alignment. For instance, consider the expression:

        *(type_a *) &b

The declared type of b implies a certain minimum alignment, as does the cast to type_a; the compiler will use the maximum of the two. If n=3, the compiler also analyzes alignment, but does not trust casts. In the above example, it will assume the alignment of the declared type of b.

The current default (for the beta release) is model 3. This is expected to change to model 1 or 2 for the full release.


Jim Dehnert, SGI