Overview of MIPSpro 64-Bit Compilers
Table of Contents
-----------------
Preface
1. Introducing the 64-bit MIPSpro compilers
2. Binary compatibility
3. New features with the 64-bit MIPSpro compilers
4. Differences between the 32-bit and 64-bit compilers
5. Porting implications
6. Diagnosing porting problems
7. Performance tuning code for the R8000
8. References
Preface
-------
This article is a compilation of existing documents and experience, and is
intended as an overview. After reading this article, we strongly recommend
reading the online, Insight-format MIPSpro 64-Bit Porting and Transition
Guide, available with the purchase of the IRIX Development Environment. If
the expected performance is not achieved after porting code to 64 bits,
please read section 7.
1. Introducing the 64-bit MIPSpro compilers
-------------------------------------------
To utilize the R8000 hardware technology, the compiler system was rewritten
and now supports 64-bit addresses and 64-bit arithmetic along with a powerful
optimizer to keep the hardware pipeline full.
The R8000 advanced superscalar architecture supports 4 instructions per cycle:
up to 2 load/store instructions, and up to 2 integer or 2 floating point
instructions. The R8000 implements the MIPS4 instruction set, which adds the
following new instructions: madd (multiply-add), conditional moves, indexed
FP loads and stores, and reciprocal and reciprocal square root (see the
MIPS4(5) man page).
The new 64-bit compilers, purchasable separately from IRIX 6.0, are C++, C,
FORTRAN 77, Power FORTRAN, Power C, and Assembly. There are also 32-bit
compilers (version 3.19), which can be considered bug-fixed versions of the
IRIX 5.2 32-bit compilers (version 3.18). The 32-bit compilers available with
IRIX 6.0 are the above plus Pascal and Ada. In addition, there is a CROSS64
development environment intended for users who want to develop 64-bit
applications targeting an IRIX 6.0 system but run their development tools on
an IRIX 5.2 system (or who are preparing for a future R8000 upgrade). For
more information, please see the separate article about the CROSS64
development environment.
The MIPSpro compilers use different front-ends than previous releases, a new
back-end (optimizer/code generator), and a new tail-end (linker). Note that the
back-end, under the highest optimization level, can software pipeline the code
such that inner loops are optimized to reduce the ratio of memory accesses to
arithmetic instructions. With the 64-bit compilers, software pipelining, and
the MIPS4 instruction set, maximum performance can be achieved on the R8000.
To understand the development issues in the 64-bit environment, it is
essential to read the new online Insight MIPSpro 64-Bit Porting and Transition
Guide, preferably before compiling existing code with the 64-bit compilers.
And, as always, reading the release notes for all of the compiler products is
highly recommended. For a list of references, see the last section of this
article.
2. Binary compatibility
-----------------------
Currently, IRIX 6.0 is only supported on the R8000 (Power Challenge, Power
Onyx, Power Indigo2). Backward compatibility is always guaranteed, but forward
compatibility is not (in other words, something created on a newer release is
not guaranteed to work on an older release). Backward compatibility is
maintained with IRIX 6.0 and the R8000, with the exceptions below.
Backward Compatibility
----------------------
IRIX 32-bit binaries from a previous release will work under IRIX 6.0 except
those that
1. access kernel data structures, for example via /dev/kmem or /proc. Many
structures have changed sizes. Such a program must be ported to the 64-
bit ABI.
2. use NLIST(3E); those will not work on any 64-bit .o or a.out.
A new NLIST64(3E) is supplied for 64-bit ELF.
3. use LSEEK(2); such programs are restricted to 2 GB unless changed to
   use LSEEK64(2).
4. make any assumption that the page size is 4 Kbytes (for example using
MMAP(2) and specifying the address). The page size is now 16 Kbytes;
such programs must use GETPAGESIZE(2).
5. are Ada programs which catch floating point exceptions.
6. exchange binary data with 64-bit binaries under mismatched size
assumptions. NFS works fine; 32-bit programs that share memory with
64-bit programs will work provided the programs carefully consider
data type size differences; programs using RPC(3R) + XDR(3R) should have
no problem; however, arenas (USINIT(3)) CANNOT be shared between 32-bit
and 64-bit programs. See TEST_AND_SET(3) and ABILOCK(3) for 32 <-> 64
compatible locks.
It is possible for a program to determine whether it is running on a 64-bit
capable kernel using SYSCONF(3C) and the argument _SC_KERN_POINTERS. For
kernel drivers (character, block, streams), the kernel requires MIPS3 ELF
object format code in IRIX 6.0. MIPS1 drivers CANNOT be linked in. Hence,
kernel drivers for 6.0 require a 6.0 system to build them; likewise, drivers
for 5.2 require a 5.2 system. A makefile prototype for drivers is provided
in /var/sysgen/Makefile.kernio.
For 32-bit programs executed on the R8000, performance will be no better than
on the R4400, since those programs were not compiled to utilize the R8000
architecture.
Object Compatibility
--------------------
Mixing 32-bit objects with 64-bit objects is not allowed, just as mixing ELF
and COFF format objects is not allowed, nor is mixing shared and non-shared
objects.
Furthermore:
* On an IRIX 6.0 system, one CAN link an ELF object, archive, or shared object
(.o, .a, or .so) created on IRIX 5.2 with 32-bit .o, .a or .so files created
using the 32-bit compiler on an IRIX 6.0 system. The resultant executable
will run on an IRIX 6.0 system, and likely on an IRIX 5.2 system but that is
not guaranteed.
* 32-bit .o, .a, or .so files CANNOT be linked with 64-bit .o, .a, or .so's.
* 32-bit a.outs CANNOT use 64-bit DSOs, nor can 64-bit a.outs use 32-bit DSOs.
* On an IRIX 5.2 or IRIX 6.0 system, one CAN link .o, .a, or .so files created
with the CROSS64 development environment under IRIX 5.2, with 64-bit .o, .a
or .so files created in IRIX 6.0, since the CROSS64 environment is
essentially the 64-bit environment but installed in /usr/cross64, and the
compilers are 32-bit executables, so they run on IRIX 5.2 systems.
* No 64-bit binaries will execute on current IRIX 5.x systems.
* It is likely that a future IRIX release for R4400 Challenge/Onyx systems
will support 64-bit MIPS3 ELF binaries. So, providing a MIPS3 version of
an application's .o, .a, .so, or executable files will allow future
development on those systems, where customer code can be combined with the
application's objects or libraries, when the application needs a large data
space (more than a 32-bit address can hold) and the MIPS 64-bit ABI.
MIPS instruction sets
---------------------
As background information, the four MIPS instruction sets and the SGI hardware
that supports them are as follows:
1. the MIPS1 instruction set is supported by the R2000/R3000 and all
subsequent processors. This CPU is found in Indigo, PI, PowerSeries
systems (none of these are in current production). By default, 32-bit
compilation uses the MIPS1 instruction set.
2. the MIPS2 instruction set is supported by the R4000 and all subsequent
processors. This CPU is found in Indigo, Indigo2, Indy, Crimson,
Challenge, and Onyx systems.
3. the MIPS3 instruction set is supported by the R4400 and all subsequent
processors. This CPU is found in Indy, Indigo2, Crimson, Challenge, and
Onyx systems. Operating system support is also required; at this point no
IRIX 5.x release provides it, just IRIX 6.0.
4. the MIPS4 instruction set is supported by the R8000 processor. This CPU
is found in Power Challenge/Onyx/Indigo2 systems. On the R8000, the MIPS4
instruction set provides peak floating point performance that the other
instruction sets cannot achieve. By default on the R8000 under IRIX 6.0,
the 64-bit compilers and the MIPS4 instruction set are used.
The MIPS1 and MIPS2 sets provide 32-bit virtual addressing whereas MIPS3 and
MIPS4 provide 64-bit virtual addressing since pointers (and longs) are 64 bits.
3. New features with the 64-bit MIPSpro compilers
-------------------------------------------------
LP64 model
----------
With the MIPSpro compilers, LP64 means that longs and pointers are 64 bits
(8 bytes). The table below shows the C and FORTRAN data type sizes, in bits.
C type           32-bit   LP64    FORTRAN type
char                8        8    CHARACTER
short int          16       16    INTEGER*2
int                32       32    INTEGER
long int           32       64    none
long long int      64       64    INTEGER*8
pointer            32       64    pointer
float              32       32    REAL
double             64       64    REAL*8
long double        64      128    REAL*16
Note that FORTRAN types already reflect the LP64 model (the pointer type is
an extension to ANSI FORTRAN 77).
16 byte arithmetic
------------------
Long double (16 byte) arithmetic is supported by the MIPSpro compilers,
using the ANSI C standard syntax. The 6.0 release introduces routines in
existing libraries which do QUAD precision floating point calculations.
Most of the long double math routines are named by prefixing the letter
'q' to the double precision routine's name: for example, qsin is the long
double version of sin (see the TRIG(3M) man page).
The representation used is not IEEE compliant; long doubles are represented
on this system as the sum or difference of two doubles, normalized so that
the smaller double is <= .5 units in the last place of the larger. This is
equivalent to a 107 bit mantissa with an 11 bit biased exponent (bias=1023),
and 1 sign bit. In terms of decimal precision, this is approximately 34
decimal digits. See the MATH(3M) man page and the 6.0 compiler_dev release
notes, end of chapter 3.
FORTRAN 77 Compiler
-------------------
The new FORTRAN compiler implements REAL*16 and COMPLEX*32 and all associated
intrinsics as 16 byte floating point entities. The 32-bit FORTRAN compiler
had recognized the types but converted them to REAL*8 and COMPLEX*16,
respectively.
%LOC now returns an 8 byte address and %VAL now passes an 8 byte value.
The 64-bit FORTRAN MP I/O library has been enhanced to allow I/O from parallel
regions. In other words, multiple threads can read and write to different
FORTRAN logical units as well as read and write to the same logical unit. The
latter case will of course encounter normal overhead due to file locking.
The 64-bit compiler release provides a unified run-time library for parallel C
and parallel FORTRAN programs (-lmp). This unified library allows parallel C
to be mixed with parallel FORTRAN programs under a single master.
C Compiler
----------
The C compiler has improved diagnostics and is generally stricter in
enforcing ANSI C rules. Many of the options are quite different, such as -O3.
Please see the c_dev release notes for details.
Under the 64-bit compiler, warning messages start with the string "!!!".
Error messages start with the string "###". This allows easier searches
in log files.
C++ Compiler
------------
CC -64 and NCC are both native compilers that are based on the same front end.
The CC -64 front end is fecc, which has 64-bit pointers, addresses, and long
ints. The NCC front end is edgpcfe, a front end with 32-bit pointers,
addresses, and long ints. To invoke the old 32-bit cfront translator, use
"CC -32 -use_cfront". To invoke the 32-bit native C++ compiler, NCC, instead
of the translator, use "CC -32".
SGI continues to ship the old translator as a way to allow a phased migration
to the new compiler. In a future C++ release, the translator will be removed.
It is now possible to make dynamic shared objects (DSOs) from C++ object
files, even when run-time initialization is required. To make a DSO instead
of an executable file, use the -shared option on the CC command line.
The C++ 6.0 compiler implements the C++ language as described in The Annotated
C++ Reference Manual (Margaret Ellis and Bjarne Stroustrup, Addison-Wesley
1990), without, however, the exception handling feature described in Chapter
15 of that book. Those constructs formerly unimplemented by cfront are now
implemented with the native C++ compilers. Please see the online C++
Programmer's Guide and c++_dev release notes for more information.
Compiler tools
--------------
The 6.0 DBX(1) has been re-implemented and supports a new debugging format
called DWARF. DWARF is a format for the debugging information generated by the
compiler, assembler, and linker that is necessary for source-level debugging.
See the DWARF(4) and DWARFDUMP(1) man pages; dwarfdump is a new tool to dump
the debug info of an ELF object.
Pixstats functionality has been integrated into prof. Makefiles or scripts
that invoke pixstats will have to be changed.
Cord is available for 32-bit objects only.
The compiler tools understand 32-bit and 64-bit ELF objects, and for the
most part understand COFF (IRIX 4.x) objects (dbx does, dis does not).
64-bit interface register convention
------------------------------------
Register     Software Name   Use                  Who saves
--------     -------------   ---                  ---------
$0           zero            Hardware zero        -
$1           at              Assembly temp        caller
$2..$3       v0..v1          Function results     caller
$4..$11      a0..a7          Function arguments   caller
$12..$15     t0..t3          Temps                -
$16..$23     s0..s7          Saved                callee
$24          t8              Temp                 caller
$25          t9              Temp                 caller (& callee in PIC)
$26..$27     k0..k1          Kernel temps         -
$28          gp              Global pointer       callee
$29          sp              Stack pointer        callee
$30          s8              Frame pointer        callee
$31          ra              Return address       caller
hi, lo                       Multiply/Divide      caller
$f0, $f2                     FP func results      caller
$f1, $f3                     FP temps             caller
$f4..$f11                    FP temps             caller
$f12..$f19                   FP arguments         caller
$f20..$f23                   FP temps             caller
$f24..$f31                   FP saved             callee
Note that "caller-saved" means only that the caller may not assume that the
value in the register is preserved across the call.
64-bit subprogram interface
---------------------------
At most, a total of eight floating point registers ($f12..$f19) may be used
to pass FP arguments, or up to eight integer registers ($4..$11) may be used
to pass integer arguments. For example, where d1..d3 are double precision FP
arguments, s1..s3 are single precision FP arguments, and n1..n3 are integer
arguments: if d1,d2,d3,s1,s2,s3,n1,n2,n3 are in the argument list, then the
register and stack assignments are $f12,$f13,$f14,$f15,$f16,$f17,$10,$11,stack.
Another example: n1,n2,d1 are assigned to $4,$5,$f14.
Whenever possible, floating point arguments are passed in floating point
registers regardless of whether they are preceded by integer parameters.
[The 32-bit ABI allows only leading floating point arguments to be passed
in FP registers; those coming after integer arguments must be moved to
integer registers.]
Variable argument routines require an exception to this rule. Any floating
point parameters in the variable part of the argument list are passed in
integer registers. There are several important cases involved:
- If a varargs prototype (or the actual callee definition) is available to
  the caller, the caller places floating point parameters directly in the
  required integer registers, and there are no problems.
- If no prototype is available to the caller for a direct call, then the
caller's parameter profile is provided in the object file (as are all
global subprogram formal parameter profiles), and the linker (ld/rld)
generates a diagnostic message if the linked entry point turns out to
be a varargs routine.
- If no prototype is available to the caller for an indirect call (that
is, via a function pointer), then the caller assumes that the callee
is not a varargs routine and places floating point parameters in FP
registers. If the callee is varargs, the code is not ANSI-conformant.
4. Differences between the 32-bit and 64-bit compilers
------------------------------------------------------
As previously stated, there are two types of compilers available with IRIX
6.0: the 32-bit (ucode) compilers, which are bug-fixed versions of the
compilers available under IRIX 5.2, and the 64-bit compilers. Using the
-32 switch on the compiler command line invokes the 32-bit compiler
stages; using -64 invokes the 64-bit compiler stages.
*The default switches on Power Challenge/Onyx/Indigo2 systems (R8000) are
"-64 -mips4".* The 64-bit compilers are used and code is generated using
the MIPS4 instruction set. If you use the 32-bit compilers often, setting
the environment variable SGI_ABI to 32 saves typing -32 on each compile line.
Simple 64-bit compiler stage flow is cc->fec->be->ld instead of the 32-bit
cc->cfe->ugen->as1->ld, or for FORTRAN f77->cpp->fef77->be->ld instead of
the 32-bit flow f77->cpp->fcom->ugen->as1->ld.
Components of the MIPSpro 64-bit compilers

MIPSpro    Ucode                              MIPSpro    Ucode
64-bit     32-bit                             64-bit     32-bit
FORTRAN 77           Role performed           C
------------------   --------------------     -------------------
f77        f77       driver                   cc         cc
cpp        cpp       preprocessor             fec        cfe
fef77      fcom      front-end                fec        cfe
fef77      fopt      scalar optimizer         copt       copt
fef77      fcom      interprets parallel      mpc        accom_mp
                     directives
fef77p     pfa       automatic parallel       pca        pca
                     accelerator
be         ugen&as1  back-end                 be         ugen&as1
ld         ld        linker                   ld         ld
Although they are two separate and different compiler systems, the 32-bit
and 64-bit compilers have similar command line interfaces. Please see the
CC(1) and F77(1) man pages for a list of supported options; the MIPSpro
Porting and Transition Guide also summarizes the differences in switches.
32-bit compiler flags that the 64-bit compiler does not support:
-32 by definition
-mips1 generate code using MIPS1 instruction set (the default)
-mips2 generate code using MIPS2 instruction set
64-bit compiler flags that the 32-bit compiler does not support:
-64 by definition
-mips3 generate code using MIPS3 instruction set
-mips4 generate code using MIPS4 instruction set (default on R8000)
-help print list of possible options
Switches accepted by both compilers but having different semantics:
-v -show -show gives the compiler stage flow for 64-bit; -v for ucode
-woff turn off named warnings, but the warning numbers are different
between 32-bit and 64-bit compilers.
-Wc,arg where c designates to which pass of the compiler the argument
is going to be passed. Since the compiler stages changed,
the choices for c are different.
FORTRAN 77 compiler differences
-------------------------------
A major part of the front-end is KAP, the Kuck and Associates Preprocessor.
KAP is an optimizer that analyzes data dependence to guide serial
optimizations and automatic parallelization. KAP performs scalar
optimizations such as outer loop unrolling and pulling invariants out of
loops. Unlike with the ucode compiler, -O2 performs BOTH scalar and
back-end optimizations.
Other differences with the FORTRAN 77 front-end, fef77, are as follows:
- fef77 allows empty arguments in subroutine/function calls which result
in zeroes passed by value.
- fef77 implements the FORTRAN 8x (subset of FORTRAN 90) array syntax
- fef77 and fcom differ in how they fold REAL*4 constants: fcom internally
  promotes them to REAL*8, whereas fef77 adheres to the ANSI standard.
- fef77 allows fewer constant expressions in PARAMETER statements than fcom.
- fcom allows lines longer than 256 characters by default; fef77 currently
  has a hard limit of 256.
C compiler differences
----------------------
Some compiler switches that are no longer supported under 6.0 MIPSpro are
-wlint (use lint instead), any options that have to do with ucode like -j,
-Olimit, -Xcpluscomm, -varargs, and many others.
Because of the LP64 model, data type size differences and alignment can
cause internal padding in structures (likewise for FORTRAN common blocks).
For example:
struct t { char c1; char c2; short s; long l; } t1;

32-bit (word alignment for l, sizeof(struct t) is 8):

byte:   0    1    2    3    4    5    6    7
      +----+----+---------+-------------------+
      | c1 | c2 |    s    |         l         |
      +----+----+---------+-------------------+

64-bit (doubleword alignment for l, sizeof(struct t) is 16):

byte:   0    1    2    3    4    5    6    7
      +----+----+---------+-------------------+
      | c1 | c2 |    s    |        pad        |
      +----+----+---------+-------------------+
      |                   l                   |
      +----+----+---------+-------------------+
Because of alignment and padding, programs should use the heuristic of
placing the largest data types at the beginning of the struct or common
block (and the smallest at the end).
C++ compiler differences
------------------------
In some cases, the Silicon Graphics native C++ compilers are not backwards-
compatible with cfront because cfront has defects, behaves in a non-
deterministic manner, or fails to adhere to the standard. Also, there is a
new, incompatible mechanism for handling C++ templates.
The native compilers provide much tighter error checking than cfront.
Please see the online C++ Programmer's Guide and c++_dev release notes for
more information.
Back-end differences
--------------------
With the 64-bit compiler, the back-end is the code generator and optimizer
rolled into one stage, "be". This is where the software pipelining happens.
With the MIPSpro compilers, -O3 means software pipelining, whereas with ucode
it means intra-procedural optimizations. A variety of back-end optimization
flags are available; many are activated based on the -O level specified, but
some require semantic knowledge of the code and therefore can only be
activated by *the user*. For details on these options, please consult the
MIPSpro 64-bit Porting and Transition Guide and the compiler driver man
pages (CC(1), F77(1)).
Library structure differences
-----------------------------
There is also a new library structure with IRIX 6.0; the drivers (cc, f77)
and rld know where to find the matching MIPS instruction set libraries.
The location of the shared objects is part of the ABI; MIPS1/2 32-bit
programs have a different ABI than MIPS3/4 64-bit programs. The native
development library directory structure is the same for MIPS1/2, but for
MIPS3/4 libraries the shared objects are located in /usr/lib64/mips3 and
/usr/lib64/mips4, and the compiler stages are located in /usr/lib64/cmplrs.
5. Porting implications
-----------------------
C implications
--------------
Within source code, most porting problems will arise from assumptions, implicit
or explicit, about either absolute or relative sizes of the int, long int, or
pointer types. The most common classical assumptions are likely to be:
* size assumptions
sizeof (int) == sizeof (void *)
sizeof (int) == sizeof (long)
sizeof (long) == 4
sizeof (void *) == 4
sizeof (long) == sizeof (float) [long gets narrowed in mixed expressions]
Use -fullwarn, or -wlint with -32 (lint with -64), to expose size issues
* sign extension possible
int *p, i;
p = (int *) i;
This generates "warning(1412): source type of cast is too small to hold
all pointers: sign extension possible". Dereferencing the pointer will
cause a bus error and core dump at run-time, since a stack address looks
like 0xffffffae50, not 0xffffffffffffae50. Notice that the virtual
address is 40 bits (10 hex digits) with MIPS3/4; a user process can create
a virtual address space of up to 1 terabyte (2^40) in size, provided memory
plus swap is larger.
* truncation possible
unsigned i, *p;
i = (unsigned) p;
This generates a "warning(1411): destination type of cast is too small to
hold all pointers: truncation possible" because a 4 byte data type is not
big enough to hold an 8 byte data type.
* format strings in printf or scanf
%d and %ld differ when compiled -64
Using %d to print an 8-byte value prints only the low-order 4 bytes; use %ld.
* constants - hex constants are not sign extended
long x;
... ( (long) ( x + 0xffffffff ) ) ...
With -32 this evaluates to (x-1); with -64, however, it is (x+4294967295).
* use prototypes, especially when varargs are involved
printf("%f",float_val);
Recall the subprogram interface change mentioned in section 3: varargs
floating point arguments must be placed in integer registers, so failing to
prototype the call above would result in the wrong value being printed.
Watch for compiler/linker warning messages in order to spot these instances.
FORTRAN implications
--------------------
The FORTRAN compiler has no data size changes, as its types have specific bit
sizes (REAL still implies 4 bytes), so standard ANSI FORTRAN 77 code should
have no problems. For FORTRAN code that interfaces to C, care needs to be
taken since arguments are passed by reference and pointers are now 8 bytes.
Also, %loc now returns 64-bit addresses and %val passes 64-bit values.
If the C code used ints to contain the pointers then a change is needed.
Example: FORTRAN calling C
FORTRAN C
------- ---
integer i,j foo_(int *i, int *j) or,
call foo(i,j) foo_(long i, long j) less preferable
FORTRAN subprograms called by C where long int arguments are passed (by
address) may need to change argument declarations.
Example: C calling FORTRAN
C FORTRAN
--- -------
long l1, l2; subroutine foo( i, j )
foo_( &l1, &l2 ); #if (_MIPS_SZLONG==64)
integer*8 i, j
#else
integer*4 i, j
#endif
FORTRAN arguments passed by %VAL calls to C routines should be declared as
long ints in the C routines.
Example: FORTRAN arguments passed by %VAL
FORTRAN C
------- ---
call foo(%VAL(i))              foo_( long i )
FORTRAN code that uses %LOC may need to be changed to store an 8 byte address.
Example: FORTRAN use of %LOC in -64
common // heap
#if (_MIPS_SZPTR==64)
integer*8 haddress
#else
integer*4 haddress
#endif
haddress = %loc(heap)
Coding variable size issues
---------------------------
Typically one would not want to maintain two copies of source code, but rather
maintain a single source with more complex makefiles to generate two outputs,
a 32-bit binary and a 64-bit binary. Care must be taken to create only one
version of every header file. Use typedef'd types for fields which are to be
of a constant size and for those to be of a natural size. For example, off_t
and size_t are 64 bits when compiled -64, while the same types compiled -32
remain 32 bits in size. See /usr/include/inttypes.h.
A less preferable alternative is to use #if for 32 vs 64, as shown in the above
FORTRAN examples. See the porting guide for a full list of compiler predefined
variables, like _MIPS_SZPTR, for MIPS1 and MIPS4 executables.
6. Diagnosing porting problems
------------------------------
This section describes possible causes for code that runs differently and/or
incorrectly under IRIX 6.0 than under IRIX 5.2. For code that is not achieving
expected performance, see the next section, titled "Performance tuning code
for the R8000".
32-bit program gets different answer on R8000
---------------------------------------------
The first case is a 32-bit program that gives different floating point results
when executed on an R8000 as compared to an R3000 or R4000 CPU. Two possible
reasons are algorithm changes in the 32-bit DSOs provided with the MIPSpro 6.0
environment, and a hardware change in the handling of floating point
exceptions.
A. Algorithm change in a needed library
To determine if this affects the code, try copying from a 5.2 system the
needed 32-bit DSOs (use "elfdump -Dl a.out" to see what is needed) and at
run-time, link with those DSOs (by using the LD_LIBRARY_PATH environment
variable, see RLD(1) ). The 32-bit DSOs provided in IRIX 6.0 are similar
to those provided in IRIX 5.3.
B. Hardware change in handling very tiny floating point numbers (fpmode)
To determine if this is affecting the code, try toggling fpmodes and re-running
the program via fpmode, "fpmode precise ". There is excellent
documentation of the two FP modes, performance and precise, in chapter five of
the MIPSpro 64-Bit Porting and Transition Guide; see also the FPMODE(1)
man page. Essentially, the MIPS Floating-Point Architecture has been extended
to improve performance significantly for those programs that do not care about
denormalized numbers generated by their code. Denormalized numbers are very
tiny numbers, less than 2^-126 (~10^-38) in single precision and less than
2^-1022 (~10^-308) in double precision. On R4x00 systems, these raised
underflow exceptions that were trapped by the kernel; they are now flushed
to zero in hardware in performance floating point mode.
In many cases, the application performs correctly if all the denormalized
intermediate results were rounded to zero. However, if the application truly
requires representation of denormalized numbers in order to perform correctly,
then use "fpmode precise" (or likewise from a program use SYSSGI(2)). Precise
exception mode fully complies with the IEEE Standard and is compatible in every
way with the preexisting MIPS floating-point architecture.
One final reason to use "fpmode precise" is when debugging a program that
generates floating point exceptions and needs the floating point signal handler
to have the right state information, such as the instruction that caused the
exception. With "fpmode performance", FP instructions can be executed out of
order and exceptions trapped imprecisely so the floating point signal handlers
may not work as expected.
It should be emphasized that running in performance mode does not affect those
applications which do not cause floating point exceptions.
64-bit program gets different answer than its 32-bit counterpart
----------------------------------------------------------------
In addition to the reasons above (library algorithm changes and FP performance
mode), reasons why a 64-bit program can get different answers than the 32-bit
version of the same source base are the MIPS4 madd instructions, additional
library accuracy for 16 byte arithmetic, operation reductions by optimizations,
reassociation of operations by optimizations, or an unstable user algorithm.
A. MIPS4 madd instructions
The intermediate result of the multiply-add/subtract instructions is calculated
to infinite precision and is not rounded prior to the addition or subtraction.
The result is then rounded according to the rounding mode specified by the
instruction. This can yield slightly different calculations than a multiply
instruction (which is rounded) and an add instruction (which is rounded again).
B. Additional accuracy in math library for 16 byte arithmetic
The MIPS3/4 math library, -lm, contains routines newly implemented (1994) using
algorithms which take advantage of the MIPS architecture. FORTRAN code that
uses REAL*16 or C code that uses long double, and makes math function calls,
could get different run-time results due to additional accuracy that the QUAD
routines offer.
C. Operation reductions by optimizations (IEEE arithmetic non-conformance)
The extent to which optimizations must preserve IEEE floating point arithmetic
is controlled by the -OPT:IEEE_arithmetic option. In this FORTRAN do loop,
DO i = 1,1000
sum = sum + a(i)/divisor
END DO
at -OPT:IEEE_arithmetic=1, the generated code must do all the loop iterations
in order, with a divide and an add in each. Using -OPT:IEEE_arithmetic=3, the
divide can be treated like a(i)*(1.0/divisor). On the R8000, the reciprocal
can be calculated with a recip instruction before the loop is entered, reducing
the loop body to a much faster multiply-add (madd) per iteration. Note that
IEEE arithmetic conformance is the default (=1), so this flag can only cause
the 64-bit program to give different floating point results when
-OPT:IEEE_arithmetic=2 or 3 is set.
D. Reassociation of operations by optimizations (causing roundoff errors)
For 64-bit programs compiled with -O3 optimization, cumulative roundoff errors
may occur due to associative rearrangement (even across loop iterations) and
distribution of multiplication over addition/subtraction. Note that with -O3
the default is -OPT:roundoff=2 (out of a possible 3); -O0 through -O2 default
to -OPT:roundoff=0. To see if roundoff is affecting the program, try the
"-mips4 -O3 -OPT:roundoff=0" flags.
Also, the KAP loop analyzer is on by default with -O3 and may perform
optimizations that introduce roundoff error. To turn these off, recompile
with "-WK,-o=0,-r=0,-so=0"; if using -pfa to parallelize the code, just use
"-WK,-r=0", since KAP also performs the parallel optimizations. KAP flags
affect the front-end compilation stage, and OPT options affect the back-end.
E. Unstable user algorithm
It is not good programming practice to test computed floating point values
for equality. Rather than testing
IF ( x .EQ. y )
it is preferred to test
IF ( abs ( abs(x) - abs(y) ) < eps )
where eps is an appropriately picked delta.
Such an algorithm may give different results even with -OPT:roundoff=0 because
of the nature of floating point representation.
Recall that the 64-bit FORTRAN front-end differs from the 32-bit compiler in
constant folding: REAL*4 constants are no longer internally promoted to
REAL*8.

      program test
      real r
      r = 3.14 / 3
      write (6,10) r
 10   format (f17.10)
      end

% f77 -32 cf.f          % f77 -64 cf.f
% a.out                 % a.out
   1.0466666222            1.0466667414
As long as the user's algorithm doesn't depend on so many digits of precision,
this change (required for FORTRAN ANSI standard adherence) won't be an issue.
Isolating parallel processing (MP) problems
-------------------------------------------
SGI recommends first getting the application working with no parallelization at
the highest optimization level. When testing the parallel version, first run
it with only 1 thread (either on a 1 cpu machine or by setting the environment
variable MP_SET_NUMTHREADS to 1). If there is time, go back down to -g for
the first MP test, run that with 1 thread and multiple threads, and then go up
the optimization scale, testing one thread and then testing multiple threads.
This follows the general principle of changing only one thing at a time. This
methodology of incremental iterations will most likely help quickly narrow down
the problem.
7. Performance tuning code for the R8000
----------------------------------------
The R8000 architecture is a big performance advantage for code with floating
point loops having large iteration counts. The R8000 performance story is
all about software pipelining (SWP) and the MIPS4 instruction set. Keeping
the pipeline full with 4 instructions per cycle, unrolling inner loops to
eliminate redundant loads, reducing the ratio of loads to madd (multiply-add)
instructions per loop iteration, reducing the ratio of hardware cycle count per
loop iteration, and making use of the many registers, are all techniques that
lead to peak performance for loop intensive programs, especially floating
point loop intensive programs.
More about software pipelining
------------------------------
A. Definition
SWP mixes operations from different loop iterations in each iteration of the
hardware loop so that the pipeline is kept full (up to 4 instructions per
cycle).
B. How software pipelined code looks and how it works
A simple DAXPY loop (double precision a times x plus y), shown below, can be
coded as two load instructions followed by a madd instruction and a store.

      DO i = 1, n                  0: ldc1(x)  ldc1(y)  madd
        y(i) = y(i) + a*x(i)       1:
      END DO                       2:
                                   3:
                                   4: sdc1
However, there is a three cycle delay before the results of the madd can be
stored. So, to keep the interim cycles filled the loop can be unrolled and
rewritten such that operations from different loop iterations can be mixed
in each iteration of the hardware loop. This can look like the following:
Windup:
1: t1 = ldc1 t2 = ldc1 t7 = madd t1 + t2
2: t4 = ldc1 t5 = ldc1 t8 = madd t4 + t5
L1:
1: t1 = ldc1 t2 = ldc1 t3 = madd t1 + t2
2: t4 = ldc1 t5 = ldc1 t6 = madd t4 + t5
3: sdc1 t7 sdc1 t8 beq compensation1
4: t1 = ldc1 t2 = ldc1 t7 = madd t1 + t2
5: t4 = ldc1 t5 = ldc1 t8 = madd t4 + t5
6: sdc1 t3 sdc1 t6 bne L1
Winddown:
1: sdc1 t7 sdc1 t8 br ALLDONE
compensation1:
1: t7 = t3 t8 = t6
2: br Winddown
ALLDONE:
In the above example, there are two loop replications (1-3 and 4-6). Note that
every loop replication completes 2 loop iterations in 3 cycles, instead of one
loop iteration in five cycles. The stores in this loop are storing the madd
results from previous iterations. But, in general any operations from any
number of different iterations can be mixed. In order to properly prepare for
entry into such a loop, a Windup section of code is added to set up registers
for the first stores in the main loop. In order to exit the loop properly, a
Winddown section is added to perform the final stores. Any preparation of
registers needed for the Winddown section is done in the compensation section.
C. Parallelization (MP) tradeoffs
Software pipelining is performed only on inner loops. Outer loop unrolling is
performed by KAP (see -sopt in F77(1), or for C see the PCA man page). Because
of SWP, it is more important for the R8000 to have inner loops with large
iteration counts, as compared to the R4400. This may affect the strategy for
determining the optimal nesting of loops. When tuning for both single proc-
essor performance and multiprocessing, there may exist tradeoffs between plac-
ing the largest iteration count on the inside for best SWP or on the outside
to provide the greatest MP opportunity.
64-bit code not performing as well as expected
----------------------------------------------
A. SWP failed
Some programs, when compiled -64, will not perform better than on the R4400.
However, loop intensive programs, especially ones with SAXPY or DAXPY loops,
will run faster. If the contrary is observed, possible reasons are as follows:
* not compiling with -O3. Software pipelining is activated by -O3.
* SWP works only on inner loops; inner loops with subroutine calls
(including many intrinsics) or branches will not software pipeline.
* loops with many lines of code generally will not SWP because of not
enough available registers; unrolling uses a lot of registers.
* KAP and SWP conflicts: KAP may fuse loops causing the resulting loop
to not pipeline well (try using "-WK,-nofuse"); KAP may unroll inner
loops which may be SWP'ed; when hand tuning code for the SWPer, one
may need to turn KAP off with "-WK,-o=1,-r=0".
Just how well did the software pipeliner do? Check for statistics in the .s
file that gets created with the -S option. This is an annotated assembly code
file that also denotes the sections such as Windup and Winddown. The .s file
reflects the actual order of instructions unlike assembly files produced by
previous releases. If a loop can be hand software pipelined into a better
schedule of operations, then it is considered a bug.
One may be able to decrease cycle counts by making changes like splitting loops
that do not pipeline into smaller loops that do, and adding the Cray directive
"cdir$ ivdep" when SWP complains about possible recurrences but there really
are no data dependencies. It may also help to change the number of loop
unrollings that SWP performs. The default is 2, so -SWP:unroll_times_max=4
and -SWP:unroll_times_max=1 will show whether more or less unrolling,
respectively, is beneficial.
B. Bellow stalls
A bellow stall is the scenario where two loads, or one load and one store,
issued in the same cycle both operate on the same memory bank. When this
occurs, one of the accesses is delayed by one cycle in the address bellow
register. This means that in the worst case only one of the (at most) two
memory accesses per cycle completes, and the software pipeliner's efficiency
is reduced. The compiler cannot always guarantee even/odd access. Some
techniques to work around possible bellow stalls are as follows:
* padding common blocks; try trial and error or verify that consecutive
accesses go to alternate banks of memory.
* not declaring array sizes as a power of 2 (Budnick and Kuck, '71)
This can affect loops containing double precision arrays.
* interleaving iterations such as:
real a, x(100000), y(100000)
do i = 1, nn, 4
y(i) = y(i) + a*x(i)
y(i+2) = y(i+2) + a*x(i+2)
y(i+1) = y(i+1) + a*x(i+1)
y(i+3) = y(i+3) + a*x(i+3)
end do
instead of
do i = 1, nn
y(i) = y(i) + a*x(i)
end do
To see if bellow stalls are affecting the program (assuming the floating point
intensive loops did pipeline well), use prof to find the number of
instructions executed in the loop (assuming it is large enough to warrant
investigation), and divide this by the execution time for the loop; how close
did the loop come to the expected performance?
C. Parallel speed-up (running on increasing # of CPUs)
One may find that the Power Onyx/Challenge does not exhibit more MP speed-up
than the Onyx/Challenge for the same parallelization strategy. This is because
the R8000 has significantly sped up the calculations in the parallel region,
but the overhead of communication between processors, being a memory operation,
is the same as on Onyx/Challenge. This effect would be greater for fine-grained
parallel regions (short MP tasks) and should be negligible for coarse-grained
parallel regions.
The primary determining factor in the MP overhead is the cost of doing a bus
transaction, measured in terms of cpu instructions (i.e. how much work has to
be given up in order to do the bus transaction). The bus transaction time has
been going down, but not nearly at the rate that CPUs have been getting faster.
So, although the *absolute* cost of a bus transaction on an Onyx is a little
less than say on a 4D440, the *relative* cost is much higher, thus the MP
overhead is worse and the speed-up numbers are lower.
By the same token, note that *absolute* times for applications have gone down
(they are running faster), but the *relative* times (the speed-up) are worse.
The R8000 departs somewhat from this trend because there is a small amount of
hardware assist to reduce the MP overhead. Despite this, many applications
will still see worse *speed-up* numbers on the R8000, because the cpu is so
fast (i.e. because the *absolute* numbers have gone down so much).
8. References
-------------
Online Insight manuals that have been changed for the IRIX 6.0 Development
Option (IDO) and MIPSpro compilers are the following:
MIPSpro 64-Bit Porting and Transition Guide
MIPSpro Compiling, Debugging and Performance Transition Guide
MIPSpro Assembly Language Programming Guide (MIPSpro 6.0.1)
MIPSpro FORTRAN 77 Programming Guide
MIPSpro FORTRAN 77 Language Reference Manual
MIPSpro Power FORTRAN Programmer's Guide
The OpenGL Programming Guide (Addison-Wesley Publishing Company)
The OpenGL Reference Manual (Addison-Wesley Publishing Company)
The OpenGL Porting Guide
Indigo Magic Desktop Integration Guide
MIPSpro compiler release notes that are supplied:
c_dev pfa_dev compiler_dev gl_dev
c++_dev pwrc_dev dev motif_dev
ftn_dev complib IDO x_dev
In conclusion, for a comprehensive understanding of the development issues in
the 64-bit environment, it is essential to read the MIPSpro 64-bit Porting and
Transition Guide, even after having read this article, and preferably before
compiling existing code with the 64-bit compilers.
And, as always, reading the release notes for all of the compiler products for
the details on the release's installation instructions, changes and additions,
bug fixes, and known problems and work-arounds, is highly recommended.