Overview of MIPSpro 64-Bit Compilers
Table of Contents
-----------------
Preface
1. Introducing the 64-bit MIPSpro compilers
2. Binary compatibility
3. New features with the 64-bit MIPSpro compilers
4. Differences between the 32-bit and 64-bit compilers
5. Porting implications
6. Diagnosing porting problems
7. Performance tuning code for the R8000
8. References
Preface
-------
This article is a compilation of existing documents and experience, and is
intended as an overview. After reading this article, we strongly recommend
reading the online, Insight-format MIPSpro 64-Bit Porting and Transition
Guide, available with the purchase of the IRIX Development Environment. If
the expected performance is not achieved after porting code to 64 bits,
please read section 7.
1. Introducing the 64-bit MIPSpro compilers
-------------------------------------------
To utilize the R8000 hardware technology, the compiler system was rewritten
and now supports 64-bit addresses and 64-bit arithmetic along with a powerful
optimizer to keep the hardware pipeline full.
The R8000 advanced superscalar architecture supports 4 instructions per cycle:
up to 2 load/store instructions, and up to 2 integer or 2 floating point
instructions. The R8000 implements the MIPS4 instruction set, which adds the
following new instructions: madd (multiply-add), conditional moves, indexed
FP loads and stores, and reciprocal and reciprocal square root (see the
MIPS4(5) man page).
The new 64-bit compilers, purchasable separately from IRIX 6.0, are C++, C,
FORTRAN 77, Power FORTRAN, Power C, and Assembly. There are also 32-bit
compilers (version 3.19), which can be considered bug-fixed versions of the
IRIX 5.2 32-bit compilers (version 3.18). The 32-bit compilers available with
IRIX 6.0 are the above plus Pascal and Ada. In addition, there is a CROSS64
development environment intended for users who want to develop 64-bit
applications targeting an IRIX 6.0 system but run their development tools on
an IRIX 5.2 system (or who are preparing for a future R8000 upgrade). For
more information, please see the separate article about the CROSS64
development environment.
The MIPSpro compilers use different front-ends than previous releases, a new
back-end (optimizer/code generator), and a new tail-end (linker). Note that the
back-end, under the highest optimization level, can software pipeline the code
such that inner loops are optimized to reduce the ratio of memory accesses to
arithmetic instructions. With the 64-bit compilers, software pipelining, and
the MIPS4 instruction set, maximum performance can be achieved on the R8000.
To understand the development issues in the 64-bit environment, it is
essential to read the new online Insight MIPSpro 64-Bit Porting and Transition
Guide, preferably before compiling existing code with the 64-bit compilers.
And, as always, reading the release notes for all of the compiler products is
highly recommended. For a list of references, see the last section of this
article.
2. Binary compatibility
-----------------------
Currently, IRIX 6.0 is only supported on the R8000 (Power Challenge, Power
Onyx, Power Indigo2). Backward compatibility is always guaranteed, but forward
compatibility is not (in other words, something created on a newer release is
not guaranteed to work on an older release). Backward compatibility is
maintained with IRIX 6.0 and the R8000, with the exceptions below.
Backward Compatibility
----------------------
IRIX 32-bit binaries from a previous release will work under IRIX 6.0 except
those that
1. access kernel data structures, for example via /dev/kmem or /proc. Many
structures have changed sizes. Such a program must be ported to the 64-
bit ABI.
2. use NLIST(3E); those will not work on any 64-bit .o or a.out.
A new NLIST64(3E) is supplied for 64-bit ELF.
3. use LSEEK(2); such programs are restricted to 2 GB unless changed to
   use LSEEK64(2).
4. make any assumption that the page size is 4 Kbytes (for example using
MMAP(2) and specifying the address). The page size is now 16 Kbytes;
such programs must use GETPAGESIZE(2).
5. are Ada programs which catch floating point exceptions.
6. exchange binary data with 64-bit binaries under mismatched size
assumptions. NFS works fine; 32-bit programs that share memory with
64-bit programs will work provided the programs carefully consider
data type size differences; programs using RPC(3R) + XDR(3R) should have
no problem; however, arenas (USINIT(3)) CANNOT be shared between 32-bit
and 64-bit programs. See TEST_AND_SET(3) and ABILOCK(3) for 32 <-> 64
compatible locks.
It is possible for a program to determine whether it is running on a 64-bit
capable kernel using SYSCONF(3C) and the argument _SC_KERN_POINTERS. For
kernel drivers (character, block, streams), the kernel requires MIPS3 ELF
object format code in IRIX 6.0. MIPS1 drivers CANNOT be linked in. Hence,
kernel drivers for 6.0 require a 6.0 system to build them; likewise, drivers
for 5.2 require a 5.2 system. A makefile prototype for drivers is provided
in /var/sysgen/Makefile.kernio.
For 32-bit programs executed on the R8000, performance will be no better than
on the R4400, since those programs were not compiled to utilize the R8000
architecture.
Object Compatibility
--------------------
Mixing 32-bit objects with 64-bit objects is not allowed, just as mixing ELF
and COFF format objects is not allowed, nor is mixing shared and non-shared
objects.
Furthermore:
* On an IRIX 6.0 system, one CAN link an ELF object, archive, or shared object
(.o, .a, or .so) created on IRIX 5.2 with 32-bit .o, .a or .so files created
using the 32-bit compiler on an IRIX 6.0 system. The resultant executable
will run on an IRIX 6.0 system, and likely on an IRIX 5.2 system but that is
not guaranteed.
* 32-bit .o, .a, or .so files CANNOT be linked with 64-bit .o, .a, or .so's.
* 32-bit a.outs CANNOT use 64-bit DSOs, nor can 64-bit a.outs use 32-bit DSOs.
* On an IRIX 5.2 or IRIX 6.0 system, one CAN link .o, .a, or .so files created
with the CROSS64 development environment under IRIX 5.2, with 64-bit .o, .a
or .so files created in IRIX 6.0, since the CROSS64 environment is
essentially the 64-bit environment but installed in /usr/cross64, and the
compilers are 32-bit executables, so they run on IRIX 5.2 systems.
* No 64-bit binaries will execute on current IRIX 5.x systems.
* It is likely that a future IRIX release for R4400 Challenge/Onyx systems
will support 64-bit MIPS3 ELF binaries. So, providing a MIPS3 version of
an application's .o, .a, .so, or executable files will allow future
development on those systems, where customer code can be combined with the
application's objects or libraries, when the application needs a large data
space (more than a 32-bit address can hold) and the MIPS 64-bit ABI.
MIPS instruction sets
---------------------
As background information, the four MIPS instruction sets and the SGI hardware
that supports them are as follows:
1. the MIPS1 instruction set is supported by the R2000/R3000 and all
subsequent processors. This CPU is found in Indigo, PI, PowerSeries
systems (none of these are in current production). By default, 32-bit
compilation uses the MIPS1 instruction set.
2. the MIPS2 instruction set is supported by the R4000 and all subsequent
processors. This CPU is found in Indigo, Indigo2, Indy, Crimson,
Challenge, and Onyx systems.
3. the MIPS3 instruction set is supported by the R4400 and all subsequent
processors. This CPU is found in Indy, Indigo2, Crimson, Challenge, and
Onyx systems. Operating system support is also required; at this point no
IRIX 5.x release provides it, just IRIX 6.0.
4. the MIPS4 instruction set is supported by the R8000 processor. This CPU
is found in Power Challenge/Onyx/Indigo2 systems. On the R8000, the MIPS4
instruction set provides peak floating point performance that the other
instruction sets cannot achieve. By default on the R8000 under IRIX 6.0,
the 64-bit compilers and the MIPS4 instruction set are used.
The MIPS1 and MIPS2 sets provide 32-bit virtual addressing whereas MIPS3 and
MIPS4 provide 64-bit virtual addressing since pointers (and longs) are 64 bits.
3. New features with the 64-bit MIPSpro compilers
-------------------------------------------------
LP64 model
----------
With the MIPSpro compilers, LP64 means that longs and pointers are 64 bits
(8 bytes). The table below shows the C and FORTRAN data type sizes, in bits.
C type           32-bit   LP64    FORTRAN type
char                8        8    CHARACTER
short int          16       16    INTEGER*2
int                32       32    INTEGER
long int           32       64    none
long long int      64       64    INTEGER*8
pointer            32       64    pointer
float              32       32    REAL
double             64       64    REAL*8
long double        64      128    REAL*16
Note that FORTRAN types already reflect the LP64 model (the pointer type is
an extension to ANSI FORTRAN 77).
16 byte arithmetic
------------------
Long double (16 byte) arithmetic is supported by the MIPSpro compilers,
using the ANSI C standard syntax. The 6.0 release introduces routines in
existing libraries which do QUAD precision floating point calculations.
Most of the long double math routines are named by prefixing the letter
'q' to the double precision routine's name: for example, qsin is the long
double version of sin (see the TRIG(3M) man page).
The representation used is not IEEE compliant; long doubles are represented
on this system as the sum or difference of two doubles, normalized so that
the smaller double is <= .5 units in the last place of the larger. This is
equivalent to a 107 bit mantissa with an 11 bit biased exponent (bias=1023),
and 1 sign bit. In terms of decimal precision, this is approximately 34
decimal digits. See the MATH(3M) man page and the 6.0 compiler_dev release
notes, end of chapter 3.
FORTRAN 77 Compiler
-------------------
The new FORTRAN compiler implements REAL*16 and COMPLEX*32 and all associated
intrinsics as 16 byte floating point entities. The 32-bit FORTRAN compiler
had recognized the types but converted them to REAL*8 and COMPLEX*16,
respectively.
%LOC now returns an 8 byte address and %VAL now passes an 8 byte value.
The 64-bit FORTRAN MP I/O library has been enhanced to allow I/O from parallel
regions. In other words, multiple threads can read and write to different
FORTRAN logical units as well as read and write to the same logical unit. The
latter case will of course encounter normal overhead due to file locking.
The 64-bit compiler release provides a unified run-time library for parallel C
and parallel FORTRAN programs (-lmp). This unified library allows parallel C
to be mixed with parallel FORTRAN programs under a single master.
C Compiler
----------
The C compiler has improved diagnostics and is generally stricter in
enforcing ANSI C rules. Many of the options are quite different, such as -O3.
Please see the c_dev release notes for details.
Under the 64-bit compiler, warning messages start with the string "!!!".
Error messages start with the string "###". This allows easier searches
in log files.
C++ Compiler
------------
CC -64 and NCC are both native compilers that are based on the same front end.
The CC -64 front end is fecc, which has 64-bit pointers, addresses, and long
ints. The NCC front end is edgpcfe, a front end with 32-bit pointers,
addresses, and long ints. To invoke the old 32-bit cfront translator, use
"CC -32 -use_cfront". To invoke the 32-bit native C++ compiler, NCC, instead
of the translator, use "CC -32".
SGI continues to ship the old translator as a way to allow a phased migration
to the new compiler. In a future C++ release, the translator will be removed.
It is now possible to make dynamic shared objects (DSOs) from C++ object
files, even when run-time initialization is required. To make a DSO instead
of an executable file, use the -shared option on the CC command line.
The C++ 6.0 compiler implements the C++ language as described in The Annotated
C++ Reference Manual (Margaret Ellis and Bjarne Stroustrup, Addison-Wesley
1990), without, however, the exception handling feature described in Chapter
15 of that book. Those constructs formerly unimplemented by cfront are now
implemented with the native C++ compilers. Please see the online C++
Programmer's Guide and c++_dev release notes for more information.
Compiler tools
--------------
The 6.0 DBX(1) has been re-implemented and supports a new debugging format
called DWARF. DWARF is a format for the debugging information generated by the
compiler, assembler, and linker that is necessary for source-level debugging.
See the DWARF(4) and DWARFDUMP(1) man pages; dwarfdump is a new tool to dump
the debug info of an ELF object.
Pixstats functionality has been integrated into prof. Makefiles or scripts
that invoke pixstats will have to be changed.
Cord is available for 32-bit objects only.
The compiler tools understand 32-bit and 64-bit ELF objects, and for the
most part understand COFF (IRIX 4.x) objects (dbx does, dis does not).
64-bit interface register convention
------------------------------------
Register     Software Name   Use                  Who saves
--------     -------------   ---                  ---------
$0           zero            Hardware zero        -
$1           at              Assembly temp        caller
$2..$3       v0..v1          Function results     caller
$4..$11      a0..a7          Function arguments   caller
$12..$15     t0..t3          Temps                -
$16..$23     s0..s7          Saved                callee
$24          t8              Temp                 caller
$25          t9              Temp                 caller (& callee in PIC)
$26..$27     k0..k1          Kernel temps         -
$28          gp              Global pointer       callee
$29          sp              Stack pointer        callee
$30          s8              Frame pointer        callee
$31          ra              Return address       caller
hi, lo                       Multiply/Divide      caller
$f0, $f2                     FP func results      caller
$f1, $f3                     FP temps             caller
$f4..$f11                    FP temps             caller
$f12..$f19                   FP arguments         caller
$f20..$f23                   FP temps             caller
$f24..$f31                   FP saved             callee
Note that "caller-saved" means only that the caller may not assume that the
value in the register is preserved across the call.
64-bit subprogram interface
---------------------------
At most, a total of eight floating point registers ($f12..$f19) may be used
to pass FP arguments, or up to eight integer registers ($4..$11) may be used
to pass integer arguments. For example, where d1..d3 are double precision FP
arguments, s1..s3 are single precision FP arguments, and n1..n3 are integer
arguments: if d1,d2,d3,s1,s2,s3,n1,n2,n3 are in the argument list, then the
register and stack assignments are $f12,$f13,$f14,$f15,$f16,$f17,$10,$11,stack.
Another example: n1,n2,d1 are assigned to $4,$5,$f14.
Whenever possible, floating point arguments are passed in floating point
registers regardless of whether they are preceded by integer parameters.
[The 32-bit ABI allows only leading floating point arguments to be passed
in FP registers; those coming after integer arguments must be moved to
integer registers.]
Variable argument routines require an exception to this rule. Any floating
point parameters in the variable part of the argument list are passed in
integer registers. There are several important cases involved:
- If a varargs prototype (or the actual callee definition) is available to
  the caller, the caller places floating point parameters directly in the
  required integer registers, and there are no problems.
- If no prototype is available to the caller for a direct call, then the
caller's parameter profile is provided in the object file (as are all
global subprogram formal parameter profiles), and the linker (ld/rld)
generates a diagnostic message if the linked entry point turns out to
be a varargs routine.
- If no prototype is available to the caller for an indirect call (that
is, via a function pointer), then the caller assumes that the callee
is not a varargs routine and places floating point parameters in FP
registers. If the callee is varargs, the code is not ANSI-conformant.
4. Differences between the 32-bit and 64-bit compilers
------------------------------------------------------
As previously stated, there are two types of compilers available with IRIX
6.0: the 32-bit (ucode) compilers, which are bug-fixed versions of the
compilers available under IRIX 5.2, and the 64-bit compilers. Using the
-32 switch on the compiler command line invokes the 32-bit compiler
stages; using -64 invokes the 64-bit compiler stages.
*The default switches on Power Challenge/Onyx/Indigo2 systems (R8000) are
"-64 -mips4".* The 64-bit compilers are used and code is generated using
the MIPS4 instruction set. If you use the 32-bit compilers often, setting
the environment variable SGI_ABI to 32 saves typing -32 on each compile line.
Simple 64-bit compiler stage flow is cc->fec->be->ld instead of the 32-bit
cc->cfe->ugen->as1->ld, or for FORTRAN f77->cpp->fef77->be->ld instead of
the 32-bit flow f77->cpp->fcom->ugen->as1->ld.
Components of the MIPSpro 64-bit compilers

MIPSpro    Ucode                              MIPSpro    Ucode
64-bit     32-bit                             64-bit     32-bit
FORTRAN 77           Role performed           C
------------------   --------------------     -------------------
f77        f77       driver                   cc         cc
cpp        cpp       preprocessor             fec        cfe
fef77      fcom      front-end                fec        cfe
fef77      fopt      scalar optimizer         copt       copt
fef77      fcom      interprets parallel      mpc        accom_mp
                     directives
fef77p     pfa       automatic parallel       pca        pca
                     accelerator
be         ugen&as1  back-end                 be         ugen&as1
ld         ld        linker                   ld         ld
Although they are two separate and different compiler systems, the 32-bit
and 64-bit compilers have similar command line interfaces. Please see the
CC(1) and F77(1) man pages for a list of supported options; the MIPSpro
Porting and Transition Guide also summarizes the differences in switches.
32-bit compiler flags that the 64-bit compiler does not support:
-32 by definition
-mips1 generate code using MIPS1 instruction set (the default)
-mips2 generate code using MIPS2 instruction set
64-bit compiler flags that the 32-bit compiler does not support:
-64 by definition
-mips3 generate code using MIPS3 instruction set
-mips4 generate code using MIPS4 instruction set (default on R8000)
-help print list of possible options
Switches accepted by both compilers but having different semantics:
-v -show -show gives the compiler stage flow for 64-bit; -v for ucode
-woff turn off named warnings, but the warning numbers are different
between 32-bit and 64-bit compilers.
-Wc,arg where c designates to which pass of the compiler the argument
is going to be passed. Since the compiler stages changed,
the choices for c are different.
FORTRAN 77 compiler differences
-------------------------------
A major part of the front-end is KAP, the Kuck and Associates Preprocessor.
KAP is an optimizer that analyzes data dependence to guide serial
optimizations and automatic parallelization. KAP performs scalar
optimizations such as outer loop unrolling and pulling invariants out of
loops. Unlike with the ucode compiler, -O2 performs BOTH scalar and
back-end optimizations.
Other differences with the FORTRAN 77 front-end, fef77, are as follows:
- fef77 allows empty arguments in subroutine/function calls which result
in zeroes passed by value.
- fef77 implements the FORTRAN 8x (subset of FORTRAN 90) array syntax
- fef77 and fcom differ in how they fold REAL*4 constants: fcom internally
  promotes them to REAL*8, whereas fef77 adheres to the ANSI standard.
- fef77 allows fewer constant expressions in PARAMETER statements than fcom.
- fcom allows lines longer than 256 characters by default; fef77 currently
  has a hard limit of 256.
C compiler differences
----------------------
Some compiler switches that are no longer supported under 6.0 MIPSpro are
-wlint (use lint instead), any options that have to do with ucode like -j,
-Olimit, -Xcpluscomm, -varargs, and many others.
Because of the LP64 model, data type size differences and alignment can
cause internal padding in structures (likewise for FORTRAN common blocks).
For example:
struct t { char c1; char c2; short s; long l; } t1;

32-bit (word alignment for l, sizeof(struct t) is 8):

byte:   0    1    2    3    4    5    6    7
      +----+----+---------+-------------------+
      | c1 | c2 |    s    |         l         |
      +----+----+---------+-------------------+

64-bit (doubleword alignment for l, sizeof(struct t) is 16):

byte:   0    1    2    3    4    5    6    7
      +----+----+---------+-------------------+
      | c1 | c2 |    s    |        pad        |
      +----+----+---------+-------------------+
      |                   l                   |
      +----+----+---------+-------------------+
Because of alignment and padding, programs should use the heuristic of
placing the largest data types at the beginning of the struct or common
block (and the smallest at the end).
C++ compiler differences
------------------------
In some cases, the Silicon Graphics native C++ compilers are not backwards-
compatible with cfront because cfront has defects, behaves in a non-
deterministic manner, or fails to adhere to the standard. Also, there is a
new, incompatible mechanism for handling C++ templates.
The native compilers provide much tighter error checking than cfront.
Please see the online C++ Programmer's Guide and c++_dev release notes for
more information.
Back-end differences
--------------------
With the 64-bit compiler, the back-end is the code generator and optimizer
rolled into one stage, "be". This is where the software pipelining happens.
With the MIPSpro compilers, -O3 means software pipelining, whereas with ucode
it means intra-procedural optimizations. A variety of back-end optimization
flags are available; many are activated based on the -O level specified, but
some require semantic knowledge of the code and therefore can only be
activated by *the user*. For details on these options, please consult the
MIPSpro 64-bit Porting and Transition Guide and the compiler driver man
pages (CC(1), F77(1)).
Library structure differences
-----------------------------
There is also a new library structure with IRIX 6.0; the drivers (cc, f77)
and rld know where to find the matching MIPS instruction set libraries.
The location of the shared objects is part of the ABI; MIPS1/2 32-bit
programs have a different ABI than MIPS3/4 64-bit programs. The native
development library directory structure is the same for MIPS1/2, but for
MIPS3/4 libraries the shared objects are located in /usr/lib64/mips3 and
/usr/lib64/mips4, and the compiler stages are located in /usr/lib64/cmplrs.
5. Porting implications
-----------------------
C implications
--------------
Within source code, most porting problems will arise from assumptions, implicit
or explicit, about either absolute or relative sizes of the int, long int, or
pointer types. The most common classical assumptions are likely to be:
* size assumptions
sizeof (int) == sizeof (void *)
sizeof (int) == sizeof (long)
sizeof (long) == 4
sizeof (void *) == 4
sizeof (long) == sizeof (float) [long gets narrowed in mixed expressions]
Use -fullwarn, or -wlint with -32 (lint with -64), to expose size issues
* sign extension possible
int *p, i;
p = (int *) i;
This generates "warning(1412): source type of cast is too small to hold
all pointers: sign extension possible". Dereferencing the pointer will
cause a bus error and core dump at run-time, since a stack address looks
like 0xffffffae50, not 0xffffffffffffae50. Notice that the virtual
address is 40 bits (10 hex digits) with MIPS3/4; a user process can create
a virtual address space of up to 1 terabyte (2^40) in size, provided memory
plus swap is larger.
* truncation possible
unsigned i, *p;
i = (unsigned) p;
This generates a "warning(1411): destination type of cast is too small to
hold all pointers: truncation possible" because a 4 byte data type is not
big enough to hold an 8 byte data type.
* format strings in printf or scanf
%d and %ld differ when compiled -64
Using %d to print an 8-byte value prints only the low-order 4 bytes; use %ld.
* constants - hex constants are not sign extended
long x;
... ( (long) ( x + 0xffffffff ) ) ...
With -32 this evaluates to (x-1); with -64, however, it is (x+4294967295).
* use prototypes, especially when varargs are involved
printf("%f",float_val);
Recall the subprogram interface change mentioned in section 3: varargs
floating point arguments must be placed in integer registers, so failing to
prototype the call above would result in the wrong value being printed.
Watch for compiler/linker warning messages in order to spot these instances.
FORTRAN implications
--------------------
The FORTRAN compiler has no data size changes, as its types have specific bit
sizes (REAL still implies 4 bytes), so standard ANSI FORTRAN 77 code should
have no problems. For FORTRAN code that interfaces to C, care needs to be
taken since arguments are passed by reference and pointers are now 8 bytes.
Also, %loc now returns 64-bit addresses and %val passes 64-bit values.
If the C code used ints to contain the pointers then a change is needed.
Example: FORTRAN calling C
FORTRAN C
------- ---
integer i,j foo_(int *i, int *j) or,
call foo(i,j) foo_(long i, long j) less preferable
FORTRAN subprograms called by C where long int arguments are passed (by
address) may need to change argument declarations.
Example: C calling FORTRAN
C FORTRAN
--- -------
long l1, l2; subroutine foo( i, j )
foo_( &l1, &l2 ); #if (_MIPS_SZLONG==64)
integer*8 i, j
#else
integer*4 i, j
#endif
FORTRAN arguments passed by %VAL calls to C routines should be declared as
long ints in the C routines.
Example: FORTRAN arguments passed by %VAL
FORTRAN C
------- ---
call foo(%VAL(i))              foo_( long i )
FORTRAN code that uses %LOC may need to be changed to store an 8 byte address.
Example: FORTRAN use of %LOC in -64
common // heap
#if (_MIPS_SZPTR==64)
integer*8 haddress
#else
integer*4 haddress
#endif
haddress = %loc(heap)
Coding variable size issues
---------------------------
Typically one would not want to maintain two copies of source code, but rather
maintain a single source with more complex makefiles to generate two outputs,
a 32-bit binary and a 64-bit binary. Care must be taken to create only one
version of every header file. Use typedef'd types for fields which are to be
of a constant size and for those to be of a natural size. For example, off_t
and size_t are 64 bits when compiled -64, while the same types compiled -32
remain 32 bits in size. See /usr/include/inttypes.h.
A less preferable alternative is to use #if for 32 vs 64, as shown in the above
FORTRAN examples. See the porting guide for a full list of compiler predefined
variables, like _MIPS_SZPTR, for MIPS1 and MIPS4 executables.
6. Diagnosing porting problems
------------------------------
This section describes possible causes for code that runs differently and/or
incorrectly under IRIX 6.0 than under IRIX 5.2. For code that is not achieving
expected performance, see the next section, titled "Performance tuning code
for the R8000".
32-bit program gets different answer on R8000
---------------------------------------------
The first case is a 32-bit program that gives different floating point results
when executed on an R8000 as compared to an R3000 or R4000 CPU. Two possible
reasons are algorithm changes in the 32-bit DSOs provided with the MIPSpro 6.0
environment, and a hardware change in the handling of floating point
exceptions.
A. Algorithm change in a needed library
To determine if this affects the code, try copying from a 5.2 system the
needed 32-bit DSOs (use "elfdump -Dl a.out" to see what is needed) and at
run-time, link with those DSOs (by using the LD_LIBRARY_PATH environment
variable, see RLD(1) ). The 32-bit DSOs provided in IRIX 6.0 are similar
to those provided in IRIX 5.3.
B. Hardware change in handling very tiny floating point numbers (fpmode)
To determine if this is affecting the code, try toggling fpmodes and re-running
the program via fpmode, "fpmode precise ". There is excellent
documentation of the two FP modes, performance and precise, in chapter five of
the MIPSpro 64-Bit Porting and Transition Guide; see also the FPMODE(1)
man page. Essentially, the MIPS Floating-Point Architecture has been extended
to improve performance significantly for those programs that do not care about
denormalized numbers generated by their code. Denormalized numbers are very
tiny numbers, less than 2^-126 (~10^-38) in single precision and less than
2^-1022 (~10^-308) in double precision. On R4x00 systems, these raised
underflow exceptions that were trapped by the kernel; they are now flushed
to zero in hardware in performance floating point mode.
In many cases, the application performs correctly if all the denormalized
intermediate results were rounded to zero. However, if the application truly
requires representation of denormalized numbers in order to perform correctly,
then use "fpmode precise" (or likewise from a program use SYSSGI(2)). Precise
exception mode fully complies with the IEEE Standard and is compatible in every
way with the preexisting MIPS floating-point architecture.
One final reason to use "fpmode precise" is when debugging a program that
generates floating point exceptions and needs the floating point signal handler
to have the right state information, such as the instruction that caused the
exception. With "fpmode performance", FP instructions can be executed out of
order and exceptions trapped imprecisely so the floating point signal handlers
may not work as expected.
It should be emphasized that running in performance mode does not affect those
applications which do not cause floating point exceptions.
64-bit program gets different answer than its 32-bit counterpart
----------------------------------------------------------------
In addition to the reasons above (library algorithm changes and FP performance
mode), reasons why a 64-bit program can get different answers than the 32-bit
version of the same source base are the MIPS4 madd instructions, additional
library accuracy for 16 byte arithmetic, operation reductions by optimizations,
reassociation of operations by optimizations, or an unstable user algorithm.
A. MIPS4 madd instructions
The intermediate result of the multiply-add/subtract instructions is calculated
to infinite precision and is not rounded prior to the addition or subtraction.
The result is then rounded according to the rounding mode specified by the
instruction. This can yield slightly different calculations than a multiply
instruction (which is rounded) and an add instruction (which is rounded again).
B. Additional accuracy in math library for 16 byte arithmetic
The MIPS3/4 math library, -lm, contains routines newly implemented (1994) using
algorithms which take advantage of the MIPS architecture. FORTRAN code that
uses REAL*16 or C code that uses long double, and makes math function calls,
could get different run-time results due to additional accuracy that the QUAD
routines offer.
C. Operation reductions by optimizations (IEEE arithmetic non-conformance)
The extent to which optimizations must preserve IEEE floating point arithmetic
is controlled by the -OPT:IEEE_arithmetic option. In this FORTRAN do loop,
DO i = 1,1000
sum = sum + a(i)/divisor
END DO
at -OPT:IEEE_arithmetic=1, the generated code must do all the loop iterations
in order, with a divide and an add in each. Using -OPT:IEEE_arithmetic=3, the
divide can be treated like a(i)*(1.0/divisor). On the R8000, the reciprocal
can be calculated with a recip instruction before the loop is entered, reducing
the loop body to a much faster multiply-add (madd) per iteration. Note that
IEEE arithmetic conformance is the default (=1), so this flag can only cause
the 64-bit program to give different floating point results when
-OPT:IEEE_arithmetic=2 or 3 is set.
D. Reassociation of operations by optimizations (causing roundoff errors)
For 64-bit programs compiled with -O3 optimization, cumulative roundoff errors
may occur due to associative rearrangement (even across loop iterations) and
distribution of multiplication over addition/subtraction. Note that with -O3
the default is -OPT:roundoff=2 (out of a possible 3); -O0 through -O2 default
to -OPT:roundoff=0. To see if roundoff is affecting the program, try the
"-mips4 -O3 -OPT:roundoff=0" flags.
Also, the KAP loop analyzer is on by default with -O3 and may perform
optimizations that introduce roundoff error. To turn these off, recompile
with "-WK,-o=0,-r=0,-so=0"; if using -pfa to parallelize the code, just use
"-WK,-r=0", since KAP also performs the parallel optimizations. KAP flags
affect the front-end compilation stage, and OPT options affect the back-end.
E. Unstable user algorithm
It is not good programming practice to test computed floating point values
for equality. Rather than testing
IF ( x .EQ. y )
it is preferred to test
IF ( abs ( abs(x) - abs(y) ) < eps )
where eps is an appropriately picked delta.
Such an algorithm may give different results even with -OPT:roundoff=0 because
of the nature of floating point representation.
Recall that the 64-bit FORTRAN front-end differs from the 32-bit compiler in
constant folding: REAL*4 constants are no longer internally promoted to
REAL*8.

      program test
      real r
      r = 3.14 / 3
      write (6,10) r
 10   format (f17.10)
      end

% f77 -32 cf.f          % f77 -64 cf.f
% a.out                 % a.out
   1.0466666222            1.0466667414
As long as the user's algorithm doesn't depend on so many digits of precision,
this change (required for FORTRAN ANSI standard adherence) won't be an issue.
Isolating parallel processing (MP) problems
-------------------------------------------
SGI recommends first getting the application working with no parallelization at
the highest optimization level. When testing the parallel version, first run
it with only 1 thread (either on a 1 cpu machine or by setting the environment
variable MP_SET_NUMTHREADS to 1). If there is time, go back down to -g for
the first MP test, run that with 1 thread and multiple threads, and then go up
the optimization scale, testing one thread and then testing multiple threads.
This follows the general principle of changing only one thing at a time. This
methodology of incremental iterations will most likely help quickly narrow down
the problem.
7. Performance tuning code for the R8000
----------------------------------------
The R8000 architecture is a big performance advantage for code with floating
point loops having large iteration counts. The R8000 performance story is
all about software pipelining (SWP) and the MIPS4 instruction set. Keeping
the pipeline full with 4 instructions per cycle, unrolling inner loops to
eliminate redundant loads, reducing the ratio of loads to madd (multiply-add)
instructions per loop iteration, reducing the ratio of hardware cycle count per
loop iteration, and making use of the many registers, are all techniques that
lead to peak performance for loop intensive programs, especially floating
point loop intensive programs.
More about software pipelining
------------------------------
A. Definition
SWP mixes operations from different loop iterations in each iteration of the
hardware loop so that the pipeline is kept full (up to 4 instructions per
cycle).
B. How software pipelined code looks and how it works
A simple DAXPY loop (double precision a times x plus y), shown below, can be
coded as two load instructions followed by a madd instruction and a store.

      DO i = 1, n                  0: ldc1(x)  ldc1(y)  madd
        y(i) = y(i) + a*x(i)       1:
      END DO                       2:
                                   3:
                                   4: sdc1
However, there is a three cycle delay before the results of the madd can be
stored. So, to keep the interim cycles filled the loop can be unrolled and
rewritten such that operations from different loop iterations can be mixed
in each iteration of the hardware loop. This can look like the following:
Windup:
1: t1 = ldc1 t2 = ldc1 t7 = madd t1 + t2
2: t4 = ldc1 t5 = ldc1 t8 = madd t4 + t5
L1:
1: t1 = ldc1 t2 = ldc1 t3 = madd t1 + t2
2: t4 = ldc1 t5 = ldc1 t6 = madd t4 + t5
3: sdc1 t7 sdc1 t8 beq compensation1
4: t1 = ldc1 t2 = ldc1 t7 = madd t1 + t2
5: t4 = ldc1 t5 = ldc1 t8 = madd t4 + t5
6: sdc1 t3 sdc1 t6 bne L1
Winddown:
1: sdc1 t7 sdc1 t8 br ALLDONE
compensation1:
1: t7 = t3 t8 = t6
2: br Winddown
ALLDONE:
In the above example, there are two loop replications (1-3 and 4-6). Note that
every loop replication completes 2 loop iterations in 3 cycles, instead of one
loop iteration in five cycles. The stores in this loop are storing the madd
results from previous iterations. But, in general any operations from any
number of different iterations can be mixed. In order to properly prepare for
entry into such a loop, a Windup section of code is added to set up registers
for the first stores in the main loop. In order to exit the loop properly, a
Winddown section is added to perform the final stores. Any preparation of
registers needed for the Winddown section is done in the compensation section.
C. Parallelization (MP) tradeoffs
Software pipelining is performed only on inner loops. Outer loop unrolling is
performed by KAP (see -sopt in F77(1), or for C see the PCA man page). Because
of SWP, it is more important for the R8000 to have inner loops with large
iteration counts, as compared to the R4400. This may affect the strategy for
determining the optimal nesting of loops. When tuning for both single proc-
essor performance and multiprocessing, there may exist tradeoffs between plac-
ing the largest iteration count on the inside for best SWP or on the outside
to provide the greatest MP opportunity.
64-bit code not performing as well as expected
----------------------------------------------
A. SWP failed
Some programs, when compiled -64, will not perform better than on the R4400.
However, loop intensive programs, especially ones with SAXPY or DAXPY loops,
will run faster. If the contrary is observed, possible reasons are as follows:
* not compiling with -O3. Software pipelining is activated by -O3.
* SWP works only on inner loops; inner loops with subroutine calls
(including many intrinsics) or branches will not software pipeline.
* loops with many lines of code generally will not SWP because of not
enough available registers; unrolling uses a lot of registers.
* KAP and SWP conflicts: KAP may fuse loops causing the resulting loop
to not pipeline well (try using "-WK,-nofuse"); KAP may unroll inner
loops which may be SWP'ed; when hand tuning code for the SWPer, one
may need to turn KAP off with "-WK,-o=1,-r=0".
Just how well did the software pipeliner do? Check for statistics in the .s
file that gets created with the -S option. This is an annotated assembly code
file that also denotes the sections such as Windup and Winddown. The .s file
reflects the actual order of instructions unlike assembly files produced by
previous releases. If a loop can be hand software pipelined into a better
schedule of operations, then it is considered a bug.
One may be able to decrease cycle counts by making changes like splitting loops
that do not pipeline into smaller loops that do, and adding the Cray directive
"cdir$ ivdep" when SWP complains about possible recurrences but there really
are no data dependencies. It may also help to change the number of loop
unrollings that SWP performs. The default is 2, so -SWP:unroll_times_max=4
and -SWP:unroll_times_max=1 will show whether more or less unrolling,
respectively, is beneficial.
B. Bellow stalls
A bellow stall is the scenario where two loads, or one load and one store,
issued in the same cycle both operate on the same memory bank. When this
occurs, one of the accesses is delayed by one cycle in the address bellow
register. This means that in the worst case only one of the (at most) two
memory accesses per cycle completes, and the software pipeliner's efficiency
is reduced. The compiler cannot always guarantee even/odd access. Some
techniques to work around possible bellow stalls are as follows:
* padding common blocks; try trial and error or verify that consecutive
accesses go to alternate banks of memory.
* not declaring array sizes as a power of 2 (Budnick and Kuck, '71)
This can affect loops containing double precision arrays.
* interleaving iterations such as:
real a, x(100000), y(100000)
do i = 1, nn, 4
y(i) = y(i) + a*x(i)
y(i+2) = y(i+2) + a*x(i+2)
y(i+1) = y(i+1) + a*x(i+1)
y(i+3) = y(i+3) + a*x(i+3)
end do
instead of
do i = 1, nn
y(i) = y(i) + a*x(i)
end do
To see if bellow stalls are affecting the program (assuming the floating point
intensive loops did pipeline well), use prof to find the number of
instructions executed in the loop (assuming it is large enough to warrant
investigation), and divide this by the execution time for the loop; how close
did the loop come to the expected performance?
C. Parallel speed-up (running on increasing # of CPUs)
One may find that the Power Onyx/Challenge does not exhibit more MP speed-up
than the Onyx/Challenge for the same parallelization strategy. This is because
the R8000 has significantly sped up the calculations in the parallel region,
but the overhead of communication between processors, being a memory operation,
is the same as on Onyx/Challenge. This effect would be greater for fine-grained
parallel regions (short MP tasks) and should be negligible for coarse-grained
parallel regions.
The primary determining factor in the MP overhead is the cost of doing a bus
transaction, measured in terms of cpu instructions (i.e. how much work has to
be given up in order to do the bus transaction). The bus transaction time has
been going down, but not nearly at the rate that CPUs have been getting faster.
So, although the *absolute* cost of a bus transaction on an Onyx is a little
less than say on a 4D440, the *relative* cost is much higher, thus the MP
overhead is worse and the speed-up numbers are lower.
By the same token, note that *absolute* times for applications have gone down
(they are running faster), but the *relative* times (the speed-up) are worse.
The R8000 departs somewhat from this trend because there is a small amount of
hardware assist to reduce the MP overhead. Despite this, many applications
will still see worse *speed-up* numbers on the R8000, because the cpu is so
fast (i.e. because the *absolute* numbers have gone down so much).
8. References
-------------
Online Insight manuals that have been changed for the IRIX 6.0 Development
Option (IDO) and MIPSpro compilers are the following:
MIPSpro 64-Bit Porting and Transition Guide
MIPSpro Compiling, Debugging and Performance Transition Guide
MIPSpro Assembly Language Programming Guide (MIPSpro 6.0.1)
MIPSpro FORTRAN 77 Programming Guide
MIPSpro FORTRAN 77 Language Reference Manual
MIPSpro Power FORTRAN Programmer's Guide
The OpenGL Programming Guide (Addison-Wesley Publishing Company)
The OpenGL Reference Manual (Addison-Wesley Publishing Company)
The OpenGL Porting Guide
Indigo Magic Desktop Integration Guide
MIPSpro compiler release notes that are supplied:
c_dev pfa_dev compiler_dev gl_dev
c++_dev pwrc_dev dev motif_dev
ftn_dev complib IDO x_dev
In conclusion, for a comprehensive understanding of the development issues in
the 64-bit environment, it is essential to read the MIPSpro 64-bit Porting and
Transition Guide, even after having read this article, and preferably before
compiling existing code with the 64-bit compilers.
And, as always, reading the release notes for all of the compiler products for
the details on the release's installation instructions, changes and additions,
bug fixes, and known problems and work-arounds, is highly recommended.