---------------------------------------------------------------

2007-05-09.

---------------------------------------------------------------

Question:  how much do execution speed and (possibly) numerical
results vary between our widely used processors and between various
compilers?

Limitations:  as is well known, speed of execution can vary 
greatly with the problem, but only one problem (!) is used here;
only two compiler families are used: GCC and Intel's compiler.
Linux 2.6.{17-20} is the only kernel on which the tests were
made.

The test-case: this is the `bacol' 1D parabolic PDE solver, 
driven with a Burgers' equation problem: the driver and 
the solver, complete with subroutines, are the unmodified files
from http://www.mscs.dal.ca/~keast/research/bacol.html , in 
Fortran77.  This is not a bad thing for me to choose, since I 
intend to run this solver many, many times (though not with Burgers' 
equation) in the line of duty...  Some FEM and general linear
algebra would be worth testing too.  Burgers' equation
is quite a good one to use here: there is a sharp change from 
~1 to 0 in the solution, which shows up small errors in the small 
quantities, and there is a known analytic solution.

Stimulus for comparisons:  numerical reliability is `quite'
important to me, and it has also apparently been a problem for
users of `sandys' on the Xeon system `magsim'.  Speed is very 
important for me if there is to be an outer loop that tries to
fit parameters to the already time-consuming problem of solving
a non-linear PDE for dozens of different input signals (in a 
stress-grading system), so any improvement in speed from switching
compiler or changing options is useful to know of.


---------------------------------------------------------------------
Summary of Numerical and Speed Results

*** Warning: note the specificity of the test-case: I know well
    that one program can go quicker on CPU-A than on CPU-B, while
    another shows a doubling of speed on CPU-B. ***

With the sample tested (essentially recent Intel and AMD CPUs and
GNU and Intel compilers), the results of the tested program fall
into two groups:  Intel's compiler on Intel CPUs, or either compiler
on AMD64, gives the `expected' result.  GNU compilers on Intel CPUs
gave different results.  No change of compiler options that I tried
altered either of these findings.

The speed increase by using -O2 rather than -O0 is huge: as much as 
a factor of three on the AMD64 (h4), and roughly 1.2 to 2 on the
Intel CPUs.
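
As a rough check of that factor, the ratios can be worked out from a
few of the user times listed under `Timings of Execution Speed' below.
A minimal Python sketch; the selection of rows and the names `times'
and `speedup' are only for illustration:

```python
# User times in seconds, copied from the timing lists in this note
# (GNU: no floating-point flag; ifort: -mieee-fp -fp-model-strict).
times = {
    ("h1", "g77 -march=pentium4"):            {"O0": 0.864, "O2": 0.668},
    ("h1", "gfortran-4.1.1 -march=pentium4"): {"O0": 0.836, "O2": 0.632},
    ("h1", "ifort"):                          {"O0": 0.900, "O2": 0.448},
    ("h4", "g77 -march=k8"):                  {"O0": 0.897, "O2": 0.334},
    ("h4", "gfortran-4.1.1 -march=k8"):       {"O0": 1.048, "O2": 0.314},
    ("h4", "ifort"):                          {"O0": 0.937, "O2": 0.327},
}


def speedup(t_slow, t_fast):
    """Factor by which the faster run beats the slower one."""
    return t_slow / t_fast


factors = {key: speedup(t["O0"], t["O2"]) for key, t in times.items()}

for (host, compiler), factor in factors.items():
    print(f"{host}  {compiler:32s} -O0 / -O2 = {factor:.2f}")
```

On these rows the factor is largest for gfortran on h4 and smallest
for the GNU compilers on h1.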

The difference in speed between strict floating-point conformance
to the standard, and fast-math, was about 10% in favour of 'fast-math'
on the Intel CPUs, regardless of compiler, for reasonable other options
of -O2 or -O3.

When insisting on conformant floating-point, executables from Intel's
compiler take around 30% less time than those from the GNU compilers
(i.e. ~40 to 50% longer time with GNU), on Intel CPUs (the 32-bit
systems h1 and h2).
When allowing fast-math, the difference is much less, and the speed
is higher still.
The difference between compilers was much smaller on the AMD64 CPUs 
-- hardly worth considering non-GNU, certainly not if involving 
any cost.
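
The two percentages quoted for the conformant case describe the same
gap seen from opposite ends: time saved relative to the slower run,
or extra time relative to the faster run.  A small Python sketch of
the conversion, using three -O2 user times from h1 in the timing
lists of this note (the function names are made up for illustration):

```python
def pct_longer(t_fast, t_slow):
    """Extra time of the slower run, as a percentage of the faster one."""
    return 100.0 * (t_slow - t_fast) / t_fast


def pct_less(t_fast, t_slow):
    """Time saved by the faster run, as a percentage of the slower one."""
    return 100.0 * (t_slow - t_fast) / t_slow


# h1, -O2, no fast-math (seconds of user time)
ifort_strict = 0.448   # ifort  -O2 -mieee-fp -fp-model-strict
gfortran = 0.628       # gfortran-4.1.1 -march=prescott -O2
g77 = 0.668            # g77 -march=pentium4 -O2

for name, t_gnu in (("gfortran", gfortran), ("g77", g77)):
    print(f"{name:9s} takes {pct_longer(ifort_strict, t_gnu):4.1f}% longer; "
          f"ifort takes {pct_less(ifort_strict, t_gnu):4.1f}% less time")
```

The same gap therefore reads as roughly 29 to 33% when stated as time
saved by ifort, and roughly 40 to 49% when stated as extra time for GNU.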

? What makes two series of GNU Fortran compiler (g77 and gfortran,
from GCC 3.4.6 and 4.1.1 respectively) give significantly different
results on (recent) Intel CPUs than on AMD CPUs?  Intel's compiler 
avoids this.  First step:  try a "vanilla" GCC with no Gentoo patches?
I didn't do that, but did install newly compiled, newer Gentoo systems
on some of the computers:  this made all results the same!


---------------------------------------------------------------
The Computers Used

h1:  `magsim' (Magnet Group simulation server)
	2 * Xeon 3GHz CPU with HyperThreading, 2GB RAM
	(32bit system, though CPUs claimed as EM64T capable)
	GNU C Library 2.5, 
		Compiled by GNU CC version 3.4.6,
		kernel headers linux 2.6.11
	Linux 2.6.17-gentoo-r4 #3 SMP PREEMPT 
	g77:  gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
	gfortran:  gcc version 4.1.1 (Gentoo 4.1.1-r3)
	ifort:  Intel Fortran compiler version 9.1

h2:  `one' (Valery's desktop system)
	1 * Pentium4 3GHz CPU with HyperThreading, 1GB RAM
	GNU C Library 2.5, 
		Compiled by GNU CC version 3.4.6,
		kernel headers 2.6.11
	g77: gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)
	g77: gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
	gfortran: gcc version 4.1.1 (Gentoo 4.1.1-r3)

h3:  `cecill' (Cecilia's desktop system)
	1 * Pentium4 2.6GHz CPU -- HyperThreading DISABLED, 1GB RAM
	GNU C Library 2.5,
		Compiled by GNU CC version 3.4.6,
		kernel headers 2.6.11
	g77:  gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
	gfortran:  gcc version 4.1.1 (Gentoo 4.1.1-r3)

h4:  `gnu'  (Nathaniel's desktop system)
	1 * AMD64-3700+ 2.2GHz CPU, 1GB RAM 
	GNU C Library 2.5,
		Compiled by GNU CC version 3.4.6,
		kernel headers linux 2.6.11
	g77:  gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
	gfortran: gcc version 4.1.1 (Gentoo 4.1.1-r3)
	ifort:  Intel Fortran compiler version 9.1, EM64T

h5:  `magsim' (same host as h1, but with new system, using EM64T)
	GNU C Library 2.5,
		Compiled by GNU CC version 4.1.2 (Gentoo 4.1.2),
		kernel headers linux 2.6.17 
	gfortran:  gcc version 4.1.2 (Gentoo 4.1.2)
	ifort:  Intel Fortran compiler version 9.1, EM64T

(  `shockley':  old HP-UX,  B.10.20 A 9000/715, PA-RISC )

---------------------------------------------------------------
`Expected' data for the Burgers test-case.

INPUT IS  
      KCOL =  2, NINT =  10, ATOL(1) =0.10E-03, RTOL(1) =0.10E-03
      EPS =0.10E-02   TOUT =0.10E+01
      IDID =    3
      THE OUTPUT IS  
      KCOL =  2, NINT =  15
               XOUT              UOUT             EXACTU
            0.000000E+00      0.100000E+01      0.100000E+01
		.....		......		.....


---------------------------------------------------------------------
Compiler Options for Consideration


GCC (gfortran)

-march=k8 (this works on all GCCs; amd64 is an alias on newer GCC)
[-march=nocona (for Xeon in 64-bit EM64T mode) -- not used in first tests]
-march=prescott (P4 with "pni", Xeon with EM64T support used in 32bit mode)
-march=pentium4 (P4 without "pni")

-O2 -fomit-frame-pointer   (my usual GCC optimisation for system) 

-mieee-fp -mieee-with-inexact 
-mfpmath=387|sse|387,sse
-ffast-math  (speed over IEEE754 conformance) 


Intel (ifort)

-axN -xN  (SSE2 processors, e.g. P4 before the `Prescott New Instructions')
-axP -xP  (SSE3 processors, e.g. P4 later models, Xeon)

-O2  (lots of optimisations, e.g. loop unrolling)
-O3  (still more -- more compilation time, no guarantee of better execution)

-mno-ieee-fp  (increase precision and rearrange for speed, reduce consistency)
-mieee-fp  (conform to IEEE FP std)
-fp-model-strict   (precise&except)
-fp-model-fast fast=2  (high speed at expense of `accuracy')


[-fpic ("needed for shared objects")]
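
The timed builds on a 32-bit Intel host such as h1 look like the full
cross-product of a handful of these options.  A small Python sketch of
how such a list of commands could be generated; the option lists are
read off the h1 timings in this note, the function names are made up
for illustration, and no source files are appended to the commands:

```python
from itertools import product


def gcc_commands():
    """GNU compiler command prefixes as they appear in the h1 timings."""
    compilers = ["g77", "gfortran-4.1.1"]
    march = ["-march=pentium4", "-march=prescott"]
    opt = ["-O0", "-O2", "-O3"]
    fp = ["", "-mieee-fp", "-ffast-math"]   # "" means no floating-point flag
    return [
        " ".join(part for part in combo if part)
        for combo in product(compilers, march, opt, fp)
    ]


def ifort_commands():
    """Intel compiler command prefixes as they appear in the h1 timings."""
    cpu = ["", "-axN -xN", "-axP -xP"]      # "" means no processor-specific flags
    opt = ["-O0", "-O2", "-O3"]
    fp = "-mieee-fp -fp-model-strict"
    return [
        " ".join(part for part in ("ifort", c, o, fp) if part)
        for c, o in product(cpu, opt)
    ]


for cmd in gcc_commands() + ifort_commands():
    print(cmd)
```

This gives 36 GNU and 9 Intel variants, the same counts as the h1
block of the timings.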


---------------------------------------------------------------------
Variation of Numerical Result

h1: magsim: dual Xeon: GNU compilers have problems.
0.935918E-13  0.610623E-15  0.000000E+00  --> expected & all Intel compiler results
0.935762E-13  0.630518E-15  0.425549E-17  --> all GNU compiler and opts (gfortran,g77)
h2: same as h1
h3: GNU compilers same as on h1 (Intel not used)
h4: all compilers, GNU (both g77 and gfortran) and Intel, give the same, expected, result 
h5: all compilers, GNU and Intel, give the same, expected, result
(`shockley':  the /opt/fortran/bin/f77 gave the expected result)
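
To put a size on the discrepancy, the two rows quoted for h1 can be
compared value by value.  A minimal Python sketch; the note does not
label the three columns, so they are simply compared position by
position, and setting the differences against ATOL from the `INPUT IS'
block above is an assumption about what counts as significant:

```python
# The two result rows quoted for h1, kept as the Fortran E-format strings.
expected = ["0.935918E-13", "0.610623E-15", "0.000000E+00"]
gnu = ["0.935762E-13", "0.630518E-15", "0.425549E-17"]

ATOL = float("0.10E-03")   # absolute tolerance from the test-case input

abs_diff = [abs(float(g) - float(e)) for g, e in zip(gnu, expected)]

# Relative difference is undefined where the expected value is exactly zero.
rel_diff = [
    d / abs(float(e)) if float(e) != 0.0 else None
    for d, e in zip(abs_diff, expected)
]

for col, (d, r) in enumerate(zip(abs_diff, rel_diff), start=1):
    rel = "n/a" if r is None else f"{r:.3e}"
    print(f"column {col}: abs diff = {d:.3e}, rel diff = {rel}, "
          f"abs diff / ATOL = {d / ATOL:.1e}")
```

On these numbers every absolute difference is of order 1e-17 or
smaller, while the relative difference is of order 1e-4 in the first
column and a few percent in the second.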

---------------------------------------------------------------------
Timings of Execution Speed

host : usertime : compilation command

h1: 0.960s        ifort -axP -xP -O0 -mieee-fp -fp-model-strict
h1: 0.912s        ifort -axN -xN -O0 -mieee-fp -fp-model-strict
h1: 0.900s        ifort  -O0 -mieee-fp -fp-model-strict
h1: 0.872s        g77 -march=prescott -O0 -ffast-math
h1: 0.868s        g77 -march=prescott -O0 -mieee-fp
h1: 0.868s        g77 -march=pentium4 -O0 -ffast-math
h1: 0.864s        g77 -march=pentium4 -O0
h1: 0.860s        g77 -march=prescott -O0
h1: 0.860s        g77 -march=pentium4 -O0 -mieee-fp
h1: 0.852s        gfortran-4.1.1 -march=pentium4 -O0 -mieee-fp
h1: 0.848s        gfortran-4.1.1 -march=prescott -O0
h1: 0.848s        gfortran-4.1.1 -march=pentium4 -O0 -ffast-math
h1: 0.844s        gfortran-4.1.1 -march=prescott -O0 -ffast-math
h1: 0.836s        gfortran-4.1.1 -march=pentium4 -O0
h1: 0.832s        gfortran-4.1.1 -march=prescott -O0 -mieee-fp
h1: 0.680s        g77 -march=prescott -O3 -mieee-fp
h1: 0.672s        g77 -march=prescott -O3
h1: 0.672s        g77 -march=pentium4 -O3 -mieee-fp
h1: 0.668s        g77 -march=pentium4 -O2
h1: 0.664s        g77 -march=pentium4 -O3
h1: 0.664s        g77 -march=pentium4 -O2 -mieee-fp
h1: 0.660s        g77 -march=prescott -O2 -mieee-fp
h1: 0.656s        g77 -march=prescott -O2
h1: 0.636s        gfortran-4.1.1 -march=prescott -O2 -mieee-fp
h1: 0.632s        gfortran-4.1.1 -march=pentium4 -O2
h1: 0.628s        gfortran-4.1.1 -march=prescott -O2
h1: 0.628s        gfortran-4.1.1 -march=pentium4 -O2 -mieee-fp
h1: 0.624s        gfortran-4.1.1 -march=prescott -O3 -mieee-fp
h1: 0.620s        gfortran-4.1.1 -march=prescott -O3
h1: 0.612s        gfortran-4.1.1 -march=pentium4 -O3 -mieee-fp
h1: 0.608s        gfortran-4.1.1 -march=pentium4 -O3
h1: 0.592s        g77 -march=prescott -O3 -ffast-math
h1: 0.588s        g77 -march=pentium4 -O2 -ffast-math
h1: 0.584s        g77 -march=pentium4 -O3 -ffast-math
h1: 0.580s        g77 -march=prescott -O2 -ffast-math
h1: 0.576s        gfortran-4.1.1 -march=prescott -O2 -ffast-math
h1: 0.560s        gfortran-4.1.1 -march=pentium4 -O3 -ffast-math
h1: 0.560s        gfortran-4.1.1 -march=pentium4 -O2 -ffast-math
h1: 0.552s        gfortran-4.1.1 -march=prescott -O3 -ffast-math
h1: 0.492s        ifort -axP -xP -O3 -mieee-fp -fp-model-strict
h1: 0.492s        ifort -axN -xN -O3 -mieee-fp -fp-model-strict
h1: 0.480s        ifort -axN -xN -O2 -mieee-fp -fp-model-strict
h1: 0.460s        ifort  -O3 -mieee-fp -fp-model-strict
h1: 0.452s        ifort -axP -xP -O2 -mieee-fp -fp-model-strict
h1: 0.448s        ifort  -O2 -mieee-fp -fp-model-strict

h2: 0.936s        ifort -axN -xN -O0 -mieee-fp -fp-model-strict
h2: 0.860s        ifort  -O0 -mno-ieee-fp
h2: 0.860s        ifort  -O0 -mieee-fp -fp-model-strict
h2: 0.860s        ifort -axN -xN -O0 -mno-ieee-fp
h2: 0.844s        g77 -march=pentium4 -O0 -ffast-math
h2: 0.840s        g77 -march=pentium4 -O0
h2: 0.832s        g77 -march=pentium4 -O0 -mieee-fp
h2: 0.828s        gfortran-4.1.1 -march=pentium4 -O0
h2: 0.820s        gfortran-4.1.1 -march=pentium4 -O0 -ffast-math
h2: 0.812s        gfortran-4.1.1 -march=pentium4 -O0 -mieee-fp
h2: 0.680s        g77 -march=pentium4 -O2 -mieee-fp
h2: 0.680s        g77 -march=pentium4 -O2
h2: 0.676s        g77 -march=pentium4 -O3
h2: 0.656s        g77 -march=pentium4 -O3 -mieee-fp
h2: 0.632s        gfortran-4.1.1 -march=pentium4 -O3
h2: 0.624s        gfortran-4.1.1 -march=pentium4 -O2
h2: 0.620s        gfortran-4.1.1 -march=pentium4 -O3 -mieee-fp
h2: 0.616s        gfortran-4.1.1 -march=pentium4 -O2 -mieee-fp
h2: 0.608s        g77 -march=pentium4 -O2 -ffast-math
h2: 0.604s        g77 -march=pentium4 -O3 -ffast-math
h2: 0.588s        gfortran-4.1.1 -march=pentium4 -O3 -ffast-math
h2: 0.568s        gfortran-4.1.1 -march=pentium4 -O2 -ffast-math
h2: 0.468s        ifort -axN -xN -O3 -mieee-fp -fp-model-strict
h2: 0.456s        ifort  -O3 -mieee-fp -fp-model-strict
h2: 0.448s        ifort  -O2 -mieee-fp -fp-model-strict
h2: 0.444s        ifort -axN -xN -O2 -mieee-fp -fp-model-strict
h2: 0.412s        ifort  -O3 -mno-ieee-fp
h2: 0.412s        ifort -axN -xN -O3 -mno-ieee-fp
h2: 0.404s        ifort  -O2 -mno-ieee-fp
h2: 0.372s        ifort -axN -xN -O2 -mno-ieee-fp

h3: 0.976s        g77 -march=pentium4 -O0
h3: 0.972s        g77 -march=pentium4 -O0 -mieee-fp
h3: 0.956s        gfortran-4.1.1 -march=pentium4 -O0 -mieee-fp
h3: 0.952s        gfortran-4.1.1 -march=pentium4 -O0
h3: 0.928s        gfortran-4.1.1 -march=pentium4 -O0 -ffast-math
h3: 0.924s        g77 -march=pentium4 -O0 -ffast-math
h3: 0.816s        g77 -march=pentium4 -O2
h3: 0.800s        gfortran-4.1.1 -march=pentium4 -O2
h3: 0.796s        g77 -march=pentium4 -O2 -mieee-fp
h3: 0.780s        g77 -march=pentium4 -O3 -mieee-fp
h3: 0.768s        g77 -march=pentium4 -O3
h3: 0.736s        gfortran-4.1.1 -march=pentium4 -O2 -mieee-fp
h3: 0.720s        g77 -march=pentium4 -O2 -ffast-math
h3: 0.708s        gfortran-4.1.1 -march=pentium4 -O3 -mieee-fp
h3: 0.708s        gfortran-4.1.1 -march=pentium4 -O3
h3: 0.704s        g77 -march=pentium4 -O3 -ffast-math
h3: 0.664s        gfortran-4.1.1 -march=pentium4 -O2 -ffast-math
h3: 0.652s        gfortran-4.1.1 -march=pentium4 -O3 -ffast-math

h4: 1.048s        gfortran-4.1.1 -march=k8 -O0
h4: 1.044s        gfortran-4.1.1 -march=k8 -O0 -ffast-math
h4: 1.035s        gfortran-4.1.1 -march=k8 -O0 -mieee-fp
h4: 0.961s        ifort  -O0 -mno-ieee-fp
h4: 0.961s        ifort -axP -xP -O0 -mno-ieee-fp
h4: 0.954s        ifort -axW -xW -O0 -mno-ieee-fp
h4: 0.943s        ifort -axW -xW -O0 -mieee-fp -fp-model-strict
h4: 0.937s        ifort  -O0 -mieee-fp -fp-model-strict
h4: 0.935s        ifort -axP -xP -O0 -mieee-fp -fp-model-strict
h4: 0.906s        g77 -march=k8 -O0 -ffast-math
h4: 0.905s        g77 -march=k8 -O0 -mieee-fp
h4: 0.897s        g77 -march=k8 -O0
h4: 0.371s        ifort -axW -xW -O3 -mieee-fp -fp-model-strict
h4: 0.355s        ifort -axW -xW -O2 -mieee-fp -fp-model-strict
h4: 0.334s        g77 -march=k8 -O2
h4: 0.330s        ifort  -O3 -mieee-fp -fp-model-strict
h4: 0.327s        ifort  -O2 -mieee-fp -fp-model-strict
h4: 0.326s        g77 -march=k8 -O3 -mieee-fp
h4: 0.325s        gfortran-4.1.1 -march=k8 -O3
h4: 0.325s        g77 -march=k8 -O2 -mieee-fp
h4: 0.318s        gfortran-4.1.1 -march=k8 -O3 -mieee-fp
h4: 0.317s        ifort -axW -xW -O3 -mno-ieee-fp
h4: 0.316s        g77 -march=k8 -O3
h4: 0.314s        gfortran-4.1.1 -march=k8 -O2
h4: 0.308s        gfortran-4.1.1 -march=k8 -O2 -mieee-fp
h4: 0.298s        gfortran-4.1.1 -march=k8 -O3 -ffast-math
h4: 0.295s        ifort -axW -xW -O2 -mno-ieee-fp
h4: 0.292s        ifort  -O3 -mno-ieee-fp
h4: 0.292s        g77 -march=k8 -O2 -ffast-math
h4: 0.288s        gfortran-4.1.1 -march=k8 -O2 -ffast-math
h4: 0.287s        ifort  -O2 -mno-ieee-fp
h4: 0.285s        g77 -march=k8 -O3 -ffast-math

h5: 1.056s        gfortran -O0  -mieee-fp
h5: 1.052s        gfortran -O0
h5: 0.892s        gfortran -O0 -march=nocona
h5: 0.884s        gfortran -O0 -march=nocona -mieee-fp
h5: 0.852s        ifort -O0 -xP -mp1
h5: 0.852s        ifort -O0 -xP -mieee-fp
h5: 0.852s        ifort -O0  -mp1
h5: 0.852s        ifort -O0  -mieee-fp
h5: 0.576s        gfortran -O3
h5: 0.552s        gfortran -O2  -mieee-fp
h5: 0.548s        gfortran -O3  -mieee-fp
h5: 0.544s        gfortran -O2
h5: 0.480s        ifort -O3 -xP -mieee-fp
h5: 0.460s        ifort -O2 -xP -mieee-fp
h5: 0.456s        ifort -O2  -mieee-fp
h5: 0.436s        ifort -O3  -mieee-fp
h5: 0.424s        gfortran -O2 -march=nocona
h5: 0.420s        gfortran -O3 -march=nocona -mieee-fp
h5: 0.416s        gfortran -O3 -march=nocona
h5: 0.416s        gfortran -O2 -march=nocona -mieee-fp
h5: 0.388s        ifort -O3 -xP -mp1
h5: 0.356s        ifort -O3  -mp1
h5: 0.356s        ifort -O2  -mp1
h5: 0.352s        ifort -O2 -xP -mp1

# grep '^h[1-5]: *[0-9]' filename \
   | sed -e 's/^\(h[1-5]\): *\([0-9][^ ]*\)\([\t ]*\)/\2:  \1  /' \
   | sort -rn 
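
The same reordering (time first, slowest run at the top) can be
sketched in Python, which makes it easy to include every host and
times of one second or more.  This is only an illustrative
equivalent; the regular expression and the names `LINE' and
`parse_timings' are assumptions, and the sample is four lines copied
from the timings above plus the header line:

```python
import re

# "h1: 0.960s        ifort -axP -xP -O0 ..."  ->  host, seconds, command
LINE = re.compile(r"^(h\d+):\s*(\d+\.\d+)s\s+(.*\S)\s*$")


def parse_timings(text):
    """Return (seconds, host, command) for every timing line in text."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            host, secs, cmd = match.groups()
            rows.append((float(secs), host, cmd))
    return rows


sample = """\
host : usertime : compilation command
h1: 0.448s        ifort  -O2 -mieee-fp -fp-model-strict
h4: 1.048s        gfortran-4.1.1 -march=k8 -O0
h4: 0.285s        g77 -march=k8 -O3 -ffast-math
h5: 0.352s        ifort -O2 -xP -mp1
"""

# Sort by time, slowest first, like `sort -rn' on the rewritten lines.
rows = sorted(parse_timings(sample), reverse=True)
for secs, host, cmd in rows:
    print(f"{secs:.3f}s:  {host}  {cmd}")
```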


---------------------------------------------------------------------

An aside (more relevant to those who don't use compiled languages).
Comparison of "nbench" (modified matlab benchmark, just non-graphical
tests) on matlab-7.4 (2007a) on the different hosts.

    LU    FFT   ODE   Sparse TOTAL
h1  0.25  0.29  0.37  0.06   0.97 (32bit)
h2  0.24  0.30  0.36  0.06   0.96 (32bit)
h3  0.27  0.30  0.37  0.07   1.01 (32bit)
h4  0.26  0.19  0.18  0.05   0.67 (32bit)
h5  0.20  0.29  0.33  0.06   0.86 (64bit)
h5  0.24  0.27  0.37  0.06   0.94 (32bit matlab, 64bit sys)


So, for matlab's FFT and ODE, at any rate (most useful to me of
the four things here) the AMD64 thrashes the others by some tens
of percent (nearly double the speed on ODE).  Whether this is due
more to the pipelining and memory controller or to the extended
registers available is an open question; it would be interesting
to see what happens if magsim (h1) is indeed upgraded to a 64bit
system and turns out to support that (it was not realised on
purchase that the Xeons are apparently new enough to have EM64T).
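
The size of the AMD64 advantage can be read off the table as ratios.
A minimal Python sketch comparing h1 with h4, assuming the nbench
figures are elapsed times (smaller is better), which is how the remark
above treats them; the dictionary layout is only for illustration:

```python
# nbench figures for two of the hosts, copied from the table above.
nbench = {
    "h1": {"LU": 0.25, "FFT": 0.29, "ODE": 0.37, "Sparse": 0.06},
    "h4": {"LU": 0.26, "FFT": 0.19, "ODE": 0.18, "Sparse": 0.05},
}

# Ratio > 1 means h4 (AMD64) is faster than h1 (Xeon) on that test.
ratio = {test: nbench["h1"][test] / nbench["h4"][test] for test in nbench["h1"]}

for test, r in ratio.items():
    print(f"{test:7s} h1/h4 = {r:.2f}")
```

On these figures ODE comes out at about a factor of two and FFT at
about one and a half, while LU is essentially level.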


---------------------------------------------------------------------