---------------------------------------------------------------
2007-05-09.
---------------------------------------------------------------

Question: what variation of execution speed, and (possibly) of
numerical results, is there between our widely used processors and
between various compilers?

Limitations: as is well known, speed of execution can vary greatly
with the problem, but only one problem (!) is used here; only two
compiler families are used: GCC and Intel's compiler. Linux
2.6.{17-20} is the only kernel on which the tests were made.

The test-case: this is the `bacol' 1D parabolic PDE solver, driven
with a problem of the Burgers equation: the driver and the solver,
complete with subroutines, are the unmodified files from
http://www.mscs.dal.ca/~keast/research/bacol.html , in Fortran77.
This is not a bad thing for me to choose, since I intend to run this
many, many times (though not with Burgers' equation) in the line of
duty... Some FEM and general linear algebra would be no bad idea to
test too. Burgers' equation is quite a good one to use here: there is
a sharp change from ~1 to 0 in the solution, showing up small errors
in the small quantities, and there is a known analytic solution.

Stimulus for comparisons: the numerical reliability is `quite'
important to me, and it has also apparently been a problem for users
of `sandys' on the Xeon system `magsim'; speed is very important for
me, if there is to be an outer loop that tries to fit parameters to
the already time-consuming problem of solving a non-linear PDE for
dozens of different input signals (in a stress-grading system), so an
improvement in speed by switching compiler or by changing options is
useful to know of.

---------------------------------------------------------------------

Summary of Numerical and Speed Results

*** Warning: note the specificity of the test-case: I know well that
one program can go quicker on CPU-A than on CPU-B, while another shows
a doubling of speed on CPU-B.
***

With the sample tested (essentially recent Intel and AMD CPUs, and GNU
and Intel compilers), the results of the tested program fall into two
groups: Intel's compiler on Intel CPUs, or either compiler on AMD64,
gives the `expected' result. GNU compilers on Intel CPUs gave
different results. No tried change of compiler options changed either
of these claims.

The speed increase from using -O2 rather than -O0 is huge: as much as
a factor of three.

The difference in speed between strict floating-point conformance to
the standard and fast-math was about 10% in favour of `fast-math' on
the Intel CPUs, regardless of compiler, for reasonable other options
of -O2 or -O3.

When insisting on conformant floating-point, executables from Intel's
compiler take around 30% less time than those from the GNU compilers
(i.e. ~50% longer time with GNU) on Intel CPUs. When allowing
fast-math, the difference is much less, and the speed is higher still.

The difference between compilers was much smaller on the AMD64 CPUs --
hardly worth considering non-GNU, certainly not if involving any cost.

? What is it that makes two series of GNU Fortran compiler (g77 and
  gfortran, from GCC 3.4.6 and 4.1.1 respectively) give significantly
  different results between (recent) Intel and AMD CPUs? Intel's
  compiler avoids this.
  First step: try a "vanilla" GCC with no Gentoo patches?
  Didn't do that, but did install newly compiled, newer Gentoo systems
  on some of the computers: this made all results the same!!!
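As a cross-check of the percentages in this summary, the following
Python sketch recomputes them from a few user times copied from the h1
and h4 timing tables in this file. The function and constant names are
my own illustrative choices, not part of the test setup. The point is
only that "30% less time" and "40-50% longer time" are two ways of
stating the same ratio.

```python
# Illustrative cross-check only; all names here are hypothetical.
# Values are user times in seconds, copied from the h1 and h4 timing tables.
H1_IFORT_O2_STRICT = 0.448    # ifort -O2 -mieee-fp -fp-model-strict
H1_GFORTRAN_O2_IEEE = 0.636   # gfortran-4.1.1 -march=prescott -O2 -mieee-fp
H1_GFORTRAN_O2 = 0.628        # gfortran-4.1.1 -march=prescott -O2
H1_GFORTRAN_O2_FAST = 0.576   # gfortran-4.1.1 -march=prescott -O2 -ffast-math
H4_GFORTRAN_O0 = 1.048        # gfortran-4.1.1 -march=k8 -O0
H4_GFORTRAN_O2 = 0.314        # gfortran-4.1.1 -march=k8 -O2


def time_saved(slow, fast):
    """Fraction of the slower run's time that the faster run saves."""
    return (slow - fast) / slow


def extra_time(slow, fast):
    """Fraction by which the slower run takes longer than the faster run."""
    return (slow - fast) / fast


def speedup(slow, fast):
    """Plain ratio of run times (slow / fast)."""
    return slow / fast


if __name__ == "__main__":
    print("ifort vs gfortran, strict FP, h1: %.0f%% less time, %.0f%% longer with GNU"
          % (100 * time_saved(H1_GFORTRAN_O2_IEEE, H1_IFORT_O2_STRICT),
             100 * extra_time(H1_GFORTRAN_O2_IEEE, H1_IFORT_O2_STRICT)))
    print("-ffast-math gain, gfortran -O2, h1: %.0f%%"
          % (100 * time_saved(H1_GFORTRAN_O2, H1_GFORTRAN_O2_FAST)))
    print("-O2 over -O0, gfortran, h4: factor %.1f"
          % speedup(H4_GFORTRAN_O0, H4_GFORTRAN_O2))
```

With these particular rows, the Intel executable saves about 30% of
the gfortran time on h1, which is the same thing as gfortran taking
about 42% longer; other row pairs (e.g. g77) shift the figures by a
few percent.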
---------------------------------------------------------------

The Computers Used

h1: `magsim' (Magnet Group simulation server)
    2 * Xeon 3GHz CPU with HyperThreading, 2GB RAM
    (32bit system, though CPUs claimed as EM64T capable)
    GNU C Library 2.5, Compiled by GNU CC version 3.4.6,
      kernel headers linux 2.6.11
    Linux 2.6.17-gentoo-r4 #3 SMP PREEMPT
    g77:      gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
    gfortran: gcc version 4.1.1 (Gentoo 4.1.1-r3)
    ifort:    Intel Fortran compiler version 9.1

h2: `one' (Valery's desktop system)
    1 * Pentium4 3GHz CPU with HyperThreading, 1GB RAM
    GNU C Library 2.5, Compiled by GNU CC version 3.4.6,
      kernel headers 2.6.11
    g77:      gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)
    g77:      gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
    gfortran: gcc version 4.1.1 (Gentoo 4.1.1-r3)

h3: `cecill' (Cecilia's desktop system)
    1 * Pentium4 2.6GHz CPU -- HyperThreading DISABLED, 1GB RAM
    GNU C Library 2.5, Compiled by GNU CC version 3.4.6,
      kernel headers 2.6.11
    g77:      gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
    gfortran: gcc version 4.1.1 (Gentoo 4.1.1-r3)

h4: `gnu' (Nathaniel's desktop system)
    1 * AMD64-3700+ 2.2GHz CPU, 1GB RAM
    GNU C Library 2.5, Compiled by GNU CC version 3.4.6,
      kernel headers linux 2.6.11
    g77:      gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)
    gfortran: gcc version 4.1.1 (Gentoo 4.1.1-r3)
    ifort:    Intel Fortran compiler version 9.1, EM64T

h5: `magsim' (same host as h1, but with new system, using EM64T)
    GNU C Library 2.5, Compiled by GNU CC version 4.1.2 (Gentoo 4.1.2),
      kernel headers linux 2.6.17
    gfortran: gcc version 4.1.2 (Gentoo 4.1.2)
    ifort:    Intel Fortran compiler version 9.1, EM64T

( `shockley': old HP-UX, B.10.20 A 9000/715, PA-RISC )

---------------------------------------------------------------

`Expected' data for the Burgers test-case.
INPUT IS
  KCOL = 2, NINT = 10, ATOL(1) =0.10E-03, RTOL(1) =0.10E-03
  EPS =0.10E-02
  TOUT =0.10E+01
  IDID = 3
THE OUTPUT IS
  KCOL = 2, NINT = 15
  XOUT          UOUT          EXACTU
  0.000000E+00  0.100000E+01  0.100000E+01
  .....         ......        .....

---------------------------------------------------------------------

Compiler Options for Consideration

GCC (gfortran)
  -march=k8        (this works on all GCCs; amd64 is an alias on newer GCC)
  [-march=nocona   (for Xeon in 64-bit EM64T mode) -- not used in first tests]
  -march=prescott  (P4 with "pni", Xeon with EM64T support used in 32bit mode)
  -march=pentium4  (P4 without "pni")
  -O2 -fomit-frame-pointer  (my usual GCC optimisation for system)
  -mieee-fp
  -mieee-with-inexact
  -mfpmath=387|sse|387,sse
  -ffast-math      (speed over IEEE754 conformance)

Intel (ifort)
  -axN -xN         (SSE2 processors, e.g. P4 before the `Prescott New Instructions')
  -axP -xP         (SSE3 processors, e.g. P4 later models, Xeon)
  -O2              (lots of optimisations, e.g. loop unrolling)
  -O3              (still more -- more compilation time, no guarantee of better execution)
  -mno-ieee-fp     (increase precision and rearrange for speed, reduce consistency)
  -mieee-fp        (conform to IEEE FP std)
  -fp-model-strict (precise&except)
  -fp-model-fast fast=2  (high speed at expense of `accuracy')
  [-fpic           ("needed for shared objects")]

---------------------------------------------------------------------

Variation of Numerical Result

h1: magsim: dual Xeon: GNU compilers have problems.
    0.935918E-13  0.610623E-15  0.000000E+00  --> expected & all Intel compiler results
    0.935762E-13  0.630518E-15  0.425549E-17  --> all GNU compiler and opts (gfortran,g77)
h2: same as h1
h3: GNU compilers same as on h1 (Intel not used)
h4: all compilers, GNU (2of) and Intel, give the same, expected, result
h5: all compilers, GNU and Intel, give the same, expected, result
(`shockley': the /opt/fortran/bin/f77 gave the expected result)

---------------------------------------------------------------------

Timings of Execution Speed

host : usertime : compilation command

h1: 0.960s  ifort -axP -xP -O0 -mieee-fp -fp-model-strict
h1: 0.912s  ifort -axN -xN -O0 -mieee-fp -fp-model-strict
h1: 0.900s  ifort -O0 -mieee-fp -fp-model-strict
h1: 0.872s  g77 -march=prescott -O0 -ffast-math
h1: 0.868s  g77 -march=prescott -O0 -mieee-fp
h1: 0.868s  g77 -march=pentium4 -O0 -ffast-math
h1: 0.864s  g77 -march=pentium4 -O0
h1: 0.860s  g77 -march=prescott -O0
h1: 0.860s  g77 -march=pentium4 -O0 -mieee-fp
h1: 0.852s  gfortran-4.1.1 -march=pentium4 -O0 -mieee-fp
h1: 0.848s  gfortran-4.1.1 -march=prescott -O0
h1: 0.848s  gfortran-4.1.1 -march=pentium4 -O0 -ffast-math
h1: 0.844s  gfortran-4.1.1 -march=prescott -O0 -ffast-math
h1: 0.836s  gfortran-4.1.1 -march=pentium4 -O0
h1: 0.832s  gfortran-4.1.1 -march=prescott -O0 -mieee-fp
h1: 0.680s  g77 -march=prescott -O3 -mieee-fp
h1: 0.672s  g77 -march=prescott -O3
h1: 0.672s  g77 -march=pentium4 -O3 -mieee-fp
h1: 0.668s  g77 -march=pentium4 -O2
h1: 0.664s  g77 -march=pentium4 -O3
h1: 0.664s  g77 -march=pentium4 -O2 -mieee-fp
h1: 0.660s  g77 -march=prescott -O2 -mieee-fp
h1: 0.656s  g77 -march=prescott -O2
h1: 0.636s  gfortran-4.1.1 -march=prescott -O2 -mieee-fp
h1: 0.632s  gfortran-4.1.1 -march=pentium4 -O2
h1: 0.628s  gfortran-4.1.1 -march=prescott -O2
h1: 0.628s  gfortran-4.1.1 -march=pentium4 -O2 -mieee-fp
h1: 0.624s  gfortran-4.1.1 -march=prescott -O3 -mieee-fp
h1: 0.620s  gfortran-4.1.1 -march=prescott -O3
h1: 0.612s  gfortran-4.1.1 -march=pentium4 -O3 -mieee-fp
h1: 0.608s  gfortran-4.1.1 -march=pentium4 -O3
h1: 0.592s  g77 -march=prescott -O3 -ffast-math
h1: 0.588s  g77 -march=pentium4 -O2 -ffast-math
h1: 0.584s  g77 -march=pentium4 -O3 -ffast-math
h1: 0.580s  g77 -march=prescott -O2 -ffast-math
h1: 0.576s  gfortran-4.1.1 -march=prescott -O2 -ffast-math
h1: 0.560s  gfortran-4.1.1 -march=pentium4 -O3 -ffast-math
h1: 0.560s  gfortran-4.1.1 -march=pentium4 -O2 -ffast-math
h1: 0.552s  gfortran-4.1.1 -march=prescott -O3 -ffast-math
h1: 0.492s  ifort -axP -xP -O3 -mieee-fp -fp-model-strict
h1: 0.492s  ifort -axN -xN -O3 -mieee-fp -fp-model-strict
h1: 0.480s  ifort -axN -xN -O2 -mieee-fp -fp-model-strict
h1: 0.460s  ifort -O3 -mieee-fp -fp-model-strict
h1: 0.452s  ifort -axP -xP -O2 -mieee-fp -fp-model-strict
h1: 0.448s  ifort -O2 -mieee-fp -fp-model-strict

h2: 0.936s  ifort -axN -xN -O0 -mieee-fp -fp-model-strict
h2: 0.860s  ifort -O0 -mno-ieee-fp
h2: 0.860s  ifort -O0 -mieee-fp -fp-model-strict
h2: 0.860s  ifort -axN -xN -O0 -mno-ieee-fp
h2: 0.844s  g77 -march=pentium4 -O0 -ffast-math
h2: 0.840s  g77 -march=pentium4 -O0
h2: 0.832s  g77 -march=pentium4 -O0 -mieee-fp
h2: 0.828s  gfortran-4.1.1 -march=pentium4 -O0
h2: 0.820s  gfortran-4.1.1 -march=pentium4 -O0 -ffast-math
h2: 0.812s  gfortran-4.1.1 -march=pentium4 -O0 -mieee-fp
h2: 0.680s  g77 -march=pentium4 -O2 -mieee-fp
h2: 0.680s  g77 -march=pentium4 -O2
h2: 0.676s  g77 -march=pentium4 -O3
h2: 0.656s  g77 -march=pentium4 -O3 -mieee-fp
h2: 0.632s  gfortran-4.1.1 -march=pentium4 -O3
h2: 0.624s  gfortran-4.1.1 -march=pentium4 -O2
h2: 0.620s  gfortran-4.1.1 -march=pentium4 -O3 -mieee-fp
h2: 0.616s  gfortran-4.1.1 -march=pentium4 -O2 -mieee-fp
h2: 0.608s  g77 -march=pentium4 -O2 -ffast-math
h2: 0.604s  g77 -march=pentium4 -O3 -ffast-math
h2: 0.588s  gfortran-4.1.1 -march=pentium4 -O3 -ffast-math
h2: 0.568s  gfortran-4.1.1 -march=pentium4 -O2 -ffast-math
h2: 0.468s  ifort -axN -xN -O3 -mieee-fp -fp-model-strict
h2: 0.456s  ifort -O3 -mieee-fp -fp-model-strict
h2: 0.448s  ifort -O2 -mieee-fp -fp-model-strict
h2: 0.444s  ifort -axN -xN -O2 -mieee-fp -fp-model-strict
h2: 0.412s  ifort -O3 -mno-ieee-fp
h2: 0.412s  ifort -axN -xN -O3 -mno-ieee-fp
h2: 0.404s  ifort -O2 -mno-ieee-fp
h2: 0.372s  ifort -axN -xN -O2 -mno-ieee-fp

h3: 0.976s  g77 -march=pentium4 -O0
h3: 0.972s  g77 -march=pentium4 -O0 -mieee-fp
h3: 0.956s  gfortran-4.1.1 -march=pentium4 -O0 -mieee-fp
h3: 0.952s  gfortran-4.1.1 -march=pentium4 -O0
h3: 0.928s  gfortran-4.1.1 -march=pentium4 -O0 -ffast-math
h3: 0.924s  g77 -march=pentium4 -O0 -ffast-math
h3: 0.816s  g77 -march=pentium4 -O2
h3: 0.800s  gfortran-4.1.1 -march=pentium4 -O2
h3: 0.796s  g77 -march=pentium4 -O2 -mieee-fp
h3: 0.780s  g77 -march=pentium4 -O3 -mieee-fp
h3: 0.768s  g77 -march=pentium4 -O3
h3: 0.736s  gfortran-4.1.1 -march=pentium4 -O2 -mieee-fp
h3: 0.720s  g77 -march=pentium4 -O2 -ffast-math
h3: 0.708s  gfortran-4.1.1 -march=pentium4 -O3 -mieee-fp
h3: 0.708s  gfortran-4.1.1 -march=pentium4 -O3
h3: 0.704s  g77 -march=pentium4 -O3 -ffast-math
h3: 0.664s  gfortran-4.1.1 -march=pentium4 -O2 -ffast-math
h3: 0.652s  gfortran-4.1.1 -march=pentium4 -O3 -ffast-math

h4: 1.048s  gfortran-4.1.1 -march=k8 -O0
h4: 1.044s  gfortran-4.1.1 -march=k8 -O0 -ffast-math
h4: 1.035s  gfortran-4.1.1 -march=k8 -O0 -mieee-fp
h4: 0.961s  ifort -O0 -mno-ieee-fp
h4: 0.961s  ifort -axP -xP -O0 -mno-ieee-fp
h4: 0.954s  ifort -axW -xW -O0 -mno-ieee-fp
h4: 0.943s  ifort -axW -xW -O0 -mieee-fp -fp-model-strict
h4: 0.937s  ifort -O0 -mieee-fp -fp-model-strict
h4: 0.935s  ifort -axP -xP -O0 -mieee-fp -fp-model-strict
h4: 0.906s  g77 -march=k8 -O0 -ffast-math
h4: 0.905s  g77 -march=k8 -O0 -mieee-fp
h4: 0.897s  g77 -march=k8 -O0
h4: 0.371s  ifort -axW -xW -O3 -mieee-fp -fp-model-strict
h4: 0.355s  ifort -axW -xW -O2 -mieee-fp -fp-model-strict
h4: 0.334s  g77 -march=k8 -O2
h4: 0.330s  ifort -O3 -mieee-fp -fp-model-strict
h4: 0.327s  ifort -O2 -mieee-fp -fp-model-strict
h4: 0.326s  g77 -march=k8 -O3 -mieee-fp
h4: 0.325s  gfortran-4.1.1 -march=k8 -O3
h4: 0.325s  g77 -march=k8 -O2 -mieee-fp
h4: 0.318s  gfortran-4.1.1 -march=k8 -O3 -mieee-fp
h4: 0.317s  ifort -axW -xW -O3 -mno-ieee-fp
h4: 0.316s  g77 -march=k8 -O3
h4: 0.314s  gfortran-4.1.1 -march=k8 -O2
h4: 0.308s  gfortran-4.1.1 -march=k8 -O2 -mieee-fp
h4: 0.298s  gfortran-4.1.1 -march=k8 -O3 -ffast-math
h4: 0.295s  ifort -axW -xW -O2 -mno-ieee-fp
h4: 0.292s  ifort -O3 -mno-ieee-fp
h4: 0.292s  g77 -march=k8 -O2 -ffast-math
h4: 0.288s  gfortran-4.1.1 -march=k8 -O2 -ffast-math
h4: 0.287s  ifort -O2 -mno-ieee-fp
h4: 0.285s  g77 -march=k8 -O3 -ffast-math

h5: 1.056s  gfortran -O0 -mieee-fp
h5: 1.052s  gfortran -O0
h5: 0.892s  gfortran -O0 -march=nocona
h5: 0.884s  gfortran -O0 -march=nocona -mieee-fp
h5: 0.852s  ifort -O0 -xP -mp1
h5: 0.852s  ifort -O0 -xP -mieee-fp
h5: 0.852s  ifort -O0 -mp1
h5: 0.852s  ifort -O0 -mieee-fp
h5: 0.576s  gfortran -O3
h5: 0.552s  gfortran -O2 -mieee-fp
h5: 0.548s  gfortran -O3 -mieee-fp
h5: 0.544s  gfortran -O2
h5: 0.480s  ifort -O3 -xP -mieee-fp
h5: 0.460s  ifort -O2 -xP -mieee-fp
h5: 0.456s  ifort -O2 -mieee-fp
h5: 0.436s  ifort -O3 -mieee-fp
h5: 0.424s  gfortran -O2 -march=nocona
h5: 0.420s  gfortran -O3 -march=nocona -mieee-fp
h5: 0.416s  gfortran -O3 -march=nocona
h5: 0.416s  gfortran -O2 -march=nocona -mieee-fp
h5: 0.388s  ifort -O3 -xP -mp1
h5: 0.356s  ifort -O3 -mp1
h5: 0.356s  ifort -O2 -mp1
h5: 0.352s  ifort -O2 -xP -mp1

# grep '^h[1-4]: *0' filename \
    | sed -e 's/^\(h[0-4]\): *\(0[^ ]*\)\([\t ]*\)/\2: \1 /' \
    | sort -rn

---------------------------------------------------------------------

An aside (more relevant to those who don't use compiled languages).

Comparison of "nbench" (modified matlab benchmark, just non-graphical
tests) on matlab-7.4 (2007a) on the different hosts.
      LU     FFT    ODE    Sparse  TOTAL
h1    0.25   0.29   0.37   0.06    0.97   (32bit)
h2    0.24   0.30   0.36   0.06    0.96   (32bit)
h3    0.27   0.30   0.37   0.07    1.01   (32bit)
h4    0.26   0.19   0.18   0.05    0.67   (32bit)
h5    0.20   0.29   0.33   0.06    0.86   (64bit)
h5    0.24   0.27   0.37   0.06    0.94   (32bit matlab, 64bit sys)

So, for matlab's FFT and ODE, at any rate (the most useful to me of
the four things here), the AMD64 thrashes the others by some tens of
percent (nearly double the speed on ODE). Is this due more to the
pipelining and memory controller, or to the extended registers
available? That would be interesting to see if magsim (h1) is indeed
upgraded to a 64bit system and turns out to support that (it was not
realised on purchase that the Xeons are apparently new enough to have
EM64T).

---------------------------------------------------------------------
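To put numbers on "some tens of percent" and "nearly double", here is
a small Python sketch that divides the h1 figures by the h4 figures
from the nbench table above. It assumes the table entries are times
(lower is faster), which is how the text reads them. The dictionary
and function names are illustrative only.

```python
# Illustrative only: names are hypothetical. Values are copied from the
# nbench table; they are assumed to be times, so lower means faster.
NBENCH = {
    "h1": {"LU": 0.25, "FFT": 0.29, "ODE": 0.37, "Sparse": 0.06, "TOTAL": 0.97},
    "h4": {"LU": 0.26, "FFT": 0.19, "ODE": 0.18, "Sparse": 0.05, "TOTAL": 0.67},
}


def ratio(slow_host, fast_host, test):
    """Time on slow_host divided by time on fast_host for one test.

    A value above 1 means fast_host really is the faster of the two.
    """
    return NBENCH[slow_host][test] / NBENCH[fast_host][test]


if __name__ == "__main__":
    for test in ("LU", "FFT", "ODE", "Sparse", "TOTAL"):
        print("%-6s h1/h4 = %.2f" % (test, ratio("h1", "h4", test)))
```

On these figures the ODE ratio is just over 2 and the FFT ratio about
1.5, while LU is marginally below 1, i.e. h1 is slightly ahead there.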