Benchmarking: HPL (optimised) with OpenMPI and Vanilla CentOS

Install OpenMPI

Optionally rebuild OpenMPI, or install the packaged version:

  yum install openmpi openmpi-devel

Set up the user environment

# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
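
To check the environment is picked up, a quick sanity check along these lines should work with the stock CentOS OpenMPI packages (the exact version output will vary):

  # reload the environment and confirm the MPI toolchain is on the PATH
  source ~/.bashrc
  which mpicc mpirun
  mpirun --version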

Build ATLAS

NOTE" Ensure throttling is disabled as ATLAS will give out

/etc/init.d/cpuspeed stop 
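
If the cpuspeed service is not installed, the governors can usually be inspected (and forced to performance) directly through sysfs; this is only a rough sketch using the standard cpufreq paths, which may not exist on every kernel:

# check the current governor on every core (no output means cpufreq is not active)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null
# optionally pin every core to the performance governor before running configure
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -f "$g" ] && echo performance > "$g"
done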

# you may also need to apply this patch to ATLAS 3.10.1 (it makes the probe report throttling as disabled)
patch -p0 CONFIG/src/probe_arch.c << EOF
@@ -238,8 +238,7 @@ int main(int nargs, char **args)
       printf("CPU MHZ=%d\n",
              ProbeOneInt(OS, asmd, targ, "-m", "CPU MHZ=", &sure));
    if (flags & Pthrottle)
-      printf("CPU THROTTLE=%d\n",
-             ProbeOneInt(OS, asmd, targ, "-t", "CPU THROTTLE=", &sure));
+      printf("CPU THROTTLE=0\n");
    if (flags & P64)
    {
       if (asmd == gas_x86_64)
EOF

Download from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download (or a more recent release: http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.3/atlas3.10.3.tar.bz2)

  wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2
  tar jxvf atlas3.10.1.tar.bz2 
  mkdir atlas3.10.1_build
  cd atlas3.10.1_build/
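  # ATLAS refuses to configure inside its own source tree, hence the separate
  # build directory; point the path below at wherever the source was unpacked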
  /root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
  make 
  make check 
  make install
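
Before moving on it is worth checking that the static libraries landed under the install prefix; on a typical ATLAS build the lib directory should contain at least the following (names can vary slightly between versions):

  ls /opt/atlas/3.10.1/lib
  # expect libatlas.a, libcblas.a, libf77blas.a and liblapack.a among others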

Get Linpack

The latest versions are available from: http://www.netlib.org/benchmark/hpl/

  wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
  tar zxvf hpl-2.1.tar.gz

Set up the Makefile

  cd hpl-2.1
  cp setup/Make.Linux_PII_CBLAS Make.Linux_JB

Edit the Makefile

# Change the following
[root@localhost hpl-2.1]# diff setup/Make.Linux_PII_CBLAS Make.Linux_JB
64c64
< ARCH         = Linux_PII_CBLAS
---
> ARCH         = Linux_JB
70c70
< TOPdir       = $(HOME)/hpl
---
> TOPdir       = $(HOME)/scratch/hpl-2.1
84,86c84,86
< MPdir        = /usr/local/mpi
< MPinc        = -I$(MPdir)/include
< MPlib        = $(MPdir)/lib/libmpich.a
---
> MPdir        = /usr/lib64/openmpi
> MPinc        = -I/usr/include/openmpi-x86_64
> MPlib        = $(MPdir)/lib/libmpi.so
95c95
< LAdir        = $(HOME)/netlib/ARCHIVES/Linux_PII
---
> LAdir        = /opt/atlas/3.10.1/lib
176c176
< LINKER       = /usr/bin/g77
---
> LINKER       = mpif90
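
The same edits can also be applied non-interactively; the sed sketch below is only illustrative and assumes the default variable spelling in the template Makefile plus the paths used above:

  sed -i \
      -e 's|^ARCH  *=.*|ARCH         = Linux_JB|' \
      -e 's|^TOPdir  *=.*|TOPdir       = $(HOME)/scratch/hpl-2.1|' \
      -e 's|^MPdir  *=.*|MPdir        = /usr/lib64/openmpi|' \
      -e 's|^MPinc  *=.*|MPinc        = -I/usr/include/openmpi-x86_64|' \
      -e 's|^MPlib  *=.*|MPlib        = $(MPdir)/lib/libmpi.so|' \
      -e 's|^LAdir  *=.*|LAdir        = /opt/atlas/3.10.1/lib|' \
      -e 's|^LINKER  *=.*|LINKER       = mpif90|' \
      Make.Linux_JB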

Build

  make arch=Linux_JB

Run (check that the run works OK with the default HPL.dat file, which is set up for 4 cores)

  cd bin/Linux_JB
  mpirun -np 4 ./xhpl

Edit the HPL.dat to optimise

  • Problem size (N): the problem size should be as large as will fit in memory to get the best performance. Our sample system had a total of 16GB. There are roughly 125 million double-precision elements per 1GB of memory, so 16GB of RAM holds about 2 billion double-precision elements, and the square root of that number is 44721. You need to leave some memory for the operating system and other processes, so as a rule of thumb take around 80% of that figure as a starting point, which here is roughly 35777. N should also divide evenly over the process grid and ideally be a multiple of the block size NB, so rounding 35777 down to a multiple of NB * P * Q (128 * 2 * 3 = 768) gives 35328. N.B. if the problem size is too large the system starts swapping and performance degrades badly.
  • Block size (NB): HPL uses the block size NB both for data distribution and for computational granularity. A very small NB limits performance because little data reuse occurs and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size. The following have been found to work well: 80-216 for IA32; 128-192 for IA64 with 3MB cache; 400 for IA64 with 4MB cache; and 130 for Woodcrest.
  • Process grid ratio (P x Q): this depends on the physical interconnect. P and Q should be approximately equal, with Q slightly larger than P, and P * Q should equal the number of available cores. Our sample system had a 6-core CPU, so P = 2 and Q = 3.

These numbers can be auto-generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
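
As a rough local alternative, the arithmetic from the list above can be scripted; the sketch below only produces a starting point and assumes the 16GB machine, NB = 128 and the 2 x 3 grid used on this page:

  # estimate Ns for HPL.dat: ~80% of the largest N that fits in RAM,
  # rounded down to a multiple of NB * P * Q so the grid divides it evenly
  MEM_GB=16; NB=128; P=2; Q=3
  awk -v m=$MEM_GB -v nb=$NB -v p=$P -v q=$Q 'BEGIN {
      nmax = sqrt(m * 1e9 / 8);          # doubles are 8 bytes: ~44721 for 16GB
      n    = 0.8 * nmax;                 # leave headroom for the OS: ~35777
      step = nb * p * q;                 # 768 here
      printf "Ns = %d\n", int(n / step) * step   # 35328
  }'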

Example of HPL.dat file from initial testing:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
35328         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
3            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)