Benchmarking: HPL (optimised) with OpenMPI and Vanilla CentOS
Install OpenMPI
Optionally rebuild OpenMPI ()
yum install openmpi openmpi-devel
Setup the user environment
# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
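A quick sanity check that the new environment is picked up (a minimal check, assuming the stock CentOS OpenMPI paths above):
# open a new shell or re-source ~/.bashrc first
source ~/.bashrc
which mpicc mpirun
mpirun --version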
NOTE" Ensure throttling is disabled as ATLAS will give out
/etc/init.d/cpuspeed stop
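To confirm throttling really is off before running configure (a rough check, assuming the cpufreq sysfs interface is present on your kernel):
# governor should read "performance", or the cpufreq directory should be absent entirely
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null
# on an idle machine the reported MHz should sit at the CPU's rated clock speed
grep MHz /proc/cpuinfo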
# may also have to apply this patch for ATLAS 3.10.1
patch -p0 CONFIG/src/probe_arch.c << EOF
@@ -238,8 +238,7 @@ int main(int nargs, char **args)
printf("CPU MHZ=%d\n",
ProbeOneInt(OS, asmd, targ, "-m", "CPU MHZ=", &sure));
if (flags & Pthrottle)
- printf("CPU THROTTLE=%d\n",
- ProbeOneInt(OS, asmd, targ, "-t", "CPU THROTTLE=", &sure));
+ printf("CPU THROTTLE=0\n");
if (flags & P64)
{
if (asmd == gas_x86_64)
EOF
Download from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download
wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2
tar jxvf atlas3.10.1.tar.bz2
mkdir atlas3.10.1_build
cd atlas3.10.1_build/
/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
make
make check
make install
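Before moving on it is worth confirming the libraries actually landed under the prefix (paths as used above; ATLAS also provides a make ptcheck target for the threaded libraries if you want a further sanity check):
ls -l /opt/atlas/3.10.1/lib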
Get Linpack
Latest versions are available from: http://www.netlib.org/benchmark/hpl/
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar zxvf hpl-2.1.tar.gz
Setup the Makefile
cd hpl-2.1
cp setup/Make.Linux_PII_CBLAS Make.Linux_JB
Edit the Makefile
# Change the following
[root@localhost hpl-2.1]# diff setup/Make.Linux_PII_CBLAS Make.Linux_JB
64c64
< ARCH = Linux_PII_CBLAS
---
> ARCH = Linux_JB
70c70
< TOPdir = $(HOME)/hpl
---
> TOPdir = $(HOME)/scratch/hpl-2.1
84,86c84,86
< MPdir = /usr/local/mpi
< MPinc = -I$(MPdir)/include
< MPlib = $(MPdir)/lib/libmpich.a
---
> MPdir = /usr/lib64/openmpi
> MPinc = -I/usr/include/openmpi-x86_64
> MPlib = $(MPdir)/lib/libmpi.so
95c95
< LAdir = $(HOME)/netlib/ARCHIVES/Linux_PII
---
> LAdir = /usr/lib64/atlas-sse3/
176c176
< LINKER = /usr/bin/g77
---
> LINKER = mpif90
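The MPdir/MPinc/MPlib values above can be cross-checked against what the Open MPI compiler wrappers report; the flags below are Open MPI's own wrapper options:
# the -I and -L/-l paths printed here should match MPinc and MPlib in Make.Linux_JB
mpicc --showme:compile
mpicc --showme:link
ls -l /usr/lib64/openmpi/lib/libmpi.so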
Build
make arch=Linux_JB
Run (check the run works OK with the default HPL.dat file, which is set up for 4 cores)
cd bin/Linux_JB
mpirun -np 4 ./xhpl
Edit the HPL.dat to optimise
- Problem size (N): For best performance the problem size should be the largest that fits in memory. Our sample system had a total of 16GB. There are 125 million double-precision elements per 1GB of memory, so 16GB of RAM holds about 2 billion elements; the square root of that is 44721. You need to leave some memory for the operating system and other processes, so as a rule of thumb 80% of total memory is a starting point for the problem size, which here gives roughly 35777. N also needs to divide evenly across the process grid (N / (P * Q) an integer), so 35328 is a reasonable number; see the sketch after this list. N.B. If the problem size is too large, the system starts swapping and performance degrades badly.
- Block size (NB): HPL uses the block size NB both for data distribution and for computational granularity. A very small NB limits performance because little data reuse occurs and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval, and the best value depends on cache size. The following have been found to work well: 80-216 for IA32; 128-192 for IA64 with 3MB cache; 400 for IA64 with 4MB cache; and 130 for Woodcrest.
- Process grid ratio (P x Q): This depends on the physical interconnect. P and Q should be approximately equal, with Q slightly larger than P, and P * Q should equal the number of available cores. Our sample system had a 6-core CPU, so P = 2 and Q = 3.
These numbers can be auto-generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
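For a quick offline estimate, the same arithmetic can be scripted. The sketch below simply reproduces the rule of thumb above (16GB, NB=128 and a 2x3 grid are this system's values; one common convention, rounding N down to a multiple of NB*P*Q, lands on the same 35328):
MEM_GB=16; NB=128; P=2; Q=3
# ~125M doubles per GB, take 80% of total memory, then round down to a multiple of NB*P*Q
N_FULL=$(awk -v m=$MEM_GB 'BEGIN { printf "%d", sqrt(m * 1e9 / 8) * 0.80 }')
STEP=$(( NB * P * Q ))
N=$(( N_FULL / STEP * STEP ))
echo "Suggested problem size N: $N"    # prints 35328 for the values above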
Example of HPL.dat file from initial testing:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
35328 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
3 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
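With the tuned HPL.dat in place, the benchmark is rerun with one rank per core so the rank count matches P x Q (6 for the 2 x 3 grid above):
mpirun -np 6 ./xhpl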