Benchmarking: HPL (optimised) with Intel MKL and Intel MPI
Getting the software
- Intel provide prebuilt binaries for use with Intel MPI and MKL.
- The latest version can be downloaded from: http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
- The steps below assume MKL and Intel MPI are already installed (via Intel Cluster Studio XE in this example); a typical environment setup is sketched below.
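Before running the benchmarks, make sure the MKL and Intel MPI environments are loaded. The exact script locations depend on how the suite was installed; the lines below are only a sketch, assuming a default /opt/intel install (replace <version> with the Intel MPI version that shipped with your Cluster Studio XE):

# assumed default /opt/intel locations -- adjust to match your site
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/bin64/mpivars.sh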
Parameter optimisation
The main parameters you need to consider when running HPL:
- Problem size (N): For the best performance, the problem size should be the largest that fits in memory. For example, with 10 nodes of 1 GB RAM each, the total memory is 10 GB, which holds roughly 1342 million double-precision elements; the square root of that number is about 36635. Some memory must be left for the operating system and other processes, so as a rule of thumb a problem size using about 80% of total memory is a good starting point (here, roughly 33000). If the problem size is too large, memory is swapped out and performance degrades.
- Block size (NB): HPL uses the block size NB for data distribution as well as for computational granularity. A very small NB limits performance because there is little data reuse and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on the cache size: 80-216 for IA32, 128-192 for IA64 with 3 MB cache, 400 for IA64 with 4 MB cache, and 130 for Woodcrest have been found to work well.
- Process grid ratio (P x Q): This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P. For example, for a 480-processor cluster, 20 x 24 is a good grid; a small helper for choosing P and Q is sketched after the tips below.
Tips: You can also try changing the node order in the machine file to check for performance improvements. Choose all of the above parameters by trial and error to get the best performance.
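As a quick aid for picking the process grid, the sketch below (a hypothetical helper, not part of HPL or the Intel package) lists the factor pairs P x Q of a given process count with P <= Q; the last pair printed is the most "square":

#!/bin/bash
# List candidate P x Q process grids (P <= Q) for a given number of MPI processes.
NP=${1:?Usage: $0 <total number of MPI processes>}
for ((P=1; P*P<=NP; P++)); do
    if [ $((NP % P)) -eq 0 ]; then
        echo "P=${P}  Q=$((NP / P))"
    fi
done

For 480 processes the last pair printed is P=20 Q=24, matching the example above.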
You can also use a simple PHP web tool: enter your system specs and it will suggest optimal input parameters for your HPL input file before you run the benchmark on the cluster. The tool is hosted on SourceForge at the URL below:
http://hpl-calculator.sourceforge.net
Alternative bash script to calculate N
[david@head-boston ~]$ cat /home/david/bin/calc_hpl_N
#!/bin/bash
# Estimate an HPL problem size N from the node count and memory per node.
# Example: ./calc_hpl_N 32 8  ->  N of roughly 143000
if [ $# -ne 2 ]
then
    echo "";
    echo "Usage: $0 [Number of nodes] [Memory per node (GB)]" >&2;
    echo "Example: $0 32 8";
    exit 1
fi
NUM_NODES=$1;
MEM_PER_NODE=$2;
echo -e "---------------";
echo -e "[\E[32mNodes\E[39m]: ${NUM_NODES} ";
echo -e "[\E[32mMemory\E[39m]: ${MEM_PER_NODE}GB";
# N = sqrt(0.8 * nodes * GB-per-node * 1e8), where 1e8 is a conservative round
# figure for double-precision elements per GB (1 GB holds 1.25e8 doubles).
N=`echo "sqrt ( ${NUM_NODES} * ${MEM_PER_NODE} * 0.8 * 100000000)" | bc`
echo -e "---------------";
echo -e "[\E[32mN\E[39m]: ${N}";
echo -e "---------------";Running the benchmark on a Single Node
wget http://registrationcenter.intel.com/irc_nas/3669/l_lpk_p_11.1.1.004.tgz
tar zxvf l_lpk_p_11.1.1.004.tgz
cd linpack_11.1.1/benchmarks/linpack
./runme_xeon64

This will run locally on a single node across a range of parameters. Expect output along the lines of the following:
[david@compute002 linpack]$ ./runme_xeon64
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
Tue Jan 28 10:11:28 GMT 2014
Intel(R) Optimized LINPACK Benchmark data
Current date/time: Tue Jan 28 10:11:28 2014
CPU frequency: 2.699 GHz
Number of CPUs: 2
Number of cores: 16
Number of threads: 16
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=16200901024, at the size=45000
=================== Timing linear equation system solver ===================
Size LDA Align. Time(s) GFlops Residual Residual(norm) Check
1000 1000 4 0.020 33.8085 8.724688e-13 2.975343e-02 pass
1000 1000 4 0.006 117.8493 8.724688e-13 2.975343e-02 pass
1000 1000 4 0.006 120.3352 8.724688e-13 2.975343e-02 pass
1000 1000 4 0.006 119.2729 8.724688e-13 2.975343e-02 pass
2000 2000 4 0.031 170.4505 4.701128e-12 4.089406e-02 pass
2000 2000 4 0.031 172.9420 4.701128e-12 4.089406e-02 pass
5000 5008 4 0.365 228.3127 2.434170e-11 3.394253e-02 pass
5000 5008 4 0.362 230.6524 2.434170e-11 3.394253e-02 pass
10000 10000 4 2.670 249.7646 8.916344e-11 3.143993e-02 pass
10000 10000 4 2.666 250.1105 8.916344e-11 3.143993e-02 pass
15000 15000 4 8.508 264.5137 2.165846e-10 3.411244e-02 pass
15000 15000 4 8.571 262.5517 2.165846e-10 3.411244e-02 pass
18000 18008 4 14.205 273.7570 2.945255e-10 3.225417e-02 pass
18000 18008 4 14.387 270.2939 2.945255e-10 3.225417e-02 pass
20000 20016 4 19.636 271.6554 3.831049e-10 3.391318e-02 pass

Running the benchmark across a Cluster
Use the mp_linpack benchmark for the cluster tests.
# Download and unpack as above, but this time work in the mp_linpack directory
cd linpack_11.1.1/benchmarks/mp_linpack/bin_intel/intel64

By default, the input file to edit for the HPL parameters (N, NB, P and Q, etc.) is HPL_serial.dat.
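For reference, the fragment below sketches the lines that are typically tuned, following the standard HPLinpack input-file layout and using the values reported by the cluster run further down (N=403200, NB=224, P=Q=16); the full file contains further algorithm settings (PFACT, NBMIN, BCAST, etc.) that can normally be left at their defaults:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
403200       Ns
1            # of NBs
224          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
16           Ps
16           Qs
16.0         threshold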
For Intel MPI we'll need to bring up the mpd daemons.
# create a hosts file,
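# e.g. ./hosts contains one compute node hostname per line
# (names here taken from the mpdtrace output below):
#   compute000
#   compute004
#   compute011
#   ...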
mpdboot -n 16 -f ./hosts -r ssh -1
mpdboot -n 16 -f ./hosts

Verify it's working OK:
[mpiuser@compute000 intel64]$ mpdtrace
compute000
compute004
compute011
..

Edit runme_intel64 and modify the mpiexec line (include -PSM for the QLogic adapters):
#mpiexec -np 4 ./xhpl_intel64 | tee -a xhpl_intel64_outputs.txt
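# Notes on the line below: -np 256 is the total number of MPI ranks (matching the
# 16 x 16 process grid), -perhost 1 places one rank per node (the threaded MKL
# kernels then use the remaining cores on each node), and -PSM selects the PSM
# interface for the QLogic adapters as noted above.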
mpirun -PSM -np 256 -perhost 1 -hostfile ./hosts ./xhpl_intel64 | tee -a boston_xhpl_intel64.log

Output:
[mpiuser@compute000 intel64]$ ./runme_intel64
This is a SAMPLE run script. Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
Tue Jan 28 16:32:34 GMT 2014
This run was done on: Tue Jan 28 16:32:34 GMT 2014
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 403200
NB : 224
PMAP : Row-major process mapping
P : 16
Q : 16
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
Column=002240 Fraction=0.005 Mflops=5131712.36
Column=004256 Fraction=0.010 Mflops=5202156.44
Column=006272 Fraction=0.015 Mflops=5210892.07
Column=008288 Fraction=0.020 Mflops=5225158.81
Column=010304 Fraction=0.025 Mflops=5231852.98
Column=012320 Fraction=0.030 Mflops=5233990.21
Column=014336 Fraction=0.035 Mflops=5239867.83
Column=016352 Fraction=0.040 Mflops=5240982.58
Column=018368 Fraction=0.045 Mflops=5244348.49
Column=020384 Fraction=0.050 Mflops=5244236.32