Benchmarking: HPL (optimised) with Intel MKL and Intel MPI


Getting the software

Parameter optimisation

The main parameters you need to consider when running HPL are:

  • Problem size (N): The problem size should be the largest that fits in memory, to get the best performance. For example, if you have 10 nodes with 1 GB RAM each, total memory is 10 GB, i.e. nearly 1342 M double-precision (8-byte) elements; the square root of that number is 36635. You need to leave some memory for the operating system and other processes, so as a rule of thumb a problem size using about 80% of total memory is a good starting point (in this case, say, 33000). If the problem size is too large, the system swaps and performance degrades badly.
  • Block Size (NB): HPL uses the block size NB for data distribution as well as for computational granularity. A very small NB limits computational performance because little data reuse occurs, and the number of messages also increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size. Block sizes found to work well are 80-216 for IA32, 128-192 for IA64 with 3 MB cache, 400 for IA64 with 4 MB cache, and 130 for Woodcrest.
  • Process Grid Ratio (P x Q): This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P. For example, for a 480-processor cluster, a 20 x 24 grid is a good ratio.

Tips: You can also try changing the node order in the machine file to check for performance improvements. Choose all of the above parameters by trial and error to get the best performance; a small helper sketch for picking P and Q follows below.
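As a rough aid for the P x Q choice, the following sketch factorises a total process count into the near-square grid with Q >= P. The script name calc_hpl_grid is purely illustrative and not part of any package:

#!/bin/bash
 
if [ $# -ne 1 ]
then
        echo "";
        echo "Usage: $0 [Total number of MPI processes]" >&2;
        echo "Example: $0 480";
        exit 1
fi
 
NPROCS=$1;
P=1;
 
# walk i up to sqrt(NPROCS); the largest divisor found gives the most square grid
for (( i=1; i*i<=NPROCS; i++ ))
do
        if [ $(( NPROCS % i )) -eq 0 ]
        then
                P=$i;
        fi
done
Q=$(( NPROCS / P ));
 
echo "P = ${P}, Q = ${Q} (P x Q = $(( P * Q )) processes)";

For 480 processes this prints P = 20, Q = 24, matching the example above.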

You can also use a simple PHP web tool: enter your system specs and it will suggest optimal input parameters for your HPL file before running the benchmark on the cluster. The tool is hosted on SourceForge and can be accessed via the URL below:

http://hpl-calculator.sourceforge.net

Alternative bash script to calculate N

[david@head-boston ~]$ cat /home/david/bin/calc_hpl_N
#!/bin/bash
 
if [ $# -ne 2 ]
then
        echo "";
        echo "Usage: $0 [Number of nodes] [Memory per node (Gb)]" >&2;
        echo "Example: $0 32 8";
        exit 1
fi
 
NUM_NODES=$1;
MEM_PER_NODE=$2;
 
echo -e "---------------";
echo -e "[\E[32mNodes\E[39m]: ${NUM_NODES} ";
echo -e "[\E[32mMemory\E[39m]: ${MEM_PER_NODE}Gb";
 
# 80% of memory, in doubles: (GB * 1e9 bytes) / 8 bytes per double * 0.8 = GB * 100000000;
# N is the square root of that element count
N=`echo "sqrt ( ${NUM_NODES} * ${MEM_PER_NODE} * 0.8 * 100000000)" | bc`
 
echo -e "---------------";
echo -e "[\E[32mN\E[39m]: ${N}";
echo -e "---------------";

Running the benchmark on a Single Node

  wget http://registrationcenter.intel.com/irc_nas/3669/l_lpk_p_11.1.1.004.tgz
  tar zxvf l_lpk_p_11.1.1.004.tgz 
  cd linpack_11.1.1/benchmarks/linpack
  ./runme_xeon64

This will run locally on a single node across a range of parameters. Expect output along the lines of the following:

[david@compute002 linpack]$ ./runme_xeon64
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
Tue Jan 28 10:11:28 GMT 2014
Intel(R) Optimized LINPACK Benchmark data

Current date/time: Tue Jan 28 10:11:28 2014

CPU frequency:    2.699 GHz
Number of CPUs: 2
Number of cores: 16
Number of threads: 16

Parameters are set to:

Number of tests: 15
Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    

Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.020      33.8085  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.006      117.8493 8.724688e-13 2.975343e-02   pass
1000   1000   4      0.006      120.3352 8.724688e-13 2.975343e-02   pass
1000   1000   4      0.006      119.2729 8.724688e-13 2.975343e-02   pass
2000   2000   4      0.031      170.4505 4.701128e-12 4.089406e-02   pass
2000   2000   4      0.031      172.9420 4.701128e-12 4.089406e-02   pass
5000   5008   4      0.365      228.3127 2.434170e-11 3.394253e-02   pass
5000   5008   4      0.362      230.6524 2.434170e-11 3.394253e-02   pass
10000  10000  4      2.670      249.7646 8.916344e-11 3.143993e-02   pass
10000  10000  4      2.666      250.1105 8.916344e-11 3.143993e-02   pass
15000  15000  4      8.508      264.5137 2.165846e-10 3.411244e-02   pass
15000  15000  4      8.571      262.5517 2.165846e-10 3.411244e-02   pass
18000  18008  4      14.205     273.7570 2.945255e-10 3.225417e-02   pass
18000  18008  4      14.387     270.2939 2.945255e-10 3.225417e-02   pass
20000  20016  4      19.636     271.6554 3.831049e-10 3.391318e-02   pass
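The set of sizes swept by runme_xeon64 is read from an input file in the same directory (lininput_xeon64 in this package, if memory serves; check the runme_xeon64 script to confirm the exact name). Its layout is two header lines, the number of tests, and then one line each for problem sizes, leading dimensions, trials per size and data alignment in KB. A trimmed sketch consistent with the parameters echoed above:

  Sample Intel(R) Optimized LINPACK Benchmark data file (lininput_xeon64)
  Intel(R) Optimized LINPACK Benchmark data
  4                            # number of tests
  1000  2000  5000  10000      # problem sizes
  1000  2000  5008  10000      # leading dimensions
  4     2     2     2          # number of trials
  4     4     4     4          # alignment values (in KBytes)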

Running the benchmark across a Cluster

Use the mp_linpack benchmark for the cluster tests

  # As above to download / unpack, but we'll use the mp_ directory
  cd linpack_11.1.1/benchmarks/mp_linpack/bin_intel/intel64

By default, the file to edit for the HPL input parameters is HPL_serial.dat:

  vi HPL_serial.dat
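HPL_serial.dat follows the standard HPL.dat layout, so the lines you will normally change are the problem size (Ns), block size (NBs) and the process grid (Ps and Qs). A sketch of the relevant portion, filled in with the values used in the run below (treat the file shipped with the package as authoritative for the exact line ordering):

  HPLinpack benchmark input file
  Innovative Computing Laboratory, University of Tennessee
  HPL.out      output file name (if any)
  6            device out (6=stdout,7=stderr,file)
  1            # of problems sizes (N)
  403200       Ns
  1            # of NBs
  224          NBs
  0            PMAP process mapping (0=Row-,1=Column-major)
  1            # of process grids (P x Q)
  16           Ps
  16           Qs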

For Intel MPI we'll need to bring up the mpd daemons first.

  # create a hosts file (one node name per line), then boot an mpd ring across the 16 nodes
  mpdboot -n 16 -f ./hosts -r ssh -1
  # or, relying on the default remote shell settings:
  mpdboot -n 16 -f ./hosts
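The hosts file itself is just a plain list of node names, one per line, for example:

  compute000
  compute001
  compute002
  ...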

Verify it's working OK:

  [mpiuser@compute000 intel64]$ mpdtrace
  compute000
  compute004
  compute011
  ..

Edit runme_intel64 and modify the mpiexec line (include -PSM for QLogic adaptors):

#mpiexec -np 4 ./xhpl_intel64 | tee -a xhpl_intel64_outputs.txt

mpirun -PSM -np 256 -perhost 1 -hostfile ./hosts ./xhpl_intel64 | tee -a boston_xhpl_intel64.log
# Or for Mellanox, select the RDMA device rather than PSM:
mpirun -env I_MPI_DEVICE rdma -np 256 -perhost 1 -hostfile ./hosts ./xhpl_intel64 | tee -a boston_xhpl_intel64.log
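Note that -perhost 1 places a single MPI rank on each node, so the threaded MKL inside the binary is expected to drive the remaining cores. If you want to pin the thread count explicitly, exports along these lines can go in the run script before the mpirun line (16 threads is purely illustrative, matching the 16-core nodes shown earlier):

  export OMP_NUM_THREADS=16
  export MKL_NUM_THREADS=16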

Output:

[mpiuser@compute000 intel64]$ ./runme_intel64
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
Tue Jan 28 16:32:34 GMT 2014
This run was done on: Tue Jan 28 16:32:34 GMT 2014
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N        :  403200 
NB       :     224 
PMAP     : Row-major process mapping
P        :      16 
Q        :      16 
PFACT    :   Right 
NBMIN    :       4 
NDIV     :       2 
RFACT    :   Crout 
BCAST    :  1ringM 
DEPTH    :       0 
SWAP     : Mix (threshold = 64)
L1       : transposed form
U        : transposed form
EQUIL    : yes
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

Column=002240 Fraction=0.005 Mflops=5131712.36
Column=004256 Fraction=0.010 Mflops=5202156.44
Column=006272 Fraction=0.015 Mflops=5210892.07
Column=008288 Fraction=0.020 Mflops=5225158.81
Column=010304 Fraction=0.025 Mflops=5231852.98
Column=012320 Fraction=0.030 Mflops=5233990.21
Column=014336 Fraction=0.035 Mflops=5239867.83
Column=016352 Fraction=0.040 Mflops=5240982.58
Column=018368 Fraction=0.045 Mflops=5244348.49
Column=020384 Fraction=0.050 Mflops=5244236.32