Benchmarking: HPL (optimised) on both Intel CPUs and Xeon Phis
Getting the software
- Intel provides prebuilt offload binaries for use with Xeon Phis, Intel MPI and MKL.
- The latest version can be downloaded from: https://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
- The instructions below assume that MPSS, MKL and Intel MPI are already installed (via Cluster Studio XE in this example).
Binaries and Files to use and modify
To run HPL on both CPUs and Xeon Phis, the offload binaries must be used. These binaries, along with their corresponding files, can be found as follows:
[root@boston ~]# cd linpack_11.2.0/benchmarks/mp_linpack/bin_intel/intel64
[root@boston ~]# ls
HPL_offload.dat  runme_offload_intel64  runme_offload_intel64_dynamic  runme_offload_intel64_prv
xhpl_offload_intel64  xhpl_offload_intel64_dynamic
The *_dynamic binaries are linked dynamically against the Intel libraries, while the others are linked statically.
Parameter optimisation
The parameters to consider when running HPL in offload mode are explained in the following list, grouped by file:
- .dat file:
- Problem size N. For best performance the problem size should be as large as will fit in memory. For example, a node with 32 GB of RAM can hold 4096 M double-precision elements, and the square root of that is 65536, which is therefore the largest N that fits. You need to leave some memory for the operating system and other processes, so as a rule of thumb start at about 80% of that maximum (in this case, say, 52428). If the problem size is too large, the system starts swapping and performance degrades. To achieve better performance, also choose N divisible by NB*LCM(P,Q), where LCM is the least common multiple of the two numbers; a sketch of this calculation follows this list.
- Block size NB. HPL uses the block size NB both for the data distribution and for the computational granularity. A very small NB limits performance because little data reuse occurs and the number of messages increases. The best value of NB depends on the number of Phis:
- For 1 Phi, 960 is usually the best value.
- For 2 Phis, 1024 is usually the best value.
- These values are not strict; some experimentation may be needed to achieve the best performance. It is best to try both values for both configurations and compare the results.
- Process grid ratio P×Q. P and Q are the number of rows and columns in the process grid, respectively. P*Q must equal the number of MPI processes that HPL is using. For the hybrid offload version of the Intel Optimized MP LINPACK Benchmark, keep P and Q roughly equal.
- runme_offload_intel64[_dynamic] files:
- MPI_PROC_NUM. The total number of MPI processes (across all nodes).
- MPI_PER_NODE. The number of MPI processes on each cluster node. For the best HPL performance, enable non-uniform memory access (NUMA) and set MPI_PER_NODE equal to the number of NUMA sockets.
- NUMMIC. The number of Intel Xeon Phi coprocessors in each cluster node.
- runme_offload_intel64_prv file:
- This file automatically sets the following variables for each node:
- HPL_HOST_NODE. The MPI rank
- HPL_MIC_DEVICE. The id of each Phi for a particular MPI rank
- HPL_MIC_SHAREMODE. This variable gets set when there are more MPI processes than Phi cards in a node, so that a Phi card is shared between MPI ranks.
- The following variables can also be set by the user in this file:
- MKL_MIC_ENABLE. Needs to be set to 1 if the other MKL_MIC_* variables are to take effect.
- MKL_MIC_WORKDIVISION. A floating-point value between 0.0 and 1.0. Specifies the fraction of the work to be done on all the Intel Xeon Phis.
- HPL_MIC_EXQUEUES. An integer between 0 and 512; the default is 128. Specifies the queue size on an Intel Xeon Phi coprocessor. A larger number is typically better, but it increases the memory consumption on the coprocessor. If out-of-memory errors are encountered, try a lower number.
- More environment variables can be found in https://software.intel.com/en-us/node/484975 and https://software.intel.com/en-us/node/433580.
- If a different configuration is needed for some nodes, an if statement checking the hostname can be added, and the required environment variables can be set inside it (a sketch is given after the tips below).
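To make the problem-size calculation concrete, here is a minimal sketch in bash, using the 32 GB example and the rule of thumb described above; the NB, P and Q values are just examples, and the rounding step implements the NB*LCM(P,Q) advice:
#!/bin/bash
# Minimal sketch: choose a starting HPL problem size N (all values are examples).
MEM_BYTES=$((32 * 1024 * 1024 * 1024))  # 32 GB of RAM per node
NB=960                                  # block size, e.g. for 1 Phi per node
P=1; Q=1                                # process grid
# Largest N whose N x N double-precision matrix fits in memory.
NMAX=$(echo "sqrt($MEM_BYTES / 8)" | bc)
# Rule of thumb from above: start at roughly 80% of that maximum.
N=$((NMAX * 80 / 100))
# Round N down to a multiple of NB * LCM(P, Q).
gcd() { local a=$1 b=$2; while [ "$b" -ne 0 ]; do local t=$((a % b)); a=$b; b=$t; done; echo "$a"; }
LCM=$((P * Q / $(gcd "$P" "$Q")))
STEP=$((NB * LCM))
N=$((N / STEP * STEP))
echo "NMAX=$NMAX, starting N=$N (multiple of NB*LCM(P,Q)=$STEP)"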
Tips: You can also try changing the node order in the machine file and checking whether performance improves. Choose all of the above parameters by trial and error to get the best performance.
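As a rough sketch, the user-settable variables and the per-node if statement described above might look like this in runme_offload_intel64[_dynamic] and runme_offload_intel64_prv (the hostname, node counts and tuning values are illustrative assumptions, not recommendations; adapt them to the actual contents of your scripts):
# In runme_offload_intel64[_dynamic] -- illustrative 2-node example with
# 2 NUMA sockets and 2 Phis per node:
export MPI_PROC_NUM=4
export MPI_PER_NODE=2
export NUMMIC=2

# In runme_offload_intel64_prv -- optional user-set variables:
export MKL_MIC_ENABLE=1            # required before the other MKL_MIC_* variables take effect
export MKL_MIC_WORKDIVISION=0.85   # fraction of work done on the Phis (example value)
export HPL_MIC_EXQUEUES=128        # default queue size; lower it on out-of-memory errors

# Different settings for one particular node ("node003" is a hypothetical hostname):
if [ "$(hostname)" = "node003" ]; then
    export MKL_MIC_WORKDIVISION=0.75
fi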
Running the Benchmark
Before running HPL, the correct environment variables must be set:
# source <parent product directory>/bin/compilervars.sh intel64
# source <mpi directory>/bin64/mpivars.sh intel64
# source <mkl directory>/bin/mklvars.sh intel64
Alternatively, load the corresponding modules.
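For example, on a system that uses environment modules, something along these lines may suffice (the module names are site-specific and purely illustrative):
# Module names vary between sites; these are illustrative only.
module load intel/compiler intel/mpi intel/mkl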
One Node
- Modify the values of the parameters discussed above to correspond to a 1-node configuration.
- Modify the mpirun command in the runme_offload_intel64[_dynamic] files so that it does not use a hostfile (or uses a hostfile containing only the node you are running on), as sketched after these steps.
- Execute: ./runme_offload_intel64[_dynamic]
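As a rough sketch, the launch line in runme_offload_intel64 might end up looking like this for a single node (the exact contents of the script differ between MKL versions, and launching runme_offload_intel64_prv under mpirun is an assumption based on the file listing earlier; check your own script):
# Single node: no -hostfile argument, so all ranks start on the local node.
mpirun -np ${MPI_PROC_NUM} -perhost ${MPI_PER_NODE} ./runme_offload_intel64_prv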
Multiple Nodes
- Compose a hostfile that contains the hostnames of the desired nodes (./hosts in the example below)
- Start the mpd daemons:
mpdboot -n 16 -f ./hosts
- Check if they are working:
mpdtrace
- Modify the values of the parameters discussed above to correspond to n-node configuration. Basically you need to change:
- N (problem size)
- P and Q
- MPI_PROC_NUM variable
- Add the -hostfile and -perhost arguments to the mpirun command in the runme_offload_intel64[_dynamic] files, as sketched after these steps
- Run ./runme_offload_intel64[_dynamic]
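As a rough sketch, for a 4-node run with 2 MPI ranks per node the hostfile and the modified launch line might look like this (the hostnames, counts and the wrapper being launched are illustrative assumptions; adapt them to your cluster and to the actual contents of your runme script):
# ./hosts -- one hostname per line (illustrative names)
node001
node002
node003
node004

# Modified launch line in runme_offload_intel64[_dynamic], with
# MPI_PROC_NUM=8 and MPI_PER_NODE=2 for 4 nodes x 2 NUMA sockets:
mpirun -hostfile ./hosts -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_offload_intel64_prv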
NOTE:
When HPL is run, it prints out the number of Xeon Phis:
Number of Intel Xeon Phi coprocessors: 1
This number counts only one Intel Xeon Phi coprocessor per MPI process.