Benchmarking: HPL (optimised) with Intel MKL and Intel MPI
Getting the software
- Intel provide prebuilt HPL binaries for use with Intel MPI and MKL.
- The latest version can be downloaded from: http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
- The steps below assume that MKL and Intel MPI are already installed (via Intel Cluster Studio XE in this example); if their environment is not yet set up, see the sketch below.
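If the Intel tools are not already on your PATH, source the environment scripts shipped with them before running anything. A minimal sketch; the install paths and version placeholder are assumptions, so adjust them for your site:
source /opt/intel/bin/compilervars.sh intel64      # sets up the Intel compilers and MKL
source /opt/intel/impi/<version>/bin64/mpivars.sh  # sets up Intel MPI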
Parameter optimisation
The main parameters you need to consider when running HPL are:
- Problem size (N): For best performance, choose the largest problem size that fits in memory. For example, with 10 nodes of 1 GB RAM each, total memory is 10 GB, which holds roughly 1342 million double-precision (8-byte) elements; the square root of that number is about 36635. You need to leave some memory for the operating system and other processes, so as a rule of thumb a problem size that uses 80% of total memory is a good starting point (here, roughly 33000; see the worked sketch after this list). If the problem size is too large, memory gets swapped out and performance degrades sharply.
- Block size (NB): HPL uses the block size NB both for data distribution and for computational granularity. A very small NB limits computational performance because little data reuse occurs, and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval, with the best value depending on cache size. The following block sizes have been found to work well: 80-216 for IA32; 128-192 for IA64 with 3 MB cache; 400 for IA64 with 4 MB cache; and 130 for Woodcrest.
- Process grid ratio (P×Q): This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P. For example, for a 480-process run, a 20×24 grid is a good choice.
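As a worked sketch of the 80% rule above (the memory total and NB value are illustrative assumptions):
# Estimate the HPL problem size N from total memory:
# N = sqrt(0.80 * memory_in_bytes / 8 bytes per double), rounded down to a multiple of NB.
MEM_GIB=10   # total RAM across all nodes, in GiB (assumption for this example)
NB=192       # chosen block size (assumption)
N=$(awk -v m="$MEM_GIB" -v nb="$NB" \
    'BEGIN { n = sqrt(0.80 * m * 1024^3 / 8); print int(n / nb) * nb }')
echo "Suggested N: $N"   # prints 32640 for 10 GiB, in line with the ~33000 above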
Tips: You can also try changing the node order in the machine file to check for performance improvements. Tune all of the above parameters by trial and error to get the best performance.
You can also use a simple PHP web tool: enter your system specs and it will suggest optimal input parameters for your HPL file before you run the benchmark on the cluster. The tool is hosted on SourceForge:
http://hpl-calculator.sourceforge.net
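For the MPI (cluster) build of HPL, these parameters live in the standard HPL.dat input file. A sketch combining the examples above (N from the memory rule of thumb, NB = 192, and the 20×24 grid; the remaining lines are the stock HPL defaults, not tuned recommendations):
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
33000        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
20           Ps
24           Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)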
Running the benchmark on a Single Node
wget http://registrationcenter.intel.com/irc_nas/3669/l_lpk_p_11.1.1.004.tgz
tar zxvf l_lpk_p_11.1.1.004.tgz
cd linpack_11.1.1/benchmarks/linpack
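The runme_xeon64 script takes its parameter sweep from the lininput_xeon64 file in the same directory; edit that file to change the problem sizes. A sketch of its layout, reconstructed from the default values visible in the sample output below (the file shipped with your version may differ):
Sample Intel(R) Optimized LINPACK Benchmark data file (lininput_xeon64)
Intel(R) Optimized LINPACK Benchmark data
15                 # number of tests
1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000  # problem sizes
1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000  # leading dimensions
4 2 2 2 2 2 2 2 2 2 1 1 1 1 1   # times to run each test
4 4 4 4 4 4 4 4 4 4 4 1 1 1 1   # alignment values (in KBytes)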
./runme_xeon64
This will run locally on a single node across a range of parameters. Expect output along the lines of the following:
[david@compute002 linpack]$ ./runme_xeon64
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
Tue Jan 28 10:11:28 GMT 2014
Intel(R) Optimized LINPACK Benchmark data
Current date/time: Tue Jan 28 10:11:28 2014
CPU frequency: 2.699 GHz
Number of CPUs: 2
Number of cores: 16
Number of threads: 16
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=16200901024, at the size=45000
=================== Timing linear equation system solver ===================
Size LDA Align. Time(s) GFlops Residual Residual(norm) Check
1000 1000 4 0.020 33.8085 8.724688e-13 2.975343e-02 pass
1000 1000 4 0.006 117.8493 8.724688e-13 2.975343e-02 pass
1000 1000 4 0.006 120.3352 8.724688e-13 2.975343e-02 pass
1000 1000 4 0.006 119.2729 8.724688e-13 2.975343e-02 pass
2000 2000 4 0.031 170.4505 4.701128e-12 4.089406e-02 pass
2000 2000 4 0.031 172.9420 4.701128e-12 4.089406e-02 pass
5000 5008 4 0.365 228.3127 2.434170e-11 3.394253e-02 pass
5000 5008 4 0.362 230.6524 2.434170e-11 3.394253e-02 pass
10000 10000 4 2.670 249.7646 8.916344e-11 3.143993e-02 pass
10000 10000 4 2.666 250.1105 8.916344e-11 3.143993e-02 pass
15000 15000 4 8.508 264.5137 2.165846e-10 3.411244e-02 pass
15000 15000 4 8.571 262.5517 2.165846e-10 3.411244e-02 pass
18000 18008 4 14.205 273.7570 2.945255e-10 3.225417e-02 pass
18000 18008 4 14.387 270.2939 2.945255e-10 3.225417e-02 pass
20000 20016 4 19.636 271.6554 3.831049e-10 3.391318e-02 pass