Benchmarking: HPL (optimised) with Intel MKL and Intel MPI

Getting the software

Parameter optimisation

The main parameters to consider when running HPL:

  • Problem size (N): For best performance, the problem size should be the largest that fits in memory. For example, with 10 nodes of 1 GB RAM each, the total memory is 10 GB, which holds nearly 1342 million double-precision (8-byte) elements; the square root of that is about 36635, the largest N that would fit. You need to leave some memory for the operating system and other processes, so as a rule of thumb, a problem size that uses about 80% of total memory is a good starting point (in this case, roughly 33000; see the sketch after this list). If the problem size is too large, it gets swapped out and performance degrades.
  • Block Size (NB): HPL uses the block size NB both for data distribution and for computational granularity. A very small NB limits computational performance because little data reuse occurs, and the number of messages also increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size. The following block sizes have been found to work well: 80-216 for IA32; 128-192 for IA64 with 3 MB cache; 400 for IA64 with 4 MB cache; and 130 for Woodcrest.
  • Process Grid Ratio (PxQ): This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P. For example, for a 480-processor cluster, 20x24 is a good grid.
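The memory arithmetic in the first bullet is easy to reproduce. A minimal sketch using awk, with the 10-node / 1 GB example from above (the 80% figure is the rule of thumb, not a hard limit):

  # Total memory = 10 nodes x 1 GiB; each double-precision element takes 8 bytes.
  # Pick N so that N^2 elements fill ~80% of total memory.
  awk 'BEGIN { mem = 10 * 1024^3; print int(sqrt(0.8 * mem / 8)) }'
  # prints 32768 -- close to the ~33000 starting point suggested above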

Tips: You can also try changing the node order in the machine file (see the short example below) to check for performance improvement. Choose all of the above parameters by trial and error to get the best performance.
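For reference, an MPI machine file is just a list of hostnames, one per line, and reordering it changes which ranks land on which nodes (the hostnames below are made up for illustration):

  compute001
  compute002
  compute003
  compute004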

You can also use a simple PHP web tool: enter your system specs and it will suggest optimal input parameters for your HPL input file before you run the benchmark on the cluster. The tool is hosted on SourceForge at the URL below:

http://hpl-calculator.sourceforge.net
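For the MPI version of HPL, these parameters are set in the HPL.dat input file. A minimal illustrative fragment for a 16-process run (the values are examples only, not tuned recommendations; adjust N, NB, P and Q for your own cluster):

  HPLinpack benchmark input file
  Innovative Computing Laboratory, University of Tennessee
  HPL.out      output file name (if any)
  6            device out (6=stdout,7=stderr,file)
  1            # of problems sizes (N)
  33000        Ns
  1            # of NBs
  192          NBs
  0            PMAP process mapping (0=Row-,1=Column-major)
  1            # of process grids (P x Q)
  4            Ps
  4            Qs
  16.0         threshold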

Running the benchmark on a Single Node

  wget http://registrationcenter.intel.com/irc_nas/3669/l_lpk_p_11.1.1.004.tgz
  tar zxvf l_lpk_p_11.1.1.004.tgz 
  cd linpack_11.1.1/benchmarks/linpack
  ./runme_xeon64

This will run locally on a single node across a range of parameters. Expect output along the lines of the following:

[david@compute002 linpack]$ ./runme_xeon64
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
Tue Jan 28 10:11:28 GMT 2014
Intel(R) Optimized LINPACK Benchmark data

Current date/time: Tue Jan 28 10:11:28 2014

CPU frequency:    2.699 GHz
Number of CPUs: 2
Number of cores: 16
Number of threads: 16

Parameters are set to:

Number of tests: 15
Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    

Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.020      33.8085  8.724688e-13 2.975343e-02   pass
1000   1000   4      0.006      117.8493 8.724688e-13 2.975343e-02   pass
1000   1000   4      0.006      120.3352 8.724688e-13 2.975343e-02   pass
1000   1000   4      0.006      119.2729 8.724688e-13 2.975343e-02   pass
2000   2000   4      0.031      170.4505 4.701128e-12 4.089406e-02   pass
2000   2000   4      0.031      172.9420 4.701128e-12 4.089406e-02   pass
5000   5008   4      0.365      228.3127 2.434170e-11 3.394253e-02   pass
5000   5008   4      0.362      230.6524 2.434170e-11 3.394253e-02   pass
10000  10000  4      2.670      249.7646 8.916344e-11 3.143993e-02   pass
10000  10000  4      2.666      250.1105 8.916344e-11 3.143993e-02   pass
15000  15000  4      8.508      264.5137 2.165846e-10 3.411244e-02   pass
15000  15000  4      8.571      262.5517 2.165846e-10 3.411244e-02   pass
18000  18008  4      14.205     273.7570 2.945255e-10 3.225417e-02   pass
18000  18008  4      14.387     270.2939 2.945255e-10 3.225417e-02   pass
20000  20016  4      19.636     271.6554 3.831049e-10 3.391318e-02   pass