Results:HPL GPU

From Define Wiki
Revision as of 23:01, 24 November 2015 by Adam

Results for HPL on GPU

  • CentOS 7.1 - Kernel 3.10.0-229.20.1.el7.x86_64
  • CUDA-7.5
  • OpenMPI 1.8.5
  • Intel Compiler and MKL 2013_3.174
Some observations from runs with 128GB RAM:
  • A larger N value allows for a larger DGEMM_SPLIT value.
  • Performance suffers significantly when using large N values (100k+) with low core counts per GPU (<3).
  • In some instances, oversubscribing cores per GPU gives a performance boost.
  • The N values for the 128GB runs were the maximum obtainable for that particular GPU configuration.
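The relationship between installed memory and the largest usable N can be sketched as follows. This is only an illustration and not the method used for the runs above: it assumes the common rule of thumb that the N x N double-precision matrix should occupy roughly 80% of host memory, that "GB" means GiB, and that N is rounded down to a multiple of NB. The function names `hpl_problem_size` and `matrix_gib` are hypothetical.

```python
import math


def hpl_problem_size(mem_gib, nb, fraction=0.8):
    """Largest N (a multiple of nb) whose N x N matrix of doubles
    fits in `fraction` of `mem_gib` GiB.

    The 0.8 default is a rule-of-thumb assumption, not a value from this page.
    """
    usable_bytes = fraction * mem_gib * 2**30
    max_elements = int(usable_bytes // 8)  # 8 bytes per double-precision value
    n = math.isqrt(max_elements)           # side length of the largest square matrix
    return (n // nb) * nb                  # round down to a multiple of the block size


def matrix_gib(n):
    """Memory taken by an n x n double-precision matrix, in GiB."""
    return n * n * 8 / 2**30


n = hpl_problem_size(128, 896)
print(n, f"{matrix_gib(n):.1f} GiB")
```

Under these assumptions, 128 GiB with NB = 896 gives N = 116480, which is the N listed for the 2-GPU 128GB run in the table. The other 128GB runs used smaller N values, so the fraction of memory that is actually usable evidently depends on the GPU configuration.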
HPL results (NVIDIA-compiled binary) from NVIDIA Tesla K-series cards on single dual-socket systems

GPU   Qty  System      CPU        Freq (GHz)  Cores  Memory (GB)  N       NB   Result (TFlop/s)  DGEMM Split
K80m  1    1028QG-TRT  E5-2650v3  2.5         10     64           51072   896  1.753             0.85
K80m  2    1028QG-TRT  E5-2650v3  2.5         10     64           79744   896  3.663             0.85
K80m  3    1028QG-TRT  E5-2650v3  2.5         10     64           79744   896  5.129             0.85
K80m  4    1028QG-TRT  E5-2650v3  2.5         10     64           79744   896  6.154             0.85
K80m  1    1028QG-TRT  E5-2650v3  2.5         10     128          51968   896  1.721             0.90
K80m  2    1028QG-TRT  E5-2650v3  2.5         10     128          116480  896  3.945             0.90
K80m  3    1028QG-TRT  E5-2650v3  2.5         10     128          91398   896  4.855             0.85
K80m  4    1028QG-TRT  E5-2650v3  2.5         10     128          102144  896  6.557             0.90
K80m  1    1028QG-TRT  E5-2698v3  2.3         16     128          -       896  -                 -
K80m  2    1028QG-TRT  E5-2698v3  2.3         16     128          -       896  -                 -
K80m  3    1028QG-TRT  E5-2698v3  2.3         16     128          -       896  -                 -
K80m  4    1028QG-TRT  E5-2698v3  2.3         16     128          -       896  -                 -
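
One way to read the table is to compare each multi-card result against a linear extrapolation of the single-card result. The sketch below does this for the 64GB E5-2650v3 runs, using the figures from the table. The helper name `scaling_efficiency` is hypothetical, and the ratio is only a rough guide: the single-card run used a smaller N (51072) than the multi-card runs (79744), so the numbers mix the effect of problem size with the effect of card count.

```python
# Results in TFlop/s, copied from the 64GB E5-2650v3 rows of the table.
results_64gb = {1: 1.753, 2: 3.663, 3: 5.129, 4: 6.154}


def scaling_efficiency(results):
    """Map card count -> measured result / (card count * single-card result)."""
    base = results[1]
    return {qty: tflops / (qty * base) for qty, tflops in results.items()}


for qty, eff in scaling_efficiency(results_64gb).items():
    print(f"{qty} card(s): {eff:.1%} of linear scaling")
```

A value above 100% does not indicate a measurement error here; it can arise because the larger N used in the multi-card runs is more efficient than the N used in the single-card baseline.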