Difference between revisions of "Results:HPL GPU"


Latest revision as of 23:01, 24 November 2015

Results for HPL on GPU

  • CentOS 7.1 - Kernel 3.10.0-229.20.1.el7.x86_64
  • CUDA-7.5
  • OpenMPI 1.8.5
  • Intel Compiler and MKL 2013_3.174
Some observations from runs with 128GB RAM:
  • A larger N value allows for a larger DGEMM_SPLIT value.
  • Performance suffers significantly when using large N values (100k+) with low core counts per GPU (<3).
  • In some instances, oversubscribing cores per GPU offers a performance boost.
  • The N values for the 128GB runs were the maximum obtainable for that particular GPU configuration.
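The link between N and installed memory can be made concrete with a little arithmetic: HPL factors a dense N x N matrix of 8-byte doubles, so the matrix alone needs 8·N² bytes. The sketch below is illustrative only and is not the procedure used for these runs. It assumes "128GB" means 128 GiB of host RAM, and the 80% memory fraction and the helper names (hpl_matrix_bytes, largest_n) are hypothetical choices made for the example.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL works on a dense N x N matrix of 64-bit floats


def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed to hold the N x N double-precision matrix."""
    return BYTES_PER_DOUBLE * n * n


def largest_n(mem_bytes: int, nb: int, fraction: float) -> int:
    """Largest N that is a multiple of nb and fits in fraction * mem_bytes.

    `fraction` is an assumed share of memory left for the matrix; it is a
    tuning guess, not a value taken from the runs on this page.
    """
    n_max = math.isqrt(int(fraction * mem_bytes) // BYTES_PER_DOUBLE)
    return (n_max // nb) * nb


if __name__ == "__main__":
    host_mem = 128 * 2**30  # assumption: "128GB" is 128 GiB of host RAM
    # N values reported for the 128GB runs (1 to 4 GPUs)
    for n in (51968, 116480, 91398, 102144):
        gib = hpl_matrix_bytes(n) / 2**30
        share = hpl_matrix_bytes(n) / host_mem
        print(f"N={n:>6}: {gib:6.1f} GiB ({share:.0%} of host RAM)")
    print("N at 80% of RAM, NB=896:", largest_n(host_mem, 896, 0.80))
```

Under the assumed 80% fraction, largest_n returns 116480 for 128 GiB and NB=896, which coincides with the N reported for the two-GPU 128GB run. The other rows use smaller N, so the fraction should be read as a starting point for experimentation and not as a rule these runs followed.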
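{| class="wikitable" style="text-align:center; width:100%; "
|-
! colspan="11" | HPL Results (Nvidia Compiled Binary) from Nvidia Tesla K Series Cards on Single Dual Socket Systems
|-
! GPU || Qty || System || CPU || Freq || Cores || Memory (GB) || N || NB || Result (TFlop/s) || DGEMM Split
|-
| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 51072 || 896 || 1.753 || 0.85
|-
| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 79744 || 896 || 3.663 || 0.85
|-
| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 79744 || 896 || 5.129 || 0.85
|-
| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 79744 || 896 || 6.154 || 0.85
|-
! colspan="11" |
|-
| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 51968 || 896 || 1.721 || 0.90
|-
| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 116480 || 896 || 3.945 || 0.90
|-
| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 91398 || 896 || 4.855 || 0.85
|-
| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 102144 || 896 || 6.557 || 0.90
|-
! colspan="11" |
|-
| scope="row" | K80m || 1 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|-
| scope="row" | K80m || 2 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|-
| scope="row" | K80m || 3 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|-
| scope="row" | K80m || 4 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|}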
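
One way to read the completed rows is as multi-GPU scaling relative to the single-card result. The sketch below computes speedup and parallel efficiency from the E5-2650v3 figures in the table. The dictionary and function names are made up for the example. Because N and DGEMM Split differ between rows, the ratios mix the effect of the problem size with the effect of adding cards and should be treated as rough indicators only.

```python
# TFlop/s by number of K80m cards, copied from the E5-2650v3 rows of the table
RESULTS_64GB = {1: 1.753, 2: 3.663, 3: 5.129, 4: 6.154}
RESULTS_128GB = {1: 1.721, 2: 3.945, 3: 4.855, 4: 6.557}


def scaling_efficiency(results):
    """Return {qty: (speedup, efficiency)} relative to the 1-card result.

    speedup    = result(qty) / result(1)
    efficiency = speedup / qty   (1.0 would be perfect linear scaling)
    """
    base = results[1]
    return {qty: (tflops / base, tflops / (qty * base))
            for qty, tflops in sorted(results.items())}


if __name__ == "__main__":
    for label, results in (("64GB", RESULTS_64GB), ("128GB", RESULTS_128GB)):
        print(f"{label} runs:")
        for qty, (speedup, eff) in scaling_efficiency(results).items():
            print(f"  {qty} x K80m: speedup {speedup:.2f}, efficiency {eff:.1%}")
```

The E5-2698v3 rows are left out because the page does not report results for them.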