Difference between revisions of "Results:HPL GPU"

From Define Wiki
 

Latest revision as of 23:01, 24 November 2015

Results for HPL on GPU

  • CentOS 7.1 - Kernel 3.10.0-229.20.1.el7.x86_64
  • CUDA-7.5
  • OpenMPI 1.8.5
  • Intel Compiler and MKL 2013_3.174
Some observations from runs with 128GB RAM:
  • A larger N value allows for a larger DGEMM_SPLIT value.
  • Performance suffers significantly at large N values (100k+) when the core count per GPU is low (<3).
  • In some instances, oversubscribing cores per GPU offers a performance boost.
  • The N values for the 128GB runs were the maximum obtainable for each GPU configuration.
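The link between memory and N noted above can be illustrated with a common HPL sizing rule of thumb: the N x N matrix of 8-byte doubles should occupy only part of host memory, and N is rounded down to a multiple of NB. The sketch below is not taken from these runs; the function name `hpl_problem_size` and the 80% memory fraction are assumptions made for illustration.

```python
import math


def hpl_problem_size(mem_gib, nb, fraction=0.8):
    """Return the largest N (a multiple of nb) whose N x N matrix of
    8-byte doubles fits in `fraction` of `mem_gib` GiB of host memory.

    The 0.8 default is an assumed rule of thumb, not a value from the runs.
    """
    usable_bytes = fraction * mem_gib * 1024**3
    n_max = int(math.sqrt(usable_bytes / 8))  # 8 bytes per double
    return (n_max // nb) * nb                 # round down to a multiple of NB


if __name__ == "__main__":
    for mem_gib in (64, 128):
        print(f"{mem_gib} GiB, NB=896 -> N = {hpl_problem_size(mem_gib, 896)}")
```

Under these assumptions, 128 GiB with NB = 896 gives N = 116480, the same value used in the 2-card 128GB run. The other runs use smaller N, which suggests the practical limit also depends on the GPU configuration.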
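{| class="wikitable" style="text-align:center; width:100%; "
|-
! colspan="11" | HPL Results (Nvidia Compiled Binary) from Nvidia Tesla K Series Cards on Single Dual-Socket Systems
|-
! GPU || Qty || System || CPU || Freq || Cores || Memory (GB) || N || NB || Result (TFlop/s) || DGEMM Split
|-
| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 51072 || 896 || 1.753 || 0.85
|-
| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 79744 || 896 || 3.663 || 0.85
|-
| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 79744 || 896 || 5.129 || 0.85
|-
| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64 || 79744 || 896 || 6.154 || 0.85
|-
| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 51968 || 896 || 1.721 || 0.90
|-
| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 116480 || 896 || 3.945 || 0.90
|-
| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 91398 || 896 || 4.855 || 0.85
|-
| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128 || 102144 || 896 || 6.557 || 0.90
|-
| scope="row" | K80m || 1 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|-
| scope="row" | K80m || 2 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|-
| scope="row" | K80m || 3 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|-
| scope="row" | K80m || 4 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128 || || 896 || ||
|}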
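
As a worked example of reading the table, the 64GB E5-2650v3 results can be converted into a scaling efficiency relative to the single-card run. This is a minimal sketch: the numbers are copied from the table, while the `scaling_efficiency` helper and the choice of the 1-card result as the baseline are assumptions for illustration.

```python
# Result (TFlop/s) by number of K80m cards, 64GB E5-2650v3 runs from the table.
results_64gb = {1: 1.753, 2: 3.663, 3: 5.129, 4: 6.154}


def scaling_efficiency(results):
    """Return {qty: measured / (qty * single-card result)} for each run."""
    base = results[1]
    return {qty: tflops / (qty * base) for qty, tflops in results.items()}


efficiency_64gb = scaling_efficiency(results_64gb)

for qty, eff in efficiency_64gb.items():
    print(f"{qty} x K80m: {results_64gb[qty]:.3f} TFlop/s, efficiency {eff:.1%}")
```

An efficiency above 100% for the 2-card run may reflect the larger N used in the multi-card runs (79744 versus 51072) rather than genuinely superlinear scaling, so the single-card baseline should be read with that caveat.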