Difference between revisions of "Results:HPL GPU"

Latest revision as of 23:01, 24 November 2015

Results for HPL on GPU

CentOS 7.1 - Kernel 3.10.0-229.20.1.el7.x86_64
CUDA-7.5
OpenMPI 1.8.5
Intel Compiler and MKL 2013_3.174

Some observations from runs with 128GB RAM:

A larger N value allows for a larger DGEMM_SPLIT value.
Performance suffers significantly when using large N values (100k+) with low core counts per GPU (<3).
In some instances oversubscribing cores per GPU will offer a performance boost.
The N values for the 128GB runs were the maximum obtainable for each GPU configuration.
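
The relationship between host memory and the largest usable N can be sketched with the common HPL rule of thumb that the N x N double-precision matrix should occupy only part of RAM. This is an illustration, not the method used for the runs on this page: the function name `max_problem_size`, the 80% memory fraction, and the reading of "GB" as GiB are all assumptions.

```python
import math


def max_problem_size(mem_gib, nb, fraction=0.8):
    """Estimate the largest HPL problem size N for a given amount of host RAM.

    Assumptions (not taken from the results page):
      * the N x N matrix of 8-byte doubles may use `fraction` of RAM
      * memory is given in GiB (2**30 bytes)
    The result is rounded down to a multiple of the block size NB.
    """
    usable_bytes = int(mem_gib * 2**30 * fraction)
    n_max = math.isqrt(usable_bytes // 8)  # 8 bytes per matrix element
    return (n_max // nb) * nb


if __name__ == "__main__":
    for mem in (64, 128):
        print(f"{mem} GiB, NB=896 -> N = {max_problem_size(mem, 896)}")
```

Under these assumptions the estimate for 128 GiB with NB = 896 is 116480, the same N as the two-GPU 128GB run. The other runs used smaller N values, so the memory fraction alone does not explain every row.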

HPL Results (Nvidia Compiled Binary) from Nvidia Tesla K Series Cards on Single Dual Socket Systems
GPU	Qty	System	CPU	Freq	Cores	Memory (GB)	N	NB	Result TFlop/s	DGEMM Split
K80m	1	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	51072	896	1.753 TFlop/s	0.85
K80m	2	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	79744	896	3.663 TFlop/s	0.85
K80m	3	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	79744	896	5.129 TFlop/s	0.85
K80m	4	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	79744	896	6.154 TFlop/s	0.85

K80m	1	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	51968	896	1.721 TFlop/s	0.90
K80m	2	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	116480	896	3.945 TFlop/s	0.90
K80m	3	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	91398	896	4.855 TFlop/s	0.85
K80m	4	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	102144	896	6.557 TFlop/s	0.90

K80m	1	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
K80m	2	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
K80m	3	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
K80m	4	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
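
As a rough way to read the table, the multi-GPU results can be compared against linear scaling from the single-GPU result. The sketch below uses the 64GB E5-2650v3 rows only; the helper name `scaling_efficiency` is hypothetical, and the calculation is a simple ratio rather than anything reported by HPL itself.

```python
# Results (TFlop/s) copied from the 64GB E5-2650v3 rows, keyed by GPU quantity.
results_64gb = {1: 1.753, 2: 3.663, 3: 5.129, 4: 6.154}


def scaling_efficiency(results):
    """Return measured / (qty * single-GPU result) for each GPU quantity."""
    base = results[1]
    return {qty: tflops / (qty * base) for qty, tflops in results.items()}


if __name__ == "__main__":
    for qty, eff in scaling_efficiency(results_64gb).items():
        print(f"{qty} x K80m: {eff:.1%} of linear scaling")
```

The single-GPU run used a smaller N (51072) than the multi-GPU runs (79744), so the baseline is not strictly comparable. That difference is a plausible reason the two-GPU figure comes out slightly above 100%.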

@@ Line 1: / Line 1: @@
 == Results for HPL on GPU ==
+* CentOS 7.1 - Kernel 3.10.0-229.20.1.el7.x86_64
+* CUDA-7.5
+* OpenMPI 1.8.5
+* Intel Compiler and MKL 2013_3.174
 {| class="wikitable" style="text-align:center; width:100%; "
 |-
-! colspan="11" | FERMI HPL Results from Nvidia Tesla K Series Cards on Single Dual Socket Systems
+! colspan="11" | HPL Results (Nvidia Compiled Binary) from Nvidia Tesla K Series Cards on Single Dual Socket Systems
+|-
+! GPU || Qty || System || CPU || Freq || Cores || Memory (GB)|| N || NB || Result TFlop/s || DGEMM Split
+|-
+| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 51072 || 896 || 1.753 TFlop/s || 0.85
+|-
+| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 79744 || 896 || 3.663 TFlop/s || 0.85
+|-
+| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 79744 || 896 || 5.129 TFlop/s || 0.85
+|-
+| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 79744 || 896 || 6.154 TFlop/s || 0.85
 |-
-! GPU || Qty || System || CPU || Freq || Cores || Memory (GB)|| N || NB || Result TFlop/s || Output File
+! colspan="11" |
 |-
-| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 51072 || 896 || 1.753 TFlop/s || [[gpu-hpl-1-k80.out]]
+| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 51968 || 896 || 1.721 TFlop/s || 0.90
 |-
-| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 79744 || 896 || 3.663 TFlop/s || [[gpu-hpl-2-k80.out]]
+| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 116480 || 896 || 3.945 TFlop/s || 0.90
 |-
-| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 79744 || 896 || 5.129 TFlop/s || [[gpu-hpl-3-k80.out]]
+| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 91398 || 896 || 4.855 TFlop/s || 0.85
 |-
-| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 64GB || 79744 || 896 || 6.044 TFlop/s || [[gpu-hpl-4-k80.out]]
+| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 102144 || 896 || 6.557 TFlop/s || 0.90
+|-
+'''Some observations from runs with 128GB RAM:'''
+* A larger N value allows for a larger DGEMM_SPLIT value.
+* Performance suffers significantly when using large N values 100k+ with low core count's per GPU (<3).
+* In some instances oversubscribing cores per GPU will offer a performance boost.
+* The N values for the 128GB runs were the maximum obtainable for that particular GPU configuration
+! colspan="11" |
+|-
+| scope="row" | K80m || 1 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128GB || || 896 ||  TFlop/s ||
+|-
+| scope="row" | K80m || 2 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128GB || || 896 ||  TFlop/s ||
+|-
+| scope="row" | K80m || 3 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128GB || || 896 ||  TFlop/s ||
+|-
+| scope="row" | K80m || 4 || 1028QG-TRT || E5-2698v3 || 2.3GHz || 16 || 128GB || || 896 ||  TFlop/s ||
 |-
 |}
-{{Line chart
-| color_background = white
-| width = 500
-| height = 350
-| padding_left = 40
-| padding_right = 15
-| padding_top = 10
-| padding_bottom = 20
-| number_of_series = 3
-| number_of_x-values = 10
-| label_x1 = Val. 1 | label_x2 = Val. 2 | label_x3 = Val. 3 | label_x4 = Val. 4 | label_x5 = Val. 5
-| label_x6 = Val. 6 | label_x7 = Val. 7 | label_x8 = Val. 8 | label_x9 = Val. 9 | label_x10 = Val. 10
-| y_max = 3000
-| y_min = 1000
-| scale = yes
-| interval_primary_scale = 1000
-| interval_secondary_scale = 100
-| S01V02 = 2200 | S01V03 = 2400 | S01V04 = 2500 | S01V05 = 2600 | S01V06 = 2500
-| S02V01 = 1400 | S02V02 = 2000 | S02V03 = 1600 | S02V04 = 1800 | S02V05 = 2400
-| S02V06 = 2400 | S02V07 = 2500 | S02V08 = 2000 | S02V09 = 1600 | S02V10 = 1800
-| S03V01 = 1800 | S03V04 = 2000 | S03V05 = 1600 | S03V06 = 1800 | S03V07 = 2400
-| S03V09 = 2400
-| points = yes
-}}
-{{legend|red|Series 1}}
-{{legend|blue|Series 2}}
-{{legend|green|series 3}}
