Difference between revisions of "Results:HPL GPU"

Latest revision as of 23:01, 24 November 2015

Results for HPL on GPU

CentOS 7.1 - Kernel 3.10.0-229.20.1.el7.x86_64
CUDA-7.5
OpenMPI 1.8.5
Intel Compiler and MKL 2013_3.174

Some observations from runs with 128GB RAM:

A larger N value allows for a larger DGEMM_SPLIT value.
Performance suffers significantly when using large N values (100k+) with low core counts per GPU (<3).
In some instances, oversubscribing cores per GPU offers a performance boost.
The N values for the 128GB runs were the maximum obtainable for each GPU configuration.
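
As a rough illustration of how a maximum N relates to host memory, the sketch below applies the common HPL rule of thumb: the N x N double-precision matrix should fill a fixed fraction of RAM, with N rounded down to a multiple of NB. The 0.8 fraction and the function name `estimate_hpl_n` are assumptions for illustration, not values taken from these runs. The actual limit on a GPU system may also depend on GPU memory and the binary used.

```python
import math


def estimate_hpl_n(mem_gib, nb, fraction=0.8):
    """Estimate an HPL problem size N for a given amount of host memory.

    mem_gib  -- host memory in GiB
    nb       -- HPL block size; N is rounded down to a multiple of it
    fraction -- share of memory given to the matrix (assumed rule of thumb)
    """
    mem_bytes = mem_gib * 1024**3
    # Each matrix element is an 8-byte double, and the matrix has N*N elements.
    max_elements = int(mem_bytes * fraction / 8)
    n = math.isqrt(max_elements)
    return (n // nb) * nb


print(estimate_hpl_n(128, 896))  # 116480
print(estimate_hpl_n(64, 896))   # 82432
```

With 128 GiB and NB = 896 this gives 116480, which matches the N used in the 2-GPU 128GB run. The other rows use different values, so the estimate is only a starting point for tuning.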

HPL Results (Nvidia Compiled Binary) from Nvidia Tesla K Series Cards on Single Dual Socket Systems
GPU	Qty	System	CPU	Freq	Cores	Memory (GB)	N	NB	Result TFlop/s	DGEMM Split
K80m	1	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	51072	896	1.753 TFlop/s	0.85
K80m	2	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	79744	896	3.663 TFlop/s	0.85
K80m	3	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	79744	896	5.129 TFlop/s	0.85
K80m	4	1028QG-TRT	E5-2650v3	2.5GHz	10	64GB	79744	896	6.154 TFlop/s	0.85

K80m	1	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	51968	896	1.721 TFlop/s	0.90
K80m	2	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	116480	896	3.945 TFlop/s	0.90
K80m	3	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	91398	896	4.855 TFlop/s	0.85
K80m	4	1028QG-TRT	E5-2650v3	2.5GHz	10	128GB	102144	896	6.557 TFlop/s	0.90

K80m	1	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
K80m	2	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
K80m	3	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
K80m	4	1028QG-TRT	E5-2698v3	2.3GHz	16	128GB		896	TFlop/s
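
To make the scaling across GPU counts easier to read, the following sketch computes per-card throughput and scaling efficiency relative to the single-card run. It uses the 64GB E5-2650v3 results from the table above; the helper name `scaling_efficiency` is hypothetical.

```python
def scaling_efficiency(results):
    """Return measured TFlop/s divided by (qty x single-card TFlop/s).

    results -- dict mapping GPU card count to the measured HPL result in TFlop/s;
               it must contain an entry for 1 card, which is used as the baseline.
    """
    base = results[1]
    return {qty: tflops / (qty * base) for qty, tflops in results.items()}


# 64GB runs on the E5-2650v3 system, taken from the table above.
results_64gb = {1: 1.753, 2: 3.663, 3: 5.129, 4: 6.154}

for qty, eff in scaling_efficiency(results_64gb).items():
    per_card = results_64gb[qty] / qty
    print(f"{qty} x K80m: {per_card:.3f} TFlop/s per card, {eff:.1%} of linear scaling")
```

The same function can be applied to the 128GB rows by swapping in their results. Note that N differs between those runs, so the comparison is less direct.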

@@ Line 8: / Line 8: @@
 {| class="wikitable" style="text-align:center; width:100%; "
 |-
-! colspan="11" | FERMI HPL Results from Nvidia Tesla K Series Cards on Single Dual Socket Systems
+! colspan="11" | HPL Results (Nvidia Compiled Binary) from Nvidia Tesla K Series Cards on Single Dual Socket Systems
 |-
 ! GPU || Qty || System || CPU || Freq || Cores || Memory (GB)|| N || NB || Result TFlop/s || DGEMM Split
@@ Line 22: / Line 22: @@
 ! colspan="11" |
 |-
-| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || || 896 ||  TFlop/s || 0.90
+| scope="row" | K80m || 1 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 51968 || 896 || 1.721 TFlop/s || 0.90
 |-
-| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || || 896 ||  TFlop/s || 0.90
+| scope="row" | K80m || 2 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 116480 || 896 || 3.945 TFlop/s || 0.90
 |-
-| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || || 896 ||  TFlop/s || 0.90
+| scope="row" | K80m || 3 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 91398 || 896 || 4.855 TFlop/s || 0.85
 |-
-| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || || 896 ||  TFlop/s || 0.90
+| scope="row" | K80m || 4 || 1028QG-TRT || E5-2650v3 || 2.5GHz || 10 || 128GB || 102144 || 896 || 6.557 TFlop/s || 0.90
 |-
+'''Some observations from runs with 128GB RAM:'''
+* A larger N value allows for a larger DGEMM_SPLIT value.
+* Performance suffers significantly when using large N values (100k+) with low core counts per GPU (<3).
+* In some instances, oversubscribing cores per GPU offers a performance boost.
+* The N values for the 128GB runs were the maximum obtainable for each GPU configuration.
 ! colspan="11" |
 |-
