= Seattle: Building and running HPL =
== Pre-Requisites ==
The default O/S image provided by AMD is very minimal and doesn't include many packages at all. The following should be installed:
<syntaxhighlight>
yum groupinstall 'Development Tools'
yum install tar bzip2 bc g++
</syntaxhighlight>
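A quick check that the toolchain is in place before starting the long builds below; a Fortran compiler is needed later to link HPL with <code>mpif90</code>, so if the Development Tools group on your image does not pull it in, install <code>gcc-gfortran</code> as well:
<syntaxhighlight>
gcc --version
gfortran --version
make --version
bc --version
</syntaxhighlight>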
== Install OpenMPI ==

Either install '''<code>openmpi</code>''' from the repos, build a copy manually, or choose an alternative MPI.
<syntaxhighlight>
yum install openmpi openmpi-devel
</syntaxhighlight>

== Set up user environment ==
<syntaxhighlight>
# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib:/usr/include/openmpi-aarch64/
</syntaxhighlight>
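Once the environment is reloaded, a quick smoke test that the MPI wrappers and launcher work (this assumes the package layout shown above):
<syntaxhighlight>
source ~/.bashrc
which mpicc mpif90 mpirun
mpirun --version

# launch a trivial command across all 6 cores to confirm mpirun works
mpirun -np 6 hostname
</syntaxhighlight>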
== Download & build ATLAS ==

Download ATLAS from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download
Extract the contents of the tarball:
<syntaxhighlight>
cd /root/scratch
wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2
tar jxvf atlas3.10.1.tar.bz2
</syntaxhighlight>

Build ATLAS. N.B. this will take a long time: on our sample Seattle system it took in excess of 5 hours, so it is worth running the build process in a screen session or with nohup (a sketch follows the build commands below).
<syntaxhighlight>
cd /root/scratch
mkdir atlas3.10.1_build
cd atlas3.10.1_build/
/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
make
make check
</syntaxhighlight>
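As noted above, the <code>make</code> step can run for several hours; a minimal sketch for detaching it from the login session with nohup (the log file name is arbitrary):
<syntaxhighlight>
cd /root/scratch/atlas3.10.1_build
nohup make > atlas-build.log 2>&1 &
# follow progress; Ctrl-C stops the tail, the build keeps running
tail -f atlas-build.log
</syntaxhighlight>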
== Download & Build Linpack ==

Latest versions are available from: http://www.netlib.org/benchmark/hpl/
Download and decompress HPL:
<syntaxhighlight>
cd /root/scratch
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar zxvf hpl-2.1.tar.gz
</syntaxhighlight>

Create the Makefile:
<syntaxhighlight>
cd hpl-2.1
cp setup/Make.Linux_PII_CBLAS Make.Linux_aarch64
</syntaxhighlight>

Edit the following items (approximate line numbers shown):
<syntaxhighlight>
---
 64 ARCH   = Linux_aarch64
---
 70 TOPdir = $(HOME)/scratch/hpl-2.1
---
 84 MPdir  = /usr/lib64/openmpi
 85 MPinc  = -I/usr/include/openmpi-aarch64
 86 MPlib  = $(MPdir)/lib/libmpi.so
---
 95 LAdir  = /root/scratch/atlas3.10.1_build/lib
---
176 LINKER = /usr/lib64/openmpi/bin/mpif90
</syntaxhighlight>
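The stock <code>Make.Linux_PII_CBLAS</code> template already links ATLAS through <code>LAlib</code>, so usually only <code>LAdir</code> needs changing; if your ATLAS libraries live somewhere else (for example under the <code>--prefix</code> used above), the relevant lines look roughly like this (line numbers approximate, values shown are the template defaults and may differ in your copy):
<syntaxhighlight>
 96 LAinc  =
 97 LAlib  = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
</syntaxhighlight>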
Build:
<syntaxhighlight>
cd /root/scratch/hpl-2.1
make arch=Linux_aarch64
</syntaxhighlight>
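A quick way to confirm that the build produced a binary and that it picks up the OpenMPI shared library (the directory name follows the <code>arch=</code> value used above):
<syntaxhighlight>
ls -l bin/Linux_aarch64/xhpl
# should resolve libmpi.so from /usr/lib64/openmpi/lib
ldd bin/Linux_aarch64/xhpl | grep mpi
</syntaxhighlight>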
Test run using the default HPL.dat file (which is for 4 cores) to ensure the build works OK. The performance won't be great, but HPL should run and complete within a few seconds:
<syntaxhighlight>
cd bin/Linux_aarch64
mpirun -np 4 ./xhpl
</syntaxhighlight>
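The results go to stdout (device out = 6 in HPL.dat), so it can be handy to capture them and pull out the key lines; a small sketch (hpl-test.log is an arbitrary name, and result lines start with a T/V label such as WR11C2R4 when the default wall-clock timing is used):
<syntaxhighlight>
mpirun -np 4 ./xhpl | tee hpl-test.log

# the Gflops result line(s) and the residual checks
grep -E "WR|PASSED|FAILED" hpl-test.log
</syntaxhighlight>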
== Edit the HPL.dat to optimise ==

* '''Problem size (N):''' For best performance the problem size should be close to the largest that fits in memory. Our sample system had a total of '''16GB'''. There are 125M double-precision elements per 1GB of memory, so 16GB of RAM holds '''2 billion double-precision elements''', and the square root of that gives a maximum N of about '''44721'''. Some memory must be left for the operating system and other processes; as a rule of thumb, about 80% of that maximum N is a starting point, which here is '''35777'''. Rounding down to a multiple of the block size NB (and keeping N / (P * Q) an integer) gives '''35328''' (276 x 128) as a reasonable number; a small script for this arithmetic follows this list. N.B. if the problem size is too large it will be swapped out and performance will degrade.
* '''Block Size (NB):''' HPL uses the block size NB for the data distribution as well as for the computational granularity. A very small NB limits computational performance because little data reuse occurs and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size; values found to work well are 80-216 for IA32, 128-192 for IA64 with 3MB cache, 400 for IA64 with 4MB cache, and 130 for Woodcrest.
* '''Process Grid Ratio (P x Q):''' This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P, and P * Q should equal the number of available cores. Our sample system had a 6-core CPU, so '''P = 2''' and '''Q = 3'''.
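A minimal sketch of the problem-size arithmetic using <code>bc</code> (installed in the pre-requisites). The memory size and block size are inputs you would adjust; the value you finally settle on (35328 in the example below) may be a little lower still to leave extra head-room:
<syntaxhighlight>
#!/bin/sh
MEM_GB=16   # total system memory in GB (sample system)
NB=128      # chosen block size

# largest N that fits: sqrt(bytes of RAM / 8 bytes per double)
NMAX=$(echo "sqrt($MEM_GB * 1000000000 / 8)" | bc)
# ~80% of that as a starting point, rounded down to a multiple of NB
N=$(echo "($NMAX * 80 / 100 / $NB) * $NB" | bc)
echo "NMAX=$NMAX  suggested starting N=$N"
</syntaxhighlight>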
Alternatively, these numbers can be auto-generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
=== Example of HPL.dat file from initial testing ===
<syntaxhighlight>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
35328 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
3 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
</syntaxhighlight>
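With this HPL.dat in place the number of MPI ranks has to match P x Q (2 x 3 = 6 here), so the tuned run on our sample system looks like this (same build directory as above):
<syntaxhighlight>
cd /root/scratch/hpl-2.1/bin/Linux_aarch64
mpirun -np 6 ./xhpl
</syntaxhighlight>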
== Results ==

{| class="wikitable" style="text-align:center; width:100%;"
|+ HPL results for a single SoC system
|-
! colspan="9" | AMD Seattle
|-
! CPU || Freq || Cores || Memory (GB) || N || NB || PxQ || System || Result (GFLOPS)
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 35328 || 128 || 2x3 || Overdrive-demo || 8.963
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 128 || 2x3 || Overdrive-demo || 9.054
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 256 || 2x3 || Overdrive-demo || 9.738
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 384 || 2x3 || Overdrive-demo || 9.809
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 43200 || 384 || 2x3 || Overdrive-demo || 9.832
|}