= Seattle: Building and running HPL =

== Pre-Requisites ==

The default O/S image provided by AMD is very minimal and does not include many packages. The following should be installed:

<syntaxhighlight>
yum groupinstall 'Development Tools'
yum install tar bzip2 bc g++
</syntaxhighlight>

== Install OpenMPI ==

Either install '''<code>openmpi</code>''' from the repos, build a copy manually, or choose an alternative MPI.

<syntaxhighlight>
yum install openmpi openmpi-devel
</syntaxhighlight>

== Set up user environment ==

<syntaxhighlight>
# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib:/usr/include/openmpi-aarch64/
</syntaxhighlight>
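To confirm the wrappers are picked up, a quick sanity check along these lines can be run in a new shell (a minimal sketch; it only assumes the packaged OpenMPI paths above):

<syntaxhighlight>
# reload the environment and confirm the OpenMPI tools resolve
source ~/.bashrc
which mpicc mpif90 mpirun
mpirun --version
</syntaxhighlight>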

== Download & build ATLAS ==

Download ATLAS from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download

Extract the contents of the tarball:

<syntaxhighlight>
cd /root/scratch
wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2
tar jxvf atlas3.10.1.tar.bz2
</syntaxhighlight>

Build ATLAS. N.B. this will take a long time: on our sample Seattle system it took in excess of 5 hours, so it is worth running the build process in a screen session or with nohup (an example is shown after the commands below).

<syntaxhighlight>
cd /root/scratch
mkdir atlas3.10.1_build
cd atlas3.10.1_build/
/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
make
make check
</syntaxhighlight>
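Since the build runs for several hours, one option (a sketch; the log file name is arbitrary) is to launch it under nohup so it survives a dropped SSH session:

<syntaxhighlight>
cd /root/scratch/atlas3.10.1_build
# run configure, build and check in the background, logging to a file
nohup sh -c '/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1 && make && make check' > atlas_build.log 2>&1 &
# follow progress
tail -f atlas_build.log
</syntaxhighlight>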

== Download & Build Linpack ==

Latest versions are available from: http://www.netlib.org/benchmark/hpl/

Download and decompress HPL:

<syntaxhighlight>
cd /root/scratch
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar zxvf hpl-2.1.tar.gz
</syntaxhighlight>

Create the Makefile:

<syntaxhighlight>
cd hpl-2.1
cp setup/Make.Linux_PII_CBLAS Make.Linux_aarch64
</syntaxhighlight>

Edit the following items (approximate line numbers shown):

<syntaxhighlight>
---
64 ARCH         = Linux_aarch64
---
70 TOPdir       = $(HOME)/scratch/hpl-2.1
---
84 MPdir        = /usr/lib64/openmpi
85 MPinc        = -I/usr/include/openmpi-aarch64
86 MPlib        = $(MPdir)/lib/libmpi.so
---
95 LAdir        = /root/scratch/atlas3.10.1_build/lib
---
176 LINKER      = /usr/lib64/openmpi/bin/mpif90
</syntaxhighlight>
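The same edits can also be scripted; the <code>sed</code> expressions below are only a sketch that assumes the variable assignments sit at the start of a line in the copied makefile, so the resulting file should still be checked by hand:

<syntaxhighlight>
cd /root/scratch/hpl-2.1
# rewrite the relevant variables in the copied makefile, then verify with grep
sed -i -e 's|^ARCH *=.*|ARCH         = Linux_aarch64|' \
       -e 's|^TOPdir *=.*|TOPdir       = $(HOME)/scratch/hpl-2.1|' \
       -e 's|^MPdir *=.*|MPdir        = /usr/lib64/openmpi|' \
       -e 's|^MPinc *=.*|MPinc        = -I/usr/include/openmpi-aarch64|' \
       -e 's|^MPlib *=.*|MPlib        = $(MPdir)/lib/libmpi.so|' \
       -e 's|^LAdir *=.*|LAdir        = /root/scratch/atlas3.10.1_build/lib|' \
       -e 's|^LINKER *=.*|LINKER       = /usr/lib64/openmpi/bin/mpif90|' \
       Make.Linux_aarch64
grep -E '^(ARCH|TOPdir|MPdir|MPinc|MPlib|LAdir|LINKER) ' Make.Linux_aarch64
</syntaxhighlight>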

Build:

<syntaxhighlight>
cd /root/scratch/hpl-2.1
make arch=Linux_aarch64
</syntaxhighlight>

Test run using the default HPL.dat file (which is for 4 cores) to ensure the build works OK. The performance won't be great but HPL should run and complete within a few seconds:

<syntaxhighlight>
cd bin/Linux_aarch64
mpirun -np 4 ./xhpl
</syntaxhighlight>
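A quick way to confirm the run completed and passed its residual checks (a sketch; the grep patterns assume the usual HPL summary output):

<syntaxhighlight>
# capture the run and pull out the performance and residual-check lines
mpirun -np 4 ./xhpl | tee hpl_test.log
grep -E "Gflops|PASSED|FAILED" hpl_test.log
</syntaxhighlight>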

== Edit the HPL.dat to optimise ==

* '''Problem size (N):''' For the best performance, the problem size should be the largest that fits in memory. Our sample system had a total of 16GB. There are 125M double-precision elements per 1GB of memory, so 16GB of RAM holds 2 billion double-precision elements; the square root of that number is 44721. Some memory must be left for the operating system and other processes, so as a rule of thumb about 80% of that maximum is a starting point - in this case roughly 35777. N / (P * Q) also needs to be an integer, so 35328 is a reasonable number (see the worked calculation after this list). N.B. If the problem size is too large, the matrix is swapped out and performance will degrade.
* '''Block Size (NB):''' HPL uses the block size NB for the data distribution as well as for the computational granularity. A very small NB will limit computational performance because no data reuse will occur, and the number of messages will also increase. "Good" block sizes are almost always in the [32 .. 256] interval and the best value depends on cache size. Block sizes found to work well include 80-216 for IA32, 128-192 for IA64 with 3M cache, 400 for IA64 with 4M cache, and 130 for Woodcrest.
* '''Process Grid Ratio (PxQ):''' This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P, and P * Q should equal the number of available cores. Our sample system had a 6-core CPU, so '''P = 2''' & '''Q = 3'''.
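As a rough sketch of the sizing arithmetic above, the calculation can be reproduced with the <code>bc</code> installed in the pre-requisites (the 80% factor and the rounding to a multiple of NB * P * Q are assumptions that happen to reproduce the 35328 used here):

<syntaxhighlight>
# worked example of the problem-size arithmetic for the 16GB sample system
MEM_GB=16      # total memory in GB
NB=128         # block size
P=2; Q=3       # process grid

# maximum N = square root of the number of 8-byte elements that fit in memory
NMAX=$(echo "sqrt($MEM_GB * 1000000000 / 8)" | bc)
# take roughly 80% of that as a starting point
NSTART=$(echo "$NMAX * 80 / 100" | bc)
# round down to a multiple of NB * P * Q so N divides evenly by both NB and P*Q
N=$(echo "($NSTART / ($NB * $P * $Q)) * $NB * $P * $Q" | bc)
echo "Nmax=$NMAX  start=$NSTART  N=$N"   # prints 44721, 35776 and 35328 here
</syntaxhighlight>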

These numbers can be auto-generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/

Example of HPL.dat file from initial testing:

<syntaxhighlight>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
35328         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
3            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
</syntaxhighlight>
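With this HPL.dat in place in the build's <code>bin</code> directory, the benchmark can then be launched across all six cores (a sketch; the directory name follows the arch used for the build above):

<syntaxhighlight>
cd /root/scratch/hpl-2.1/bin/Linux_aarch64
# P x Q = 2 x 3, so start 6 MPI ranks
mpirun -np 6 ./xhpl
</syntaxhighlight>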

== Results ==

{| class="wikitable"
|+ HPL (Linpack) results for single SoC system
|-
! CPU !! Freq !! Cores !! Memory (GB) !! N !! NB !! PxQ !! System !! Result (GFLOPS)
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 35328 || 128 || 2x3 || Overdrive-demo || 8.963
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 128 || 2x3 || Overdrive-demo || 9.054
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 256 || 2x3 || Overdrive-demo || 9.738
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 384 || 2x3 || Overdrive-demo || 9.809
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 43200 || 384 || 2x3 || Overdrive-demo || 9.832
|}