Seattle: Building and running HPL

Prerequisites

The default OS image provided by AMD is very minimal and includes few packages. Install the following first:

yum groupinstall 'Development Tools'

yum install tar bzip2 bc gcc-c++

Install OpenMPI

Either install openmpi from the repos, build it manually, or choose an alternative MPI.

yum install openmpi openmpi-devel

Set up user environment

# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
# the OpenMPI headers live in /usr/include/openmpi-aarch64; they are
# referenced later via MPinc in the HPL Makefile, not via LD_LIBRARY_PATH
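
A quick sanity check that the environment picked up the right MPI (the paths below assume the repo package locations used above):

source ~/.bashrc
which mpirun     # should print /usr/lib64/openmpi/bin/mpirun
mpirun --version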

Download & build ATLAS

Download ATLAS from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download

Extract the contents of the tarball:

cd /root/scratch

wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2

tar jxvf atlas3.10.1.tar.bz2

Build ATLAS. N.B. - This will take a long time. On our sample Seattle system this took in excess of 5 hours. It is worth running the build process in a screen session or with nohup.

cd /root/scratch
mkdir atlas3.10.1_build
cd atlas3.10.1_build/
/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
make 
make check
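
Given the build time, the make step above can be run unattended with nohup, as suggested; one way:

cd /root/scratch/atlas3.10.1_build
nohup make > make.log 2>&1 &
tail -f make.log   # follow progress; Ctrl-C stops tail, not the build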

Download & Build Linpack

Latest versions are available from: http://www.netlib.org/benchmark/hpl/

Download and decompress HPL:

cd /root/scratch

wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz

tar zxvf hpl-2.1.tar.gz

Create the Makefile:

cd hpl-2.1
cp setup/Make.Linux_PII_CBLAS Make.Linux_aarch64

Edit the following items (approximate line numbers shown):

---
64 ARCH         = Linux_aarch64
---
70 TOPdir       = $(HOME)/scratch/hpl-2.1
---
84 MPdir        = /usr/lib64/openmpi
85 MPinc        = -I/usr/include/openmpi-aarch64
86 MPlib        = $(MPdir)/lib/libmpi.so
---
95 LAdir        = /root/scratch/atlas3.10.1_build/lib
---
176 LINKER       = /usr/lib64/openmpi/bin/mpif90
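
If you prefer, the same edits can be applied with sed; a sketch, assuming the variable assignments start at column 0 as they do in the stock Make files (double-check the result before building):

cd /root/scratch/hpl-2.1
sed -i \
  -e 's|^ARCH *=.*|ARCH         = Linux_aarch64|' \
  -e 's|^TOPdir *=.*|TOPdir       = $(HOME)/scratch/hpl-2.1|' \
  -e 's|^MPdir *=.*|MPdir        = /usr/lib64/openmpi|' \
  -e 's|^MPinc *=.*|MPinc        = -I/usr/include/openmpi-aarch64|' \
  -e 's|^MPlib *=.*|MPlib        = $(MPdir)/lib/libmpi.so|' \
  -e 's|^LAdir *=.*|LAdir        = /root/scratch/atlas3.10.1_build/lib|' \
  -e 's|^LINKER *=.*|LINKER       = /usr/lib64/openmpi/bin/mpif90|' \
  Make.Linux_aarch64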

Build:

cd /root/scratch/hpl-2.1
make arch=Linux_aarch64

Do a test run using the default HPL.dat file (which is set up for 4 processes) to ensure the build works. Performance won't be great, but HPL should run and complete within a few seconds:

cd bin/Linux_aarch64
mpirun -np 4 ./xhpl
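
It can help to capture the output and check the residual tests, each of which should report PASSED; a sketch:

mpirun -np 4 ./xhpl | tee xhpl-test.log
grep PASSED xhpl-test.log   # every residual check should show PASSED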

Edit HPL.dat to optimise

  • Problem size (N): For best performance the problem should be the largest that fits in memory. Our sample system had 16GB in total. A double precision element takes 8 bytes, so there are 125M elements per (decimal) GB, and 16GB holds 2 billion elements; the square root of that is 44721. Some memory has to be left for the operating system and other processes, so as a rule of thumb take around 80% of that figure as a starting point, i.e. about 35777. N should also divide evenly by the block size and the process grid; rounding down to a multiple of NB x P x Q (128 x 6 = 768) gives 35328 (see the sketch after this list). N.B. if the problem size is too large the system will swap and performance will degrade badly.
  • Block Size (NB): HPL uses the block size NB both for the data distribution and for the computational granularity. A very small NB limits computational performance because little data reuse occurs, and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size: 80-216 works well for IA32, 128-192 for IA64 with 3M cache, 400 for IA64 with 4M cache, and 130 for Woodcrest.
  • Process Grid Ratio (P x Q): This depends on the physical interconnect. P and Q should be approximately equal, with Q slightly larger than P, and P x Q should equal the number of available cores. Our sample system had a 6-core CPU, so P=2 and Q=3.
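
The arithmetic above can be scripted; a minimal sketch using bc (installed in the prerequisites), which reproduces the N=35328 used here:

MEM_GB=16   # total RAM in (decimal) GB
NB=128      # chosen block size
PQ=6        # P * Q, the total number of processes
# 125M double precision elements per GB; integer square root via bc
NMAX=$(echo "sqrt($MEM_GB * 125000000)" | bc)   # 44721 for 16GB
N=$(( NMAX * 80 / 100 ))                        # ~80% rule of thumb: 35776
N=$(( N / (NB * PQ) * (NB * PQ) ))              # round down to a multiple of NB*P*Q
echo "Suggested N: $N"                          # 35328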

These numbers can be auto generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/

Example of HPL.dat file from initial testing:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
35328         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
3            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
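
With this HPL.dat in place (P x Q = 6), the full run is launched with six ranks. Note it will take far longer than the smoke test, since N=35328:

cd /root/scratch/hpl-2.1/bin/Linux_aarch64
mpirun -np 6 ./xhpl | tee hpl-full.log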

Results

HPL result for the sample Seattle system:
CPU          Freq     Cores  Memory (GB)  N      NB   P x Q  System          Result (GFLOPS)
AMD Seattle  1.50GHz  6      16           35328  128  2x3    Overdrive-demo  8.963
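
For context, the result can be compared against theoretical peak. The sketch below assumes 4 double precision FLOPs per cycle per core for the Cortex-A57 (an assumption not stated on this page; check the core documentation):

# peak = cores * GHz * DP FLOPs/cycle (assumed 4 per Cortex-A57 core)
echo "6 * 1.5 * 4" | bc                   # 36.0 GFLOPS theoretical peak
echo "scale=2; 8.963 * 100 / 36" | bc     # 24.89 -> ~25% efficiency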