Seattle: Building and running HPL

== Prerequisites ==

The default O/S image provided by AMD is minimal and does not include many packages. The following should be installed first:

<syntaxhighlight>
yum groupinstall 'Development Tools'
yum install tar bzip2 bc gcc-c++
</syntaxhighlight>
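
A quick way to confirm the toolchain is in place before going further; gfortran should come with the 'Development Tools' group and is needed later when HPL links via mpif90:

<syntaxhighlight>
gcc --version
gfortran --version
make --version
</syntaxhighlight>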

== Install OpenMPI ==

Either install openmpi from the repositories, build it manually, or choose an alternative MPI implementation.

<syntaxhighlight>
yum install openmpi openmpi-devel
</syntaxhighlight>

== Set up user environment ==

<syntaxhighlight>
# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
</syntaxhighlight>
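
After sourcing the updated ~/.bashrc, it is worth checking that the MPI wrappers and runtime resolve from the new PATH (the version reported will depend on the packaged OpenMPI):

<syntaxhighlight>
source ~/.bashrc
which mpicc mpif90 mpirun
mpirun --version
</syntaxhighlight>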

== Download & build ATLAS ==

Download ATLAS from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download

Download and extract the tarball:

<syntaxhighlight>
cd /root/scratch
wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2
tar jxvf atlas3.10.1.tar.bz2
</syntaxhighlight>

Build ATLAS. N.B. this will take a long time; on our sample Seattle system it took in excess of 5 hours, so it is worth running the build in a screen session or with nohup (see the sketch after the commands below).

<syntaxhighlight>
cd /root/scratch
mkdir atlas3.10.1_build
cd atlas3.10.1_build/
/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
make
make check
</syntaxhighlight>
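
Since the build runs for hours, one way to detach it is nohup with a log file. Afterwards, the static libraries HPL will link against should be present in lib/; a sketch, with library names as produced by a stock ATLAS build:

<syntaxhighlight>
cd /root/scratch/atlas3.10.1_build
nohup make > build.log 2>&1 &
tail -f build.log

# once complete, HPL links against these:
ls -l lib/libcblas.a lib/libatlas.a
</syntaxhighlight>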

== Download & Build Linpack ==

Latest versions are available from: http://www.netlib.org/benchmark/hpl/

Download and decompress HPL:

<syntaxhighlight>
cd /root/scratch
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar zxvf hpl-2.1.tar.gz
</syntaxhighlight>

Create the Makefile:

<syntaxhighlight>
cd hpl-2.1
cp setup/Make.Linux_PII_CBLAS Make.Linux_aarch64
</syntaxhighlight>

Edit the following items in Make.Linux_aarch64 (approximate line numbers shown):

<syntaxhighlight>
---
64  ARCH         = Linux_aarch64
---
70  TOPdir       = $(HOME)/scratch/hpl-2.1
---
84  MPdir        = /usr/lib64/openmpi
85  MPinc        = -I/usr/include/openmpi-aarch64
86  MPlib        = $(MPdir)/lib/libmpi.so
---
95  LAdir        = /root/scratch/atlas3.10.1_build/lib
---
176 LINKER       = /usr/lib64/openmpi/bin/mpif90
</syntaxhighlight>
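
Note that only LAdir needs editing because the stock Make.Linux_PII_CBLAS template already references it when naming the ATLAS libraries; the relevant lines should look something like this (approximate line numbers, as above - verify against your copy):

<syntaxhighlight>
96  LAinc        =
97  LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
</syntaxhighlight>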

Build:

<syntaxhighlight>
cd /root/scratch/hpl-2.1
make arch=Linux_aarch64
</syntaxhighlight>
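
The build drops the benchmark binary into a directory named after the ARCH value, which is where the next step runs it from:

<syntaxhighlight>
ls -l /root/scratch/hpl-2.1/bin/Linux_aarch64/xhpl
</syntaxhighlight>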

Do a test run using the default HPL.dat file (which is set up for 4 cores) to ensure the build works. Performance won't be great, but HPL should run to completion within a few seconds:

<syntaxhighlight>
cd bin/Linux_aarch64
mpirun -np 4 ./xhpl
</syntaxhighlight>
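
The default HPL.dat runs several test combinations; to quickly confirm the residual checks succeeded without reading the full report, filter for the pass/fail markers HPL prints with each result:

<syntaxhighlight>
mpirun -np 4 ./xhpl | grep -E 'PASSED|FAILED'
</syntaxhighlight>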

== Edit the HPL.dat to optimise ==

* '''Problem size (N):''' For best performance, the problem size should be the largest that fits in memory. Our sample system had a total of '''16GB'''. At 8 bytes per element there are 125M double precision elements per 1GB of memory, so 16GB of RAM holds '''2 billion double precision elements'''; the square root of that is '''44721''', the largest N whose matrix fits in RAM. Some memory must be left for the operating system and other processes, so as a rule of thumb take about 80% of that maximum N as a starting point - here roughly '''35777''' (a short calculation is sketched after this list). N / (P * Q) needs to be an integer, so '''35328''' is a reasonable number. N.B. if the problem size is too large, the matrix gets swapped out and performance degrades badly.
* '''Block Size (NB):''' HPL uses the block size NB for the data distribution as well as for the computational granularity. A very small NB limits performance because little data reuse occurs and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size. The following block sizes have been found to work well: 80-216 for IA32; 128-192 for IA64 with 3MB cache; 400 for IA64 with 4MB cache; and 130 for Woodcrest.
* '''Process Grid Ratio (P x Q):''' This depends on the physical interconnect. P and Q should be approximately equal, with Q slightly larger than P, and P * Q should equal the number of available cores. Our sample system had a 6-core CPU, so '''P=2''' and '''Q=3'''.
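
The problem-size arithmetic above can be scripted with bc (installed in the prerequisites). A minimal sketch assuming 16GB of RAM and NB=128 as in this example; the final rounding (down to a multiple of NB) is simpler than the hand-picked 35328 above, so the suggested N differs slightly:

<syntaxhighlight>
mem_gb=16     # total RAM
nb=128        # block size (NB)

# 8 bytes per double => 125,000,000 elements per GB
elements=$(echo "$mem_gb * 125000000" | bc)

# largest N whose N x N matrix fits in RAM (~44721 here)
nmax=$(echo "sqrt($elements)" | bc)

# rule of thumb: ~80% of the maximum, leaving room for the O/S (~35777)
n=$(echo "$nmax * 80 / 100" | bc)

# round down to a multiple of NB so the matrix tiles evenly
echo "suggested N: $(( n / nb * nb ))"
</syntaxhighlight>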

These numbers can be auto-generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/

Example of HPL.dat file from initial testing:

<syntaxhighlight>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
35328        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
3            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
</syntaxhighlight>
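
With the grid above (P=2, Q=3) the benchmark needs six MPI ranks. Assuming this HPL.dat has replaced the default one in the build's bin directory:

<syntaxhighlight>
cd /root/scratch/hpl-2.1/bin/Linux_aarch64
mpirun -np 6 ./xhpl
</syntaxhighlight>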