= Seattle: Building and running HPL =
== Pre-Requisites ==
The default O/S image provided by AMD is very minimal and doesn't include many packages at all. The following should be installed:
<syntaxhighlight>
yum groupinstall 'Development Tools'
yum install tar bzip2 bc g++
</syntaxhighlight>
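A quick check that the toolchain is in place before starting the long builds below; a Fortran compiler is needed later to link HPL with <code>mpif90</code>, so if the Development Tools group on your image does not pull it in, install <code>gcc-gfortran</code> as well:
<syntaxhighlight>
gcc --version
gfortran --version
make --version
bc --version
</syntaxhighlight>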
== Install OpenMPI ==

Either install '''<code>openmpi</code>''' from the repos, build a copy manually, or choose an alternative MPI.
<syntaxhighlight>
yum install openmpi openmpi-devel
</syntaxhighlight>

== Set up user environment ==
<syntaxhighlight>
# add to the end of your ~/.bashrc
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib:/usr/include/openmpi-aarch64/
</syntaxhighlight>
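Once the environment is reloaded, a quick smoke test that the MPI wrappers and launcher work (this assumes the package layout shown above):
<syntaxhighlight>
source ~/.bashrc
which mpicc mpif90 mpirun
mpirun --version

# launch a trivial command across all 6 cores to confirm mpirun works
mpirun -np 6 hostname
</syntaxhighlight>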
== Download & build ATLAS ==

Download ATLAS from: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/atlas3.10.1.tar.bz2/download
Extract the contents of the tarball:
<syntaxhighlight>
cd /root/scratch
wget http://downloads.sourceforge.net/project/math-atlas/Stable/3.10.1/atlas3.10.1.tar.bz2
tar jxvf atlas3.10.1.tar.bz2
</syntaxhighlight>

Build ATLAS. N.B. this will take a long time: on our sample Seattle system it took in excess of 5 hours, so it is worth running the build process in a screen session or with nohup (a sketch follows the build commands below).
<syntaxhighlight>
cd /root/scratch
mkdir atlas3.10.1_build
cd atlas3.10.1_build/
/root/scratch/ATLAS/configure --prefix=/opt/atlas/3.10.1
make
make check
</syntaxhighlight>
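As noted above, the <code>make</code> step can run for several hours; a minimal sketch for detaching it from the login session with nohup (the log file name is arbitrary):
<syntaxhighlight>
cd /root/scratch/atlas3.10.1_build
nohup make > atlas-build.log 2>&1 &
# follow progress; Ctrl-C stops the tail, the build keeps running
tail -f atlas-build.log
</syntaxhighlight>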
== Download & Build Linpack ==

Latest versions are available from: http://www.netlib.org/benchmark/hpl/
Download and decompress HPL:
<syntaxhighlight>
cd /root/scratch
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar zxvf hpl-2.1.tar.gz
</syntaxhighlight>

Create the Makefile:
<syntaxhighlight>
cd hpl-2.1
cp setup/Make.Linux_PII_CBLAS Make.Linux_aarch64
</syntaxhighlight>

Edit the following items (approximate line numbers shown):
<syntaxhighlight>
---
 64 ARCH   = Linux_aarch64
---
 70 TOPdir = $(HOME)/scratch/hpl-2.1
---
 84 MPdir  = /usr/lib64/openmpi
 85 MPinc  = -I/usr/include/openmpi-aarch64
 86 MPlib  = $(MPdir)/lib/libmpi.so
---
 95 LAdir  = /root/scratch/atlas3.10.1_build/lib
---
176 LINKER = /usr/lib64/openmpi/bin/mpif90
</syntaxhighlight>
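The stock <code>Make.Linux_PII_CBLAS</code> template already links ATLAS through <code>LAlib</code>, so usually only <code>LAdir</code> needs changing; if your ATLAS libraries live somewhere else (for example under the <code>--prefix</code> used above), the relevant lines look roughly like this (line numbers approximate, values shown are the template defaults and may differ in your copy):
<syntaxhighlight>
 96 LAinc  =
 97 LAlib  = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
</syntaxhighlight>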
Build:
<syntaxhighlight>
cd /root/scratch/hpl-2.1
make arch=Linux_aarch64
</syntaxhighlight>
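A quick way to confirm that the build produced a binary and that it picks up the OpenMPI shared library (the directory name follows the <code>arch=</code> value used above):
<syntaxhighlight>
ls -l bin/Linux_aarch64/xhpl
# should resolve libmpi.so from /usr/lib64/openmpi/lib
ldd bin/Linux_aarch64/xhpl | grep mpi
</syntaxhighlight>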
Test run using the default HPL.dat file (which is for 4 cores) to ensure the build works OK. The performance won't be great, but HPL should run and complete within a few seconds:
<syntaxhighlight>
cd bin/Linux_aarch64
mpirun -np 4 ./xhpl
</syntaxhighlight>
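The results go to stdout (device out = 6 in HPL.dat), so it can be handy to capture them and pull out the key lines; a small sketch (hpl-test.log is an arbitrary name, and result lines start with a T/V label such as WR11C2R4 when the default wall-clock timing is used):
<syntaxhighlight>
mpirun -np 4 ./xhpl | tee hpl-test.log

# the Gflops result line(s) and the residual checks
grep -E "WR|PASSED|FAILED" hpl-test.log
</syntaxhighlight>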
== Edit the HPL.dat to optimise ==

* '''Problem size (N):''' For best performance the problem size should be close to the largest that fits in memory. Our sample system had a total of '''16GB'''. There are 125M double-precision elements per 1GB of memory, so 16GB of RAM holds '''2 billion double-precision elements''', and the square root of that gives a maximum N of about '''44721'''. Some memory must be left for the operating system and other processes; as a rule of thumb, about 80% of that maximum N is a starting point, which here is '''35777'''. Rounding down to a multiple of the block size NB (and keeping N / (P * Q) an integer) gives '''35328''' (276 x 128) as a reasonable number; a small script for this arithmetic follows this list. N.B. if the problem size is too large it will be swapped out and performance will degrade.
* '''Block Size (NB):''' HPL uses the block size NB for the data distribution as well as for the computational granularity. A very small NB limits computational performance because little data reuse occurs and the number of messages increases. "Good" block sizes are almost always in the [32 .. 256] interval and depend on cache size; values found to work well are 80-216 for IA32, 128-192 for IA64 with 3MB cache, 400 for IA64 with 4MB cache, and 130 for Woodcrest.
* '''Process Grid Ratio (P x Q):''' This depends on the physical interconnection network. P and Q should be approximately equal, with Q slightly larger than P, and P * Q should equal the number of available cores. Our sample system had a 6-core CPU, so '''P = 2''' and '''Q = 3'''.
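A minimal sketch of the problem-size arithmetic using <code>bc</code> (installed in the pre-requisites). The memory size and block size are inputs you would adjust; the value you finally settle on (35328 in the example below) may be a little lower still to leave extra head-room:
<syntaxhighlight>
#!/bin/sh
MEM_GB=16   # total system memory in GB (sample system)
NB=128      # chosen block size

# largest N that fits: sqrt(bytes of RAM / 8 bytes per double)
NMAX=$(echo "sqrt($MEM_GB * 1000000000 / 8)" | bc)
# ~80% of that as a starting point, rounded down to a multiple of NB
N=$(echo "($NMAX * 80 / 100 / $NB) * $NB" | bc)
echo "NMAX=$NMAX  suggested starting N=$N"
</syntaxhighlight>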
Alternatively, these numbers can be auto-generated by a number of online tools, for example: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
=== Example of HPL.dat file from initial testing ===
<syntaxhighlight>
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
35328 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
3 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
</syntaxhighlight>
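With this HPL.dat in place the number of MPI ranks has to match P x Q (2 x 3 = 6 here), so the tuned run on our sample system looks like this (same build directory as above):
<syntaxhighlight>
cd /root/scratch/hpl-2.1/bin/Linux_aarch64
mpirun -np 6 ./xhpl
</syntaxhighlight>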
== Results ==

{| class="wikitable" style="text-align:center; width:100%;"
|+ HPL results for a single SoC system
|-
! colspan="9" | AMD Seattle
|-
! CPU || Freq || Cores || Memory (GB) || N || NB || PxQ || System || Result (GFLOPS)
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 35328 || 128 || 2x3 || Overdrive-demo || 8.963
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 128 || 2x3 || Overdrive-demo || 9.054
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 256 || 2x3 || Overdrive-demo || 9.738
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 38010 || 384 || 2x3 || Overdrive-demo || 9.809
|-
| scope="row" | AMD Seattle || 1.50GHz || 6 || 16 || 43200 || 384 || 2x3 || Overdrive-demo || 9.832
|}