Benchmarking: HPL CUDA Accelerated for Linux64

Requirements
- CUDA 5.0
- GNU Compilers
- Intel MKL
- OpenMPI (configured on a compute node and compiled with GNU compilers)
- Tesla GPUs
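
Before building, it can help to confirm the toolchain is actually visible; a minimal sanity check, assuming the module names used later on this page:
module load cuda/5.0
module load openmpi/1.6.3-gpu
nvcc --version     # CUDA 5.0 toolkit compiler
mpicc --version    # should report gcc (OpenMPI built with GNU compilers)
nvidia-smi         # lists the Tesla GPUs visible on the node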
Download
Download CUDA Accelerated Linpack for Linux64 (hpl-2.0_FERMI_v15.gz) from NVIDIA Developer Portal (https://developer.nvidia.com/)
tar -xzf hpl-2.0_FERMI_v15.gz
cd hpl-2.0_FERMI_v15
Edit Make.CUDA
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
TOPdir = /shared/apps/linpack/hpl-2.0_FERMI_v15
.
.
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS) -----------------------------
# ----------------------------------------------------------------------
LAdir = /shared/apps/intel/mkl/lib/intel64
LAinc = -I/shared/apps/cuda/cuda-5.0/include
LAlib = -L$(TOPdir)/src/cuda -ldgemm \
-L/shared/apps/cuda/cuda-5.0/lib64 -lcublas -lcuda -lcudart \
-L$(LAdir) -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core
.
.
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
CC = mpicc
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp
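
The same edits can be scripted rather than made by hand; a sketch using sed, assuming TOPdir and LAdir appear as plain assignments in Make.CUDA:
# point TOPdir and LAdir at this cluster's paths (adjust as needed)
sed -i 's|^TOPdir *=.*|TOPdir = /shared/apps/linpack/hpl-2.0_FERMI_v15|' Make.CUDA
sed -i 's|^LAdir *=.*|LAdir = /shared/apps/intel/mkl/lib/intel64|' Make.CUDA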
Environment
This step prevents compile problems: the HPL makefiles hard-code /usr/local/cuda/ as the default CUDA directory. You can go through them all and change the path, but it is easier to create a symlink pointing at the actual install location.
cd /usr/local
ln -s /shared/apps/cuda/cuda-5.0 cuda
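
A slightly safer variant that only creates the link when nothing is at /usr/local/cuda yet (root access is assumed):
# create /usr/local/cuda -> /shared/apps/cuda/cuda-5.0 only if absent
if [ ! -e /usr/local/cuda ]; then
    sudo ln -s /shared/apps/cuda/cuda-5.0 /usr/local/cuda
fi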
Build
module load cuda/5.0
module load openmpi/1.6.3-gpu
make arch=CUDA
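
A successful build drops the binary in bin/CUDA; a quick check that it exists and links against the expected CUDA and MKL libraries:
ls -l bin/CUDA/xhpl
ldd bin/CUDA/xhpl | grep -E 'dgemm|cublas|cudart|mkl'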
Modulefile
File location: /shared/modulefiles/gpu/linpack
File name: hpl-2.0_FERMI_v15
#%Module 1.0
#
prereq cuda/5.0
prereq openmpi/1.6.3-gpu
prepend-path PATH /shared/apps/linpack/hpl-2.0_FERMI_v15/bin/CUDA
prepend-path LD_LIBRARY_PATH /shared/apps/intel/mkl/lib/intel64
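
To verify the modulefile is picked up (this assumes /shared/modulefiles is already on the module search path):
module avail gpu/linpack
module show gpu/linpack/hpl-2.0_FERMI_v15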
Run (from GPU node)
Load necessary modules
module load cuda/5.0
module load openmpi/1.6.3-gpu
module load gpu/linpack/hpl-2.0_FERMI_v15
Copy the HPL.dat file to a local working directory and make changes if required.
cp /shared/apps/linpack/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat .
Run Linpack. The process count given to mpirun should match the P x Q grid configured in HPL.dat, with one MPI process per GPU.
mpirun -np 2 run_linpack
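
On a cluster with a batch scheduler the same run can be wrapped in a job script; a hypothetical sketch for Torque/PBS (the scheduler, resource line, and job name are all assumptions, not part of this setup):
#!/bin/bash
#PBS -N hpl-gpu
#PBS -l nodes=1:ppn=16
cd $PBS_O_WORKDIR
module load cuda/5.0 openmpi/1.6.3-gpu gpu/linpack/hpl-2.0_FERMI_v15
mpirun -np 1 run_linpack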
Additional files
Output from a single GPU run (using 1x K20 GPU)
[david@compute021 hpl-gpu]$ ./run_linpack
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 82897
NB : 768
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR10L2L2 82897 768 1 1 340.72 1.115e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0037839 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
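
As a sanity check on the reported rate: HPL computes Gflops from the standard LU operation count, (2/3)N^3 + 2N^2 floating-point operations, divided by wall time. Recomputing it from the N and Time columns above:
# recompute the Gflops column for the run above
awk -v n=82897 -v t=340.72 \
    'BEGIN { printf "%.3e Gflops\n", (2/3 * n^3 + 2 * n^2) / t / 1e9 }'
# prints 1.115e+03, matching the WR10L2L2 row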
run_linpack script for single GPU run

#!/bin/bash
#location of HPL
HPL_DIR=/shared/apps/linpack/hpl-2.0_FERMI_v15/
# Number of CPU cores ( per GPU used = per MPI process )
CPU_CORES_PER_GPU=16
# FOR MKL
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR GOTO
export GOTO_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR OMP
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU
export MKL_DYNAMIC=FALSE
# hint: for 2050 or 2070 card
# try 350/(350 + MKL_NUM_THREADS*4*cpu frequency in GHz)
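# e.g. with 16 threads on a hypothetical 2.6 GHz CPU:
#   350 / (350 + 16*4*2.6) = 350 / 516.4 ~= 0.68
# a faster card such as the K20 used here can take a larger share of the
# DGEMM work than the 2050/2070 this formula targets, hence 0.80 below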
export CUDA_DGEMM_SPLIT=0.80
# hint: try CUDA_DGEMM_SPLIT - 0.10
export CUDA_DTRSM_SPLIT=0.70
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
#$HPL_DIR/bin/CUDA/xhpl
mpirun -np 1 -host $HOSTNAME ./xhpl

HPL.dat file
[david@compute021 hpl-gpu]$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
82897 Ns
1 # of NBs
768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 1 2 1 Ps
1 2 2 4 Qs # (2 2 2 4 Qs. for the dual GPU run)
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
1 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
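
When adapting HPL.dat to another node, N is usually chosen so the N x N double-precision matrix (8 bytes per element) fills most of host RAM while leaving headroom for the OS; a sketch with the memory size as an assumed parameter:
# hypothetical sizing helper: N for a matrix filling ~80% of gib GiB of RAM
awk -v gib=64 'BEGIN { printf "N ~= %d\n", sqrt(0.80 * gib * 2^30 / 8) }'
# gib=64 happens to reproduce the N = 82897 used above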