Benchmarking: HPL CUDA Accelerated for Linux64

Requirements
- CUDA 5.0
- GNU Compilers
- Intel MKL
- OpenMPI (configured on a compute node and compiled with GNU compilers)
- Tesla GPUs
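
Before building, it can help to confirm the toolchain is actually visible; a minimal sanity check, assuming the module names used later on this page:
module load cuda/5.0
module load openmpi/1.6.3-gpu
nvcc --version     # CUDA 5.0 toolkit compiler
mpicc --version    # should report gcc (OpenMPI built with GNU compilers)
nvidia-smi         # lists the Tesla GPUs visible on the node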
Download
Download CUDA Accelerated Linpack for Linux64 (hpl-2.0_FERMI_v15.gz) from NVIDIA Developer Portal (https://developer.nvidia.com/)
tar -xzf hpl-2.0_FERMI_v15.gz
cd hpl-2.0_FERMI_v15
Edit Make.CUDA
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
TOPdir = /shared/apps/linpack/hpl-2.0_FERMI_v15
.
.
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS) -----------------------------
# ----------------------------------------------------------------------
LAdir = /shared/apps/intel/mkl/lib/intel64
LAinc = -I/shared/apps/cuda/cuda-5.0/include
LAlib = -L$(TOPdir)/src/cuda -ldgemm \
-L/shared/apps/cuda/cuda-5.0/lib64 -lcublas -lcuda -lcudart \
-L$(LAdir) -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core
.
.
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
CC = mpicc
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp
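
The same edits can be scripted rather than made by hand; a sketch using sed, assuming TOPdir and LAdir appear as plain assignments in Make.CUDA:
# point TOPdir and LAdir at this cluster's paths (adjust as needed)
sed -i 's|^TOPdir *=.*|TOPdir = /shared/apps/linpack/hpl-2.0_FERMI_v15|' Make.CUDA
sed -i 's|^LAdir *=.*|LAdir = /shared/apps/intel/mkl/lib/intel64|' Make.CUDA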
Environment
This step prevents compile problems: the HPL makefiles hard-code /usr/local/cuda/ as the default CUDA directory. You can go through them all and change the path, but it is easier to create a symlink pointing at the actual install location.
cd /usr/local
ln -s /shared/apps/cuda/cuda-5.0 cuda
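
A slightly safer variant that only creates the link when nothing is at /usr/local/cuda yet (root access is assumed):
# create /usr/local/cuda -> /shared/apps/cuda/cuda-5.0 only if absent
if [ ! -e /usr/local/cuda ]; then
    sudo ln -s /shared/apps/cuda/cuda-5.0 /usr/local/cuda
fi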
Build
module load cuda/5.0
module load openmpi/1.6.3-gpu
make arch=CUDA
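
A successful build drops the binary in bin/CUDA; a quick check that it exists and links against the expected CUDA and MKL libraries:
ls -l bin/CUDA/xhpl
ldd bin/CUDA/xhpl | grep -E 'dgemm|cublas|cudart|mkl'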
Modulefile
File location: /shared/modulefiles/gpu/linpack
File name: hpl-2.0_FERMI_v15
#%Module 1.0
#
prereq cuda/5.0
prereq openmpi/1.6.3-gpu
prepend-path PATH /shared/apps/linpack/hpl-2.0_FERMI_v15/bin/CUDA
prepend-path LD_LIBRARY_PATH /shared/apps/intel/mkl/lib/intel64
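
To verify the modulefile is picked up (this assumes /shared/modulefiles is already on the module search path):
module avail gpu/linpack
module show gpu/linpack/hpl-2.0_FERMI_v15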
Run (from GPU node)
Load necessary modules
module load cuda/5.0
module load openmpi/1.6.3-gpu
module load gpu/linpack/hpl-2.0_FERMI_v15
Copy the HPL.dat file to a local working directory and make changes if required.
cp /shared/apps/linpack/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat .
Run Linpack. The process count given to mpirun should match the P x Q grid configured in HPL.dat, with one MPI process per GPU.
mpirun -np 2 run_linpack
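
On a cluster with a batch scheduler the same run can be wrapped in a job script; a hypothetical sketch for Torque/PBS (the scheduler, resource line, and job name are all assumptions, not part of this setup):
#!/bin/bash
#PBS -N hpl-gpu
#PBS -l nodes=1:ppn=16
cd $PBS_O_WORKDIR
module load cuda/5.0 openmpi/1.6.3-gpu gpu/linpack/hpl-2.0_FERMI_v15
mpirun -np 1 run_linpack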
Additional files
Output from a single GPU run (using 1x K20 GPU)
[david@compute021 hpl-gpu]$ ./run_linpack
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 82897
NB : 768
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR10L2L2 82897 768 1 1 340.72 1.115e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0037839 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
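
As a sanity check on the reported rate: HPL computes Gflops from the standard LU operation count, (2/3)N^3 + 2N^2 floating-point operations, divided by wall time. Recomputing it from the N and Time columns above:
# recompute the Gflops column for the run above
awk -v n=82897 -v t=340.72 \
    'BEGIN { printf "%.3e Gflops\n", (2/3 * n^3 + 2 * n^2) / t / 1e9 }'
# prints 1.115e+03, matching the WR10L2L2 row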
run_linpack script for single GPU run

#!/bin/bash
#location of HPL
HPL_DIR=/shared/apps/linpack/hpl-2.0_FERMI_v15/
# Number of CPU cores ( per GPU used = per MPI process )
CPU_CORES_PER_GPU=16
# FOR MKL
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR GOTO
export GOTO_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR OMP
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU
export MKL_DYNAMIC=FALSE
# hint: for 2050 or 2070 card
# try 350/(350 + MKL_NUM_THREADS*4*cpu frequency in GHz)
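# e.g. with 16 threads on a hypothetical 2.6 GHz CPU:
#   350 / (350 + 16*4*2.6) = 350 / 516.4 ~= 0.68
# a faster card such as the K20 used here can take a larger share of the
# DGEMM work than the 2050/2070 this formula targets, hence 0.80 below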
export CUDA_DGEMM_SPLIT=0.80
# hint: try CUDA_DGEMM_SPLIT - 0.10
export CUDA_DTRSM_SPLIT=0.70
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
#$HPL_DIR/bin/CUDA/xhpl
mpirun -np 1 -host $HOSTNAME ./xhpl

HPL.dat file
[david@compute021 hpl-gpu]$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
82897 Ns
1 # of NBs
768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 1 2 1 Ps
1 2 2 4 Qs # (2 2 2 4 Qs. for the dual GPU run)
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
1 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
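
When adapting HPL.dat to another node, N is usually chosen so the N x N double-precision matrix (8 bytes per element) fills most of host RAM while leaving headroom for the OS; a sketch with the memory size as an assumed parameter:
# hypothetical sizing helper: N for a matrix filling ~80% of gib GiB of RAM
awk -v gib=64 'BEGIN { printf "N ~= %d\n", sqrt(0.80 * gib * 2^30 / 8) }'
# gib=64 happens to reproduce the N = 82897 used above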