Benchmarking: HPL on a GPU using CUDA

Source and Build Instructions PDF are located on PDD: HPC Benchmarking/Applications/hpl-cuda

PDD Link: <file>\\srv-vfs2\PDD_DATA\Product Development\High Performance Computing\HPC Benchmarking\Applications\hpl-cuda</file>

Build Source
  • Built using:
    • Platform MPI (/opt/platform_mpi)
    • Intel MKL (/shared/intel/composer-2011, 12.0 compilers)
    • CUDA 4.0 (/usr/local/cuda)
  • Untar/gunzip the source archive, cd into the resulting directory (a brief sketch of this step follows), and edit the Make.CUDA file
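A minimal sketch of the unpack step (the archive name is assumed from the TOPdir path below; adjust to match the actual tarball):
tar xzf hpl-2.0_FERMI_v13.tgz    # archive name is an assumption
cd hpl-2.0_FERMI_v13
vi Make.CUDA                     # apply the edits below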
# TOPdir, around line 103
ifndef  TOPdir
TOPdir = /home/david/benchmarking/hpl-2.0_FERMI_v13
endif

# MPI section (set to Platform MPI rather than the default OpenMPI)
MPdir        = /opt/platform_mpi/
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/linux_amd64/libmpi.so

# MKL LAdir/inc/lib
LAdir        = /shared/intel/composerxe-2011/mkl/lib/intel64/
LAinc        =
LAlib        = -L $(TOPdir)/src/cuda  -ldgemm -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5

# next two lines for Intel Compilers:
CC      = mpicc
CCFLAGS = $(HPL_DEFS) -O3 -axS -w -fomit-frame-pointer -funroll-loops -openmp

# the rest of the file should be OK as unpacked; build using make
Build the binaries
make 
# which produces the following files
[david@vhpchead hpl-2.0_FERMI_v13]$ find bin/
bin/
bin/CUDA
bin/CUDA/xhpl
bin/CUDA/HPL.dat
bin/CUDA/HPL.dat_example
bin/CUDA/run_linpack
bin/CUDA/output_example
bin/CUDA/._HPL.dat
bin/CUDA/._run_linpack
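The ._HPL.dat and ._run_linpack entries are macOS resource-fork leftovers from the archive and can be ignored. As a quick sanity check that xhpl linked against the CUDA dgemm wrapper, the CUDA runtime and MKL (a sketch; output will vary by system):
ldd bin/CUDA/xhpl | grep -E 'dgemm|cublas|mkl'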
Edit run_linpack script
  • In bin/CUDA/run_linpack, check that the following is set:
#!/bin/bash

#location of HPL 
HPL_DIR=/home/david/benchmarking/hpl-2.0_FERMI_v13

# Number of CPU cores ( per GPU used = per MPI process )
CPU_CORES_PER_GPU=4

# FOR MKL
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR GOTO
export GOTO_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR OMP
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU

export MKL_DYNAMIC=FALSE

# hint: for 2050 or 2070 card
#       try 350/(350 + MKL_NUM_THREADS*4*cpu frequency in GHz) 
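#       e.g. (illustrative, assuming a 2.4 GHz E5620 with MKL_NUM_THREADS=8):
#       350/(350 + 8*4*2.4) = 350/426.8 ~ 0.82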
export CUDA_DGEMM_SPLIT=0.80

# hint: try CUDA_DGEMM_SPLIT - 0.10
export CUDA_DTRSM_SPLIT=0.70

export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH

$HPL_DIR/bin/CUDA/xhpl
Run on a Single GPU
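One MPI rank drives one GPU, so a single-GPU run uses one process and a 1x1 process grid (Ps=1, Qs=1 in HPL.dat). A minimal launch sketch, assuming Platform MPI's mpirun lives under the /opt/platform_mpi prefix noted above:
cd /home/david/benchmarking/hpl-2.0_FERMI_v13/bin/CUDA   # xhpl reads HPL.dat from the current directory
/opt/platform_mpi/bin/mpirun -np 1 ./run_linpack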
Results
  • From an E5620 system with 2x M2075 GPUs
# CPU_CORES_PER_GPU=8
# CUDA_DGEMM_SPLIT=0.80
# CUDA_DTRSM_SPLIT=0.70
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032  1024     1     2            1170.08              7.184e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0041656 ...... PASSED
================================================================================
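The P=1, Q=2 grid above corresponds to two MPI ranks, one per M2075. A launch along the same lines as the single-GPU sketch, with Ps=1 and Qs=2 in HPL.dat and CPU_CORES_PER_GPU=8 in run_linpack:
/opt/platform_mpi/bin/mpirun -np 2 ./run_linpack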