TransferBench - PCIe and XGMI Bandwidth for AMD ROCm
# ROCm 6.0.2, MI210, Ubuntu 22.04
sudo apt install libnuma-dev
git clone https://github.com/ROCm/TransferBench.git
cd TransferBench
make
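If the build succeeds, the binary can be run with no arguments to print the detected CPU/GPU topology (shown further below), and the environment dump at the top of every run can be hidden with HIDE_ENV=1. A minimal sanity check, assuming the binary sits in the repository root and the 64M size is just an illustrative value:
# print the detected topology without running a benchmark
./TransferBench
# quick p2p run with the environment listing suppressed
HIDE_ENV=1 ./TransferBench p2p 64M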
# Execution for a performance run:
NUM_CPU_PER_TRANSFER=8 USE_MEMSET=1 ./TransferBench p2p 200M 26
# Execution for a stability test:
./TransferBench p2p 2G
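Most of TransferBench's behaviour is controlled through environment variables; every run echoes the full list (see the dump in the output below). A few variations of the p2p preset, using only variables that appear in that dump (the values chosen here are illustrative; exact semantics are described in the TransferBench README):
# more timed iterations for steadier numbers
NUM_ITERATIONS=50 ./TransferBench p2p 2G
# switch from the default coarse-grained to fine-grained device memory
USE_FINE_GRAIN=1 ./TransferBench p2p 2G
# let the GPU DMA engines drive the copies instead of GFX (compute) kernels
USE_GPU_DMA=1 ./TransferBench p2p 2G
# cap the number of GPU subexecutors (CUs) used per transfer
NUM_GPU_SE=32 ./TransferBench p2p 2G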
Output
# Run without parameters - shows the two PCIe-connected devices
Detected topology: 1 configured CPU NUMA node(s) [1 total] 2 GPU device(s)
             | NUMA 00 | #Cpus | Closest GPU(s)
-------------+---------+-------+---------------
 NUMA 00 (00)|      10 |    32 | 0,1
        | gfx90a | gfx90a |
        | GPU 00 | GPU 01 | PCIe Bus ID  | #CUs | Closest NUMA | DMA engines
--------+--------+--------+--------------+------+--------------+------------
 GPU 00 |    -   | PCIE-2 | 0000:0a:00.0 |  104 |            0 | 1,2,3,4
 GPU 01 | PCIE-2 |    -   | 0000:0b:00.0 |  104 |            0 | 1,2,3,4
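Both GPUs sit on NUMA node 0 and reach each other over PCIe (the PCIE-2 entries in the link matrix); there is no XGMI bridge in this particular topology, so the peer-to-peer numbers below are bounded by the PCIe Gen4 x16 link rather than Infinity Fabric.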
david@amin-dev-mi210:~/benchmarks/TransferBench$ ./TransferBench p2p 2G
TransferBench v1.50
===============================================================
[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a multiple of 256 bytes to copy
BLOCK_ORDER = 0 : Transfer blocks order: Sequential
BYTE_OFFSET = 0 : Using byte offset of 0
CONTINUE_ON_ERROR = 0 : Stop after first error
CU_MASK = 0 : All
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_UNROLL = 8 : Using GFX unroll factor of 8
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
NUM_CPU_DEVICES = 1 : Using 1 CPU devices
NUM_GPU_DEVICES = 2 : Using 2 GPU devices
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
SHARED_MEM_BYTES = 32769 : Using 32769 shared mem per threadblock
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_PCIE_INDEX = 0 : Use HIP GPU device indexing
USE_PREP_KERNEL = 0 : Using hipMemcpy to initialize source data
USE_SINGLE_STREAM = 1 : Using single stream per device
USE_XCC_FILTER = 0 : XCC filtering disabled
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
[P2P Related]
NUM_CPU_SE = 4 : Using 4 CPU subexecutors
NUM_GPU_SE = 104 : Using 104 GPU subexecutors
P2P_MODE = 0 : Running Unidirectional + Bidirectional
USE_FINE_GRAIN = 0 : Using coarse-grained memory
USE_GPU_DMA = 0 : Using GPU-GFX as GPU executor
USE_REMOTE_READ = 0 : Using SRC as executor
Bytes Per Direction 2147483648
Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
 SRC+EXE\DST    CPU 00    GPU 00    GPU 01
 CPU 00   ->     43.48     22.59     22.49
 GPU 00   ->     26.74    625.92     26.72
 GPU 01   ->     26.74     26.72    628.14

                            CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
 Averages (During UniDir):       N/A     22.54     26.74     26.72
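As a rough point of reference (not part of the tool's output): a PCIe 4.0 x16 link carries 16 GT/s x 16 lanes / 8 = 32 GB/s raw, or about 31.5 GB/s of payload after 128b/130b encoding, so the ~22-27 GB/s host and peer-to-peer figures above amount to roughly 70-85% of the theoretical link bandwidth. The ~626-628 GB/s diagonal entries are device-local copies that stay inside a single MI210's HBM2e.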
Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
 SRC\DST        CPU 00    GPU 00    GPU 01
 CPU 00  ->        N/A     22.41     22.39
 CPU 00  <-        N/A     26.49     26.49
 CPU 00  <->       N/A     48.90     48.88
 GPU 00  ->      26.49       N/A     26.48
 GPU 00  <-      22.32       N/A     26.48
 GPU 00  <->     48.81       N/A     52.96
 GPU 01  ->      26.49     26.48       N/A
 GPU 01  <-      22.42     26.48       N/A
 GPU 01  <->     48.91     52.96       N/A

                            CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
 Averages (During BiDir):        N/A     24.44     24.43     26.48
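The <-> rows are the sum of the two directions: for example, CPU 00 <-> GPU 00 is 22.41 + 26.49 = 48.90 GB/s, and the two GPUs sustain about 53 GB/s between them when transferring both ways at once, i.e. the unidirectional ~26.5 GB/s is roughly preserved per direction under bidirectional load.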