TransferBench - PCIe and XGMI Bandwidth for AMD ROCm
# ROCm 6.0.2, MI210, Ubuntu 22.04
sudo apt install libnuma-dev
git clone https://github.com/ROCm/TransferBench.git
cd TransferBench
make
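If the build succeeds, the binary can be run with no arguments to print the detected CPU/GPU topology (shown further below), and the environment dump at the top of every run can be hidden with HIDE_ENV=1. A minimal sanity check, assuming the binary sits in the repository root and the 64M size is just an illustrative value:
# print the detected topology without running a benchmark
./TransferBench
# quick p2p run with the environment listing suppressed
HIDE_ENV=1 ./TransferBench p2p 64M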
# Execution for a performance run:
NUM_CPU_PER_TRANSFER=8 USE_MEMSET=1 ./TransferBench p2p 200M 26
# Execution for a stability test:
./TransferBench p2p 2G
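Most of TransferBench's behaviour is controlled through environment variables; every run echoes the full list (see the dump in the output below). A few variations of the p2p preset, using only variables that appear in that dump (the values chosen here are illustrative; exact semantics are described in the TransferBench README):
# more timed iterations for steadier numbers
NUM_ITERATIONS=50 ./TransferBench p2p 2G
# switch from the default coarse-grained to fine-grained device memory
USE_FINE_GRAIN=1 ./TransferBench p2p 2G
# let the GPU DMA engines drive the copies instead of GFX (compute) kernels
USE_GPU_DMA=1 ./TransferBench p2p 2G
# cap the number of GPU subexecutors (CUs) used per transfer
NUM_GPU_SE=32 ./TransferBench p2p 2G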
Output
# Run without parameters - shows the two PCIe-connected devices
Detected topology: 1 configured CPU NUMA node(s) [1 total] 2 GPU device(s)
             | NUMA 00 | #Cpus | Closest GPU(s)
-------------+---------+-------+---------------
 NUMA 00 (00)|      10 |    32 | 0,1
        | gfx90a | gfx90a |
        | GPU 00 | GPU 01 | PCIe Bus ID  | #CUs | Closest NUMA | DMA engines
--------+--------+--------+--------------+------+--------------+------------
 GPU 00 |    -   | PCIE-2 | 0000:0a:00.0 |  104 |            0 | 1,2,3,4
 GPU 01 | PCIE-2 |    -   | 0000:0b:00.0 |  104 |            0 | 1,2,3,4
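Both GPUs sit on NUMA node 0 and reach each other over PCIe (the PCIE-2 entries in the link matrix); there is no XGMI bridge in this particular topology, so the peer-to-peer numbers below are bounded by the PCIe Gen4 x16 link rather than Infinity Fabric.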
david@amin-dev-mi210:~/benchmarks/TransferBench$ ./TransferBench p2p 2G
TransferBench v1.50
===============================================================
[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a multiple of 256 bytes to copy
BLOCK_ORDER = 0 : Transfer blocks order: Sequential
BYTE_OFFSET = 0 : Using byte offset of 0
CONTINUE_ON_ERROR = 0 : Stop after first error
CU_MASK = 0 : All
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_UNROLL = 8 : Using GFX unroll factor of 8
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
NUM_CPU_DEVICES = 1 : Using 1 CPU devices
NUM_GPU_DEVICES = 2 : Using 2 GPU devices
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
SHARED_MEM_BYTES = 32769 : Using 32769 shared mem per threadblock
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_PCIE_INDEX = 0 : Use HIP GPU device indexing
USE_PREP_KERNEL = 0 : Using hipMemcpy to initialize source data
USE_SINGLE_STREAM = 1 : Using single stream per device
USE_XCC_FILTER = 0 : XCC filtering disabled
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
[P2P Related]
NUM_CPU_SE = 4 : Using 4 CPU subexecutors
NUM_GPU_SE = 104 : Using 104 GPU subexecutors
P2P_MODE = 0 : Running Unidirectional + Bidirectional
USE_FINE_GRAIN = 0 : Using coarse-grained memory
USE_GPU_DMA = 0 : Using GPU-GFX as GPU executor
USE_REMOTE_READ = 0 : Using SRC as executor
Bytes Per Direction 2147483648
Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
 SRC+EXE\DST    CPU 00    GPU 00    GPU 01
 CPU 00   ->     43.48     22.59     22.49
 GPU 00   ->     26.74    625.92     26.72
 GPU 01   ->     26.74     26.72    628.14

                            CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
 Averages (During UniDir):       N/A     22.54     26.74     26.72
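As a rough point of reference (not part of the tool's output): a PCIe 4.0 x16 link carries 16 GT/s x 16 lanes / 8 = 32 GB/s raw, or about 31.5 GB/s of payload after 128b/130b encoding, so the ~22-27 GB/s host and peer-to-peer figures above amount to roughly 70-85% of the theoretical link bandwidth. The ~626-628 GB/s diagonal entries are device-local copies that stay inside a single MI210's HBM2e.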
Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
 SRC\DST        CPU 00    GPU 00    GPU 01
 CPU 00  ->        N/A     22.41     22.39
 CPU 00  <-        N/A     26.49     26.49
 CPU 00  <->       N/A     48.90     48.88
 GPU 00  ->      26.49       N/A     26.48
 GPU 00  <-      22.32       N/A     26.48
 GPU 00  <->     48.81       N/A     52.96
 GPU 01  ->      26.49     26.48       N/A
 GPU 01  <-      22.42     26.48       N/A
 GPU 01  <->     48.91     52.96       N/A

                            CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
 Averages (During BiDir):        N/A     24.44     24.43     26.48
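The <-> rows are the sum of the two directions: for example, CPU 00 <-> GPU 00 is 22.41 + 26.49 = 48.90 GB/s, and the two GPUs sustain about 53 GB/s between them when transferring both ways at once, i.e. the unidirectional ~26.5 GB/s is roughly preserved per direction under bidirectional load.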