Linux: Checking the Infiniband Fabric
- Assuming Platform Cluster Manager 3.2 is installed; otherwise, make sure the latest version of OFED, which includes Open MPI, is installed.
Check the IB links
Use the ibstatus command to check the current state of the IB link:
[david@compute000 imb]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0030:48ff:ffff:e57d
base lid: 0x0
sm lid: 0x0
state: 2: INIT
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
In this instance the state: field has only reached INIT. The physical link is up (phys state: LinkUp), but base lid and sm lid are both 0x0, which typically means the port has not been configured by a subnet manager. MPI performance tests will then produce warnings (check the output from Open MPI's mpirun for clues, such as the warning shown below).
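The state check described above can be scripted. The following is a minimal sketch, assuming ibstatus prints its fields in the layout shown earlier; the function name ib_inactive_ports is hypothetical and not part of OFED.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OFED): reads ibstatus output on stdin and
# prints one line for every port whose logical state is not ACTIVE.
# Returns 0 when all ports are ACTIVE, 1 otherwise.
ib_inactive_ports() {
    awk '
        BEGIN { bad = 0 }
        # Header line: Infiniband device <quoted name> port <N> status:
        /^Infiniband device/ {
            dev = $3
            gsub(/[^A-Za-z0-9_]/, "", dev)   # drop the surrounding quotes
            port = $5
        }
        # Logical state line, e.g. "state: 2: INIT"; "phys state:" is skipped
        # because the pattern only allows whitespace before "state:".
        /^[ \t]*state:/ {
            if ($NF != "ACTIVE") {
                printf "%s port %s is %s\n", dev, port, $NF
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

A typical invocation would be `ibstatus | ib_inactive_ports`, which prints nothing when every port is ACTIVE. When a port is stuck in INIT, Open MPI reports the problem with a warning like this: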
WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them). This
is most certainly not what you wanted. Check your cables, subnet
manager configuration, etc. The openib BTL will be ignored for this
job.
Open MPI will fall back to Ethernet; you can tell by the high latency and low bandwidth:
[david@compute000 imb]$ module load openmpi-x86_64
[david@compute000 imb]$ which mpirun
/usr/lib64/openmpi/bin/mpirun
[david@compute000 imb]$ pwd
/home/david/benchmarks/imb
[david@compute000 imb]$ cat hosts
compute000
compute001
[david@compute000 imb]$ /usr/lib64/openmpi/bin/mpirun -np 2 -hostfile ./hosts /usr/lib64/openmpi/bin/mpitests-IMB-MPI1
# lots of warnings cut out
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 47.79 0.00 # <-- This is high Ethernet latency; typical 1 Gb Ethernet latency can be as low as 25 usec
1 1000 44.85 0.02
2 1000 45.24 0.04
4 1000 45.87 0.08
8 1000 44.51 0.17
16 1000 43.21 0.35
32 1000 43.76 0.70
64 1000 43.92 1.39
128 1000 43.48 2.81
256 1000 48.91 4.99
512 1000 52.95 9.22
1024 1000 96.30 10.14
2048 1000 403.23 4.84
4096 1000 262.84 14.86
8192 1000 279.54 27.95
16384 1000 333.65 46.83
32768 1000 686.98 45.49
65536 640 1364.94 45.79
131072 320 1668.31 74.93
262144 160 2683.26 93.17
524288 80 5044.39 99.12
1048576 40 9498.91 105.28
2097152 20 18256.90 109.55
4194304 10 36169.60 110.59 # <-- Typical 1 Gb Ethernet bandwidth
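To compare runs without reading the whole table, the two figures called out above (0-byte latency and peak bandwidth) can be extracted from the PingPong output. This is a rough sketch, assuming the four-column layout shown above; the function name imb_pingpong_summary is hypothetical and not part of the Intel MPI Benchmarks.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads IMB PingPong output on stdin and prints the
# 0-byte latency and the highest bandwidth seen in the table.
imb_pingpong_summary() {
    awk '
        BEGIN { lat = "n/a"; peak = 0 }
        # Data rows have four numeric columns: bytes, repetitions, t[usec],
        # Mbytes/sec. Header lines start with "#" and are skipped; anything
        # after the fourth column (such as a trailing note) is ignored.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9.]+$/ && $4 ~ /^[0-9.]+$/ {
            if ($1 == 0) lat = $3
            if ($4 + 0 > peak) peak = $4 + 0
        }
        END { printf "latency_usec=%s peak_mbytes_sec=%.2f\n", lat, peak }
    '
}
```

For example, piping the mpirun command shown above into `imb_pingpong_summary` would report `latency_usec=47.79 peak_mbytes_sec=110.59` for this run. Once the port reaches ACTIVE and the openib BTL is in use, both figures should look markedly better than these Ethernet numbers.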