Linux: Checking the InfiniBand Fabric

  • This page assumes Platform Cluster Manager 3.2 is installed; otherwise, make sure the latest version of OFED is installed, which includes Open MPI. A quick check of the prerequisites is sketched below.
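
As a quick sanity check (a sketch only; exact package names differ between OFED releases and distributions), confirm that the OFED stack and Open MPI are actually present:

# print the installed OFED version (if ofed_info is provided by your OFED install)
ofed_info -s
# check that the InfiniBand diagnostics, opensm and Open MPI packages are installed
rpm -qa | grep -iE 'openmpi|infiniband-diags|opensm'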

Check the IB links

Use the ibstatus command to show the current state of the IB link:

[david@compute000 imb]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:0030:48ff:ffff:e57d
	base lid:	 0x0
	sm lid:		 0x0
	state:		 2: INIT
	phys state:	 5: LinkUp
	rate:		 40 Gb/sec (4X QDR)
	link_layer:	 InfiniBand

In this instance we can see that the state is only INIT. This typically means that the IB link is having trouble with the subnet manager; a couple of other OFED utilities can help confirm this, as sketched below.
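
Both of the tools below come from the infiniband-diags package shipped with OFED (a sketch; sminfo usually needs to be run as root because it talks to the umad device):

# show the port state, LIDs and the SM LID as seen by this HCA
ibstat mlx4_0
# query the subnet manager directly; this fails if no SM is reachable
sminfo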

Warning from mpirun?

This will result in warnings when running MPI performance tests (check the output from the Open MPI mpirun for clues):

WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.
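
To turn this silent fallback into an explicit failure, you can restrict Open MPI to the InfiniBand BTL (a sketch for the Open MPI 1.x series shown on this page; the job will abort instead of falling back to Ethernet if the IB port is unusable):

# only allow the openib (InfiniBand) and self BTLs
mpirun --mca btl openib,self -np 2 -hostfile ./hosts /usr/lib64/openmpi/bin/mpitests-IMB-MPI1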

Check the fabric performance

Open MPI will fall back to using Ethernet; you can tell by the high latency and low bandwidth:

[david@compute000 imb]$ module load openmpi-x86_64
[david@compute000 imb]$ which mpirun
/usr/lib64/openmpi/bin/mpirun
[david@compute000 imb]$ pwd
/home/david/benchmarks/imb
[david@compute000 imb]$ cat hosts 
compute000
compute001
[david@compute000 imb]$ /usr/lib64/openmpi/bin/mpirun -np 2 -hostfile ./hosts /usr/lib64/openmpi/bin/mpitests-IMB-MPI1 
# lots of warnings cut out
#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        47.79         0.00    # <-- This is high Ethernet latency; typical 1GbE latency can be as low as 25 usec
            1         1000        44.85         0.02
            2         1000        45.24         0.04
            4         1000        45.87         0.08
            8         1000        44.51         0.17
           16         1000        43.21         0.35
           32         1000        43.76         0.70
           64         1000        43.92         1.39
          128         1000        43.48         2.81
          256         1000        48.91         4.99
          512         1000        52.95         9.22
         1024         1000        96.30        10.14
         2048         1000       403.23         4.84
         4096         1000       262.84        14.86
         8192         1000       279.54        27.95
        16384         1000       333.65        46.83
        32768         1000       686.98        45.49
        65536          640      1364.94        45.79
       131072          320      1668.31        74.93
       262144          160      2683.26        93.17
       524288           80      5044.39        99.12
      1048576           40      9498.91       105.28
      2097152           20     18256.90       109.55
      4194304           10     36169.60       110.59   # <-- Typical 1GbE bandwidth

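If you want to confirm which transport Open MPI actually chose, the BTL framework can be asked to report its selection (a sketch; the btl_base_verbose output differs between Open MPI versions):

# print which BTLs (openib, tcp, sm, self) each process selects
mpirun --mca btl_base_verbose 30 -np 2 -hostfile ./hosts /usr/lib64/openmpi/bin/mpitests-IMB-MPI1 2>&1 | grep -i btl
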
Make sure the subnet manager is running

Sometimes the subnet manager will be running on the switch; other times it will need to be started manually on one of the hosts on the IB fabric. OFED provides a utility to run a subnet manager on a host (from the opensm package):

/etc/init.d/opensmd restart
# checking the ibstatus output, we have an ACTIVE link!
[david@compute000 imb]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:0030:48ff:ffff:e57d
	base lid:	 0x1
	sm lid:		 0x1
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 40 Gb/sec (4X QDR)
	link_layer:	 InfiniBand
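
If a host is going to act as the subnet manager permanently, it is worth starting opensm at boot (a sketch for a RHEL/CentOS-style init system; run as root):

# start the opensm daemon automatically at boot on this host
chkconfig opensmd on
# confirm the daemon is currently running
service opensmd status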

OK, now things are looking much better; test the performance again:
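
This is the same IMB PingPong run as before:

[david@compute000 imb]$ /usr/lib64/openmpi/bin/mpirun -np 2 -hostfile ./hosts /usr/lib64/openmpi/bin/mpitests-IMB-MPI1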

QDR MPI Performance Figures

Here are some figures for QDR IB (FDR cards, running at QDR speed because of the switch):

#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.29         0.00
            1         1000         1.15         0.83
            2         1000         1.16         1.65
            4         1000         1.16         3.30
            8         1000         1.17         6.50
           16         1000         1.19        12.79
           32         1000         1.22        24.97
           64         1000         1.25        48.93
          128         1000         1.85        66.03
          256         1000         1.96       124.60
          512         1000         2.15       227.37
         1024         1000         2.50       390.62
         2048         1000         2.90       673.74
         4096         1000         3.68      1061.62
         8192         1000         5.36      1457.39
        16384         1000         7.81      1999.63
        32768         1000        12.21      2560.43
        65536          640        20.84      2999.41
       131072          320        38.02      3288.14
       262144          160        75.01      3332.78
       524288           80       146.31      3417.37
      1048576           40       289.19      3457.94
      2097152           20       574.40      3481.87
      4194304           10      1144.80      3494.05


Going too fast

To check that the test really is running between multiple nodes, check the latency and bandwidth achieved.

Latencies lower than 1 usec or bandwidths higher than 4000 MB/sec would suggest another issue, as these figures are better than would be expected for QDR InfiniBand. It is likely that both processes are running on the same node and communicating over shared memory rather than the fabric; one way to avoid this is sketched below.
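
One way to make sure the two ranks land on different nodes (a sketch using the Open MPI 1.x mpirun options; check mpirun --help on your version) is to limit the run to one process per node:

# place at most one MPI rank on each host listed in ./hosts
mpirun -np 2 -npernode 1 -hostfile ./hosts /usr/lib64/openmpi/bin/mpitests-IMB-MPI1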