Install and configure Intel Omni-Path (OPA) Fabric

Install the OPA Fabric Software

Check that the kernel-devel package on your build host matches the running kernel, or pull the matching version from:

wget http://mirror.centos.org/centos/7/updates/x86_64/Packages/kernel-devel-3.10.0-327.22.2.el7.x86_64.rpm
# for the Lustre kernel (from ee-3.0.0.0)
wget http://mirror.centos.org/centos/7/updates/x86_64/Packages/kernel-devel-3.10.0-327.13.1.el7.x86_64.rpm
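
A quick way to confirm the build host matches (nothing OPA-specific, just rpm and uname):

# the installed kernel-devel version should match the running kernel exactly
uname -r
rpm -q kernel-devel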

Install the IFS software

# tested on CentOS 7
yum install expect sysfsutils kernel-devel libibmad libibumad rdma libibverbs bc
yum install pciutils tcsh atlas sysfsutils infinipath-psm 
tar zxvf IntelOPA-IFS.RHEL72-x86_64.10.1.1.0.9.tgz 
cd IntelOPA-IFS.RHEL72-x86_64.10.1.1.0.9/
./INSTALL \
          -i opa_stack -i opa_stack_dev -i intel_hfi \
          -i delta_ipoib -i ibacm -i fastfabric \
          -i mvapich2_gcc_hfi -i mvapich2_intel_hfi \
          -i openmpi_gcc_hfi  -i openmpi_intel_hfi \
          -i opafm -i oftools -D opafm

# Once installed it is recommended that you reboot - NOTE: for OpenHPC/provisioned nodes, make sure re-install isn't set before rebooting.
systemctl disable srpd
reboot
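
After the reboot, a quick sanity check that the main packages landed (the package names below are from the 10.1.x IFS bundle and may differ between releases):

rpm -qa | grep -iE 'opa-fm|opa-fastfabric|hfi1|libpsm2'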

# Note: when installing into a chroot environment (such as a Bright or OpenHPC node image), bind-mount proc, dev and sys into the image first:
  mount -t proc /proc /cm/images/default-image-hfi/proc/
  mount --rbind /dev /cm/images/default-image-hfi/dev/
  mount --rbind /sys /cm/images/default-image-hfi/sys/
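
A rough sketch of the chroot install itself, assuming the IFS tarball has already been unpacked inside the image (the path and component list here are illustrative):

  chroot /cm/images/default-image-hfi /bin/bash
  cd /root/IntelOPA-IFS.RHEL72-x86_64.10.1.1.0.9
  ./INSTALL -i opa_stack -i intel_hfi -i delta_ipoib -i oftools   # trim/extend the -i list as in the full install above
  exit
  # tidy up the bind mounts afterwards (lazy unmount copes with the rbind trees)
  umount -l /cm/images/default-image-hfi/sys/
  umount -l /cm/images/default-image-hfi/dev/
  umount /cm/images/default-image-hfi/proc/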

Verify the Fabric/Adaptor

Check that the hfi1 kernel module is loaded:

[root@node001 ~]# lsmod | grep hfi
hfi1                  655730  1 
ib_mad                 47817  4 hfi1,ib_cm,ib_sa,ib_umad
ib_core                98787  14 hfi1,rdma_cm,ib_cm,ib_sa,iw_cm,xprtrdma,ib_mad,ib_ucm,ib_iser,ib_umad,ib_uverbs,ib_ipoib,ib_isert
[root@node001 ~]# modinfo hfi1
filename:       /lib/modules/3.10.0-327.22.2.el7.x86_64/updates/hfi1.ko
version:        0.11-162
description:    Intel Omni-Path Architecture driver
license:        Dual BSD/GPL
rhelversion:    7.2
srcversion:     A9F55090C67176C5B9120E1
alias:          pci:v00008086d000024F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000024F0sv*sd*bc*sc*i*
depends:        ib_core,ib_mad
vermagic:       3.10.0-327.22.2.el7.x86_64 SMP mod_unload modversions
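
The adapter should also be visible on the PCI bus; the device IDs below come from the module aliases above:

lspci -d 8086:24f0
lspci -d 8086:24f1
# or simply
lspci | grep -i omni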

Make sure the subnet manager is running

systemctl status opafm

Example output if the fabric manager is not running

[root@node001 IntelOPA-IFS.RHEL72-x86_64.10.1.1.0.9]# opainfo 
hfi1_0:1                           PortGUID:0x00117501017bfb57
   PortState:     Init (LinkUp)
   LinkSpeed      Act: 25Gb         En: 25Gb        
   LinkWidth      Act: 4            En: 4           
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 1,2,3,4     
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True 
   QSFP: PassiveCu, 2m   Hitachi Metals    P/N IQSFP26C-20       Rev 02
   Xmit Data:                  0 MB Pkts:                    0
   Recv Data:                  0 MB Pkts:                    0
   Link Quality: 5 (Excellent)
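
If the port is stuck in Init like this, enable and start the fabric manager on the node that should run it (it was installed with -D opafm above, so it should start on boot thereafter):

systemctl enable opafm
systemctl start opafm
systemctl status opafm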

With the subnet manager running, the link goes from Init to Active:

[root@node001 IntelOPA-IFS.RHEL72-x86_64.10.1.1.0.9]# opainfo 
hfi1_0:1                           PortGID:0xfe80000000000000:00117501017bfb57
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb        
   LinkWidth      Act: 4            En: 4           
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4         
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True 
   LID: 0x00000001-0x00000001       SM LID: 0x00000001 SL: 0 
   QSFP: PassiveCu, 2m   Hitachi Metals    P/N IQSFP26C-20       Rev 02
   Xmit Data:                  1 MB Pkts:                 4355
   Recv Data:                  1 MB Pkts:                 4472
   Link Quality: 5 (Excellent)

Check the fabric details

[root@node001 ~]# opafabricinfo 
Fabric 0:0 Information:
SM: node001 hfi1_0 Guid: 0x00117501017bfb57 State: Master
Number of HFIs: 51
Number of Switches: 2
Number of Links: 67
Number of HFI Links: 51             (Internal: 0   External: 51)
Number of ISLs: 16                  (Internal: 0   External: 16)
Number of Degraded Links: 1         (HFI Links: 0   ISLs: 1)
Number of Omitted Links: 0          (HFI Links: 0   ISLs: 0)
-------------------------------------------------------------------------------
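
The example above shows one degraded ISL; the FastFabric opareport tool is a reasonable starting point for tracking it down:

# links running below their enabled width/speed
opareport -o slowlinks
# per-port error counters across the fabric
opareport -o errors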

Performance Tests

Quick steps to verify performance; these assume passwordless SSH access between the hosts.
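
If passwordless SSH is not already set up between the nodes, a minimal sketch (hostnames and key type are just examples):

ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for h in node001 node002; do ssh-copy-id root@$h; done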

# Note: set the CPU frequency governor to performance on all nodes
# cpupower frequency-set --governor performance
source /usr/mpi/gcc/mvapich2-*-hfi/bin/mpivars.sh
cd /usr/mpi/gcc/mvapich2-*-hfi/tests/osu_benchmarks-*
# verify latency 
mpirun -hosts node001,node002 ./osu_latency
# verify bandwidth 
mpirun -hosts node001,node002 ./osu_bw
# deviation 
cd /usr/mpi/gcc/mvapich2-*-hfi/tests/intel
seq -f 'node0%02.0f' 1 16 > /tmp/mpi_hosts
mpirun -hostfile /tmp/mpi_hosts ./deviation

OSU Latency

[root@node001 osu_benchmarks-3.1.1]# mpirun -hosts node001,node002 ./osu_latency
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0                         1.02
1                         1.00
2                         0.99
4                         0.97
8                         0.97
16                        1.09
32                        1.09
64                        1.09
128                       1.10
256                       1.14
512                       1.22
1024                      1.33
2048                      1.57
4096                      1.99
8192                      3.12
16384                     5.78
32768                     7.73
65536                    14.27
131072                   20.81
262144                   31.25
524288                   51.42
1048576                  94.65
2097152                 178.55
4194304                 347.68

OSU Bandwidth

[root@node001 osu_benchmarks-3.1.1]# mpirun -hosts node001,node002 ./osu_bw 
# OSU MPI Bandwidth Test v3.1.1
# Size        Bandwidth (MB/s)
1                         3.03
2                         6.36
4                        12.66
8                        26.04
16                       47.82
32                       95.72
64                      191.98
128                     379.83
256                     715.83
512                    1393.06
1024                   2484.21
2048                   4106.95
4096                   6075.32
8192                   7772.87
16384                  8021.97
32768                 10206.65
65536                 11830.66
131072                12121.88
262144                12240.60
524288                12324.96
1048576               12366.55
2097152               12378.66
4194304               12382.88

Some other quick tests:

  # openmpi
  module load openmpi/gcc/64/1.10.0-hfi
  mpicc test-mpi.c -o test-mpi-ompi_intel
  mpirun -np 2 -H compute023,compute024 ./test-mpi-ompi_intel 

  # intelmpi
  module load intel/mpi/64/5.1.3/2016.4.258
  # rebuild the test with Intel MPI's mpicc before running it under Intel MPI's mpirun
  mpicc test-mpi.c -o test-mpi-ompi_intel
  vi hostsfile    # one hostname per line
  mpirun -np 2 -hostfile ./hostsfile -perhost 1 ./test-mpi-ompi_intel
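
The test-mpi.c used above is not included on this page; a minimal MPI hello-world along these lines would serve (written out with a heredoc so it can be pasted straight into a shell):

cat > test-mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
EOF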

Intel Deviation Tests

[root@node001 intel]# mpirun -hosts node001,node002,node003,node005 ./deviation 

Trial runs of 4 hosts are being performed to find
the best host since no baseline host was specified.

Baseline host is node001.plymouth.net (0)

Running Sequential MPI Latency Tests - Pairs 3   Testing     3
Running Sequential MPI Bandwidth Tests - Pairs 3   Testing     3

Sequential MPI Performance Test Results
  Latency Summary:
    Min: 0.98 usec, Max: 1.10 usec, Avg: 1.05 usec
    Range: +12.3% of Min, Worst: +4.4% of Avg
    Cfg: Tolerance: +50% of Avg, Delta: 0.80 usec, Threshold: 1.85 usec
         Message Size: 0, Loops: 4000

  Bandwidth Summary:
    Min: 12318.9 MB/s, Max: 12375.5 MB/s, Avg: 12341.2 MB/s
    Range: -0.5% of Max, Worst: -0.2% of Avg
    Cfg: Tolerance: -20% of Avg, Delta: 150.0 MB/s, Threshold: 9873.0 MB/s
         Message Size: 2097152, Loops: 30 BiDir: no

Latency: PASSED
Bandwidth: PASSED