Mellanox:training


Revision as of 16:08, 7 April 2014

Check the card is detected

[root@nodeA ~]# lspci | grep -i Mellanox
06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
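
When checking many nodes it can help to pull out just the PCI addresses of the Mellanox devices. The following is a minimal sketch; the function name list_mellanox_devices is made up for illustration and is not part of any Mellanox tool. It reads lspci output on standard input so it can be tried on saved output as well.

```shell
#!/bin/sh
# list_mellanox_devices (hypothetical helper): read `lspci` output on stdin
# and print the PCI address (first field) of every line mentioning Mellanox.
list_mellanox_devices() {
    awk 'tolower($0) ~ /mellanox/ { print $1 }'
}

# Usage on a real node:
#   lspci | list_mellanox_devices
```

An empty result suggests the card is not seated properly or not visible on the PCI bus.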


Mellanox OFED

Don't change the kernel. MLNX_OFED is built for the default version of the kernel.

If the kernel changes, MLNX_OFED must be rebuilt for the running kernel.


Installation

There are a number of options available. To see them all run:

./mlnxofedinstall --l

Install the prerequisite packages, then run the installer:

yum install tcl tk libnl-devel gcc-gfortran
./mlnxofedinstall

It will try to update the firmware at the end of the install:

Device #1:
----------

  Device:        0000:06:00.0
  Part Number:
  Description:
  PSID:          MT_1060110019

  Versions:      Current        Available
     FW          2.10.0000      N/A

  Status:        No matching image found


Restart the driver

Either reboot the node or run:

/etc/init.d/openibd restart


Check the state

[root@nodeB MLNX_OFED_LINUX-2.1-1.0.6-rhel6.5-x86_64]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0030:48ff:ffff:e535
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      4: PortConfigurationTraining
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand
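
The two fields that matter most in this output are state and phys state. A small sketch that reduces ibstatus output to one line per port is shown here; the function name port_states is an assumption for illustration, and the parsing relies only on the output layout shown above.

```shell
#!/bin/sh
# port_states (hypothetical helper): read `ibstatus` output on stdin and
# print "device port state phys_state" for each port.
port_states() {
    awk '
        /^Infiniband device/ {
            dev = $3
            gsub(/[^A-Za-z0-9_]/, "", dev)   # strip the quotes around the name
            port = $5
        }
        $1 == "state:" { state = $3 }
        $1 == "phys" && $2 == "state:" { print dev, port, state, $4 }
    '
}

# Usage on a real node:
#   ibstatus | port_states
```

A port that reports DOWN here is expected until a subnet manager is running on the fabric.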

Start the Subnet manager

The subnet manager must be running somewhere on the fabric: on the switch, on a node, or as a service.

Start it on the switch:

IB SM Management
Base SM
SM enable
apply

The state of the connection will become Active and a LID will be assigned. The ibstat output then looks like this:

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.10.0
        Hardware version: 0
        Node GUID: 0x003048ffffffe534
        System image GUID: 0x003048ffffffe537
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x02514868
                Port GUID: 0x003048ffffffe535
                Link layer: InfiniBand

Each link has its own GUID, which is roughly the equivalent of a MAC address. GUIDs should be unique to every device unless someone has been tampering with them.
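
To check the uniqueness claim across several nodes, the Port GUID lines can be collected and compared. This is a rough sketch that assumes the ibstat output of each node has been gathered into one stream; port_guids and duplicate_guids are illustrative names, not existing tools.

```shell
#!/bin/sh
# port_guids (hypothetical helper): read `ibstat` output on stdin and
# print the value of every "Port GUID:" line.
port_guids() {
    awk '$1 == "Port" && $2 == "GUID:" { print $3 }'
}

# duplicate_guids (hypothetical helper): read one GUID per line and print
# any GUID that appears more than once, ignoring hex letter case.
duplicate_guids() {
    tr 'A-F' 'a-f' | LC_ALL=C sort | uniq -d
}

# Usage, assuming ibstat output from all nodes was saved to all_nodes.txt:
#   port_guids < all_nodes.txt | duplicate_guids
```

No output means no duplicate was found in the collected data.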

Communication is based on the LID.


Subnet manager

Only one subnet manager needs to be running. Any extra instances act as standbys and take over if the running one fails. If there are multiple backups, an election decides which one takes over.

The subnet manager assigns the LIDs and builds the routing table. This can take a while depending on how complicated the topology is.

If the SM is running on the switch it can be managed under the IB SM MGMT tab.
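
The election mentioned above can be illustrated with a small sketch. It assumes the usual InfiniBand rule that the instance with the highest priority wins and that ties go to the lowest GUID; the input format ("priority guid" per line), the function name elect_master and the GUIDs in the usage comment are all made up for illustration.

```shell
#!/bin/sh
# elect_master (hypothetical helper): read "priority guid" lines on stdin
# and print the GUID that would win under the assumed rule:
# highest priority first, lowest GUID as the tie-breaker.
# GUIDs are expected as equal-length hex strings in the same letter case,
# so a plain byte-wise sort orders them numerically.
elect_master() {
    LC_ALL=C sort -k1,1nr -k2,2 | head -n 1 | awk '{ print $2 }'
}

# Usage with invented values:
#   printf '%s\n' '0 0x0002c90300a1b2c1' '15 0x0002c90300a1b2c3' | elect_master
```

This only models the decision. It does not query a live fabric.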


Testing

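Again there are numerous options, but they must be the same on both sides. Run the command with no address on one node (the server), then run it on the other node (the client) with any IP address of the server.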

ib_read_bw <any ip on system>
[root@nodeB MLNX_OFED_LINUX-2.1-1.0.6-rhel6.5-x86_64]# ib_read_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 2048[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0058 PSN 0xfd98dc OUT 0x10 RKey 0x001900 VAddr 0x007f247ebd0000
 remote address: LID 0x03 QPN 0x0058 PSN 0xcdbe1f OUT 0x10 RKey 0x001900 VAddr 0x007f6012610000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000           6041.81            6037.05              0.096593
---------------------------------------------------------------------------------------
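
To compare the result against the link rate reported earlier, the average bandwidth can be pulled out of the table and converted to Gb/sec. The sketch below makes one assumption that should be checked against the installed perftest version: that MB in the report means 2^20 bytes. The function names bw_results and mb_to_gbit are illustrative only, and the bytes-per-MB value can be overridden with a second argument.

```shell
#!/bin/sh
# bw_results (hypothetical helper): read ib_read_bw output on stdin and
# print "message_size average_MB_per_sec" for each result row.
# A result row is taken to be a line of five fields starting with two integers.
bw_results() {
    awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $1, $4 }'
}

# mb_to_gbit (hypothetical helper): convert MB/sec to Gb/sec.
#   $1 = bandwidth in MB/sec
#   $2 = bytes per MB (assumed 1048576 by default; pass 1000000 if needed)
mb_to_gbit() {
    awk -v mb="$1" -v bytes="${2:-1048576}" \
        'BEGIN { printf "%.2f\n", mb * bytes * 8 / 1e9 }'
}

# Usage with saved output:
#   bw_results < ib_read_bw.log
#   mb_to_gbit 6037.05
```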


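Run ib_read_bw -a -b to test all message sizes in both directions. If it prints warnings, stop the cpuspeed service and set the CPU to maximum performance in the BIOS. The result should be around 14.


ib_read_lat -a shows latencies for all message sizes.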


Bad performance?

Verify the fabric using the Mellanox tools. ibdiagnet should be version 2 or higher.

ibdiagnet 


# clear the counters
ibdiagnet -pc
# run the check
ibdiagnet -P all=1
Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                 0          0
-I- Lids Check                0          0
-I- Links Check               0          0
-I- Subnet Manager            0          0
-I- Port Counters             2          0
-I- Nodes Information         0          2
-I- Speed / Width checks      0          0
-I- Partition Keys            0          0
-I- Alias GUIDs               0          0


vim /var/tmp/ibdiagnet2/ibdiagnet2.log
vim /var/tmp/ibdiagnet2/ibdiagnet2.pm
ibdiagnet -P all=1 --ber_test --pm_pause_time 30
ibdiagnet -P all=1 --get_cable_info
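
On a large fabric it is easier to list only the stages of the summary that reported something. The following sketch filters the summary table shown above; it assumes the Comment column is empty, as it is in that sample, and the function name failed_stages is made up for illustration.

```shell
#!/bin/sh
# failed_stages (hypothetical helper): read the ibdiagnet summary on stdin
# and print every stage whose warning or error count is non-zero.
# Assumes each stage line ends with the two counters (empty Comment column).
failed_stages() {
    awk '
        $1 == "-I-" && NF >= 4 && $(NF-1) ~ /^[0-9]+$/ && $NF ~ /^[0-9]+$/ {
            if ($(NF-1) + 0 > 0 || $NF + 0 > 0) {
                name = $2
                for (i = 3; i <= NF - 2; i++) name = name " " $i
                print name ": warnings=" $(NF-1) " errors=" $NF
            }
        }
    '
}

# Usage on a real node:
#   ibdiagnet -P all=1 | failed_stages
```

The details behind each reported stage are in the log files under /var/tmp/ibdiagnet2.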


Firmware updates

mst start
mst status


[root@nodeB MLNX_OFED_LINUX-2.1-1.0.6-rhel6.5-x86_64]# mst status
MST modules:
------------
    MST PCI module loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4099_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:06:00.0 addr.reg=88 data.reg=92
                                   Chip revision is: 01
/dev/mst/mt4099_pci_cr0          - PCI direct access.
                                   domain:bus:dev.fn=0000:06:00.0 bar=0xdf900000 size=0x100000
                                   Chip revision is: 01
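
The device paths in this output are what flint expects after -d. A minimal sketch for extracting them is shown here; the function name mst_devices is an assumption for illustration, and the parsing relies only on the layout shown above.

```shell
#!/bin/sh
# mst_devices (hypothetical helper): read `mst status` output on stdin
# and print every /dev/mst/... device path, one per line.
mst_devices() {
    awk '$1 ~ /^\/dev\/mst\// { print $1 }'
}

# Usage on a real node, for example to query the firmware on each device:
#   mst status | mst_devices | while read -r dev; do
#       flint -d "$dev" query
#   done
```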


flint -d <dev> -i <file> burn


OEM firmware

On mellanox.com, go to Support, then OEM firmware, then Supermicro.