Lustre: General steps for debugging Lustre (IEEL) problems
In this situation we had the following Lustre setup:
- 2x MDS nodes in an HA configuration
- 4x OSS nodes not in an HA configuration (direct-attached storage)
Verify Network Connectivity
- Can the systems ping one another? (a quick check for this is sketched below)
- Can the client ping all of the Lustre nodes?
- Check LNET (the Lustre networking layer) on all nodes as well, to confirm that Lustre-level networking is working.
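A quick reachability sweep from any node might look like the following (the IPs are just this example's MDS nodes; substitute your own MDS, OSS and client addresses):
# basic ICMP check before looking at LNET itself
for ip in 10.10.17.193 10.10.17.194; do ping -c1 -W2 $ip >/dev/null && echo "$ip reachable" || echo "$ip NOT reachable"; done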
# check the IPs are reported correctly on each node
[root@lustre01-mds1 ~]# lctl list_nids
10.10.17.193@tcp
[root@lustre02-mds1 ~]# lctl list_nids
10.10.17.194@tcp
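# If list_nids prints nothing, LNET is probably not loaded or configured on that node.
# A minimal check/bring-up sketch, assuming a single TCP LNET network is declared in the module
# options (e.g. a line like: options lnet networks="tcp0(eth0)" - the interface is an assumption):
cat /etc/modprobe.d/lustre.conf
modprobe lnet
lctl network up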
# Can we ping through LNET/lctl
[root@lustre02-mds1 ~]# lctl ping 10.10.17.194
12345-0@lo
12345-10.10.17.194@tcp
[root@lustre02-mds1 ~]# lctl ping 10.10.17.195
failed to ping 10.10.17.195@tcp: Input/output error
# note .195 doesn't exist on the fabric, so the above is just to demonstrate the output to expect
Check the disks / arrays are reported and mounted
- Verify the RAID arrays are being reported correctly and are healthy (using the LSI StorCLI utility)
- Depending on where StorCLI was installed and whether it's set up in your $PATH, the commands below may need to be updated (a shortcut for the long path is sketched after this list).
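To save typing the full install path each time, one option (assuming the install location used in this article) is to stash it in a shell variable:
# convenience variable for the StorCLI binary; the path is this article's install location
STORCLI="/usr/local/MegaRAID Storage Manager/StorCLI/storcli64"
"$STORCLI" /c0 show all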
# check everything the controller reports. (LOT of output)
/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 show all
# check the drives and their status
/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 /eall /sall show
# Note
# their state should be ONLINE
#
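# a quick filter for anything that is not healthy (the state abbreviations below are typical
# StorCLI output such as UBad/UGood/Offln/Rbld; adjust if your version reports them differently)
/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 /eall /sall show | grep -Ei 'ubad|ugood|offln|rbld'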
# check if there are any rebuilds in place
/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 /eall /sall show rebuild
- Verify that the MDT / MGT and OST are all mounted on the systems
[root@lustre01-mds1 ~]# df -h | grep lustre
/dev/sda 9.5G 24M 9.0G 1% /lustre/mgt
/dev/sdc 1.3T 92M 1.2T 1% /lustre/lfs2-mdt
/dev/sdb 1.3T 92M 1.2T 1% /lustre/lfs1-mdt
[root@lustre02-oss1 ~]# df -h | grep lustre
/dev/sdb 59T 27G 56T 1% /lustre/lfs2-ost00
[root@lustre02-oss2 ~]# df -h | grep lustre
/dev/sdb 59T 31G 56T 1% /lustre/lfs2-ost01
Example process for replacing drives
- In this scenario we ended up with 4x UBad drives and 1x UGood drive, which was a replacement drive that had been inserted. (If a disk is improperly removed and then re-attached to the RAID controller, it will be recognised as UBad (Unconfigured Bad). This does not necessarily mean the drive itself is bad; it means its configuration state is bad, or possibly both. Re-attaching a disk that is new or was previously working should have no negative effect, but before using it you need to change its state to good.)
# get the IDs of the UBad drives. IDs are reported as enclosure:slot, e.g. 4:16 is enclosure 4, slot 16.
[root@lustre02-oss2 ~]# /usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 /eall /sall show | grep UBad
4:16 21 UBad - 3.637 TB SATA HDD N N 512B HGST HUS724040ALA640 U
4:17 22 UBad - 3.637 TB SATA HDD N N 512B HGST HUS724040ALA640 U
4:18 23 UBad - 3.637 TB SATA HDD N N 512B HGST HUS724040ALA640 U
4:19 24 UBad - 3.637 TB SATA HDD N N 512B HGST HUS724040ALA640 U
# The enclosure above is 4 and slots 16-19. So we set the disks to GOOD using /e4 and /s16 etc
[root@lustre02-oss2 ~]# /usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 /e4 /s16 set good
Controller = 0
Status = Success
Description = Set Drive Good Succeeded.
# repeat for other disks
storcli64 /c0 /e4 /s17 set good
storcli64 /c0 /e4 /s18 set good
storcli64 /c0 /e4 /s19 set good
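The same thing can be scripted rather than typed slot by slot; a small sketch using the enclosure and slot numbers from this example:
# set every affected slot back to good in one pass (enclosure 4, slots 16-19 as in this example)
for s in 16 17 18 19; do
    /usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 /e4 /s$s set good
done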