Lustre: Problems with df on lustre clients

df hangs because an OST is not accessible

  • This occurs when OSTs are offline or inactive (in this instance they had been added through IML and then removed, but not removed fully). The output below shows the symptom; a quick check of the client's view of the targets follows it.
[root@hyalite ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrefs-MDT0000_UUID        1.2T       14.6G        1.1T   1% /mnt/lustrefs[MDT:0]
lustrefs-OST0000_UUID       36.4T        7.5T       27.0T  22% /mnt/lustrefs[OST:0]
lustrefs-OST0001_UUID       36.4T        8.2T       26.4T  24% /mnt/lustrefs[OST:1]
lustrefs-OST0002_UUID       36.4T        7.2T       27.3T  21% /mnt/lustrefs[OST:2]
lustrefs-OST0003_UUID       36.4T        8.0T       26.5T  23% /mnt/lustrefs[OST:3]
lustrefs-OST0004_UUID       36.4T        6.8T       27.8T  20% /mnt/lustrefs[OST:4]
lustrefs-OST0005_UUID       36.4T        6.6T       28.0T  19% /mnt/lustrefs[OST:5]
lustrefs-OST0006_UUID       36.4T        5.1T       29.4T  15% /mnt/lustrefs[OST:6]
lustrefs-OST0007_UUID       36.4T        5.8T       28.7T  17% /mnt/lustrefs[OST:7]
lustrefs-OST0008_UUID       54.6T      168.5G       51.7T   0% /mnt/lustrefs[OST:8]
lustrefs-OST0009_UUID       54.6T      146.9G       51.7T   0% /mnt/lustrefs[OST:9]
OST000a             : inactive device
OST000b             : inactive device

filesystem summary:       400.1T       55.6T      324.4T  15% /mnt/lustrefs
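
Before changing anything, it can help to confirm which targets the client itself still lists and which it considers active. A quick check along these lines (the filesystem name lustrefs and mount point /mnt/lustrefs match the output above; adjust for your site):
<syntaxhighlight>
# List the OSTs the client knows about and whether each is ACTIVE or INACTIVE
lfs osts /mnt/lustrefs

# Show the per-OSC "active" flag on this client (1 = active, 0 = deactivated)
lctl get_param osc.lustrefs-OST*.active
</syntaxhighlight>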

  • This is caused by the inactive devices. To correct it, deactivate the stale OSCs on the client (a combined sketch for deactivating several targets at once follows the df output below):
<syntaxhighlight>
lctl set_param osc.lustrefs-OST000a-*.active=0
lctl set_param osc.lustrefs-OST000b-*.active=0
</syntaxhighlight>
  • After running these commands, df worked again:
[root@hyalite ~]#  lctl set_param osc.lustrefs-OST000a-*.active=0
osc.lustrefs-OST000a-osc-ffff881070ee7000.active=0
[root@hyalite ~]#  lctl set_param osc.lustrefs-OST000b-*.active=0
osc.lustrefs-OST000b-osc-ffff881070ee7000.active=0
[root@hyalite ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md126            867G  365G  458G  45% /
tmpfs                  32G   76K   32G   1% /dev/shm
/dev/md127            496M   27M  444M   6% /boot
/dev/md125            7.9G  152M  7.4G   2% /tmp
/dev/md123             16G  3.1G   12G  21% /var
/dev/md122            9.9G  501M  8.9G   6% /var/lib/mysql/cmdaemon_mon
172.23.19.42@tcp1:172.23.19.41@tcp1:/lustrefs
                      401T   56T  325T  15% /mnt/lustrefs
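
Since more than one stale target can be left behind, a small loop saves retyping. A minimal sketch, assuming the stale targets are OST000a and OST000b as above (lctl set_param changes like this are per-client and do not survive a remount):
<syntaxhighlight>
# Deactivate the OSCs for the removed targets so statfs/df stops waiting on them.
# The OST indices here are the two stale ones from the lfs df output above;
# adjust the list to match your own removed targets.
for ost in OST000a OST000b; do
    lctl set_param osc.lustrefs-${ost}-*.active=0
done

# df should now return without hanging
df -h /mnt/lustrefs
</syntaxhighlight>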

Another useful command for checking the devices the client knows about is lctl dl; the OSC devices for the removed targets still appear in this listing (a quick cross-check follows it):

[root@hyalite ~]# lctl dl 
  0 UP mgc MGC172.23.19.42@tcp1 0d48eca7-fb5f-d53f-3bee-e6b1a6745dcc 5
  1 UP lov lustrefs-clilov-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 4
  2 UP lmv lustrefs-clilmv-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 4
  3 UP mdc lustrefs-MDT0000-mdc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
  4 UP osc lustrefs-OST0000-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
  5 UP osc lustrefs-OST0002-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
  6 UP osc lustrefs-OST0003-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
  7 UP osc lustrefs-OST0001-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
  8 UP osc lustrefs-OST0005-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
  9 UP osc lustrefs-OST0007-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
 10 UP osc lustrefs-OST0004-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
 11 UP osc lustrefs-OST0006-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
 12 UP osc lustrefs-OST0009-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
 13 UP osc lustrefs-OST0008-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
 14 UP osc lustrefs-OST000a-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
 15 UP osc lustrefs-OST000b-osc-ffff881070ee7000 32bcb3c7-3977-99f9-f3f4-0c1914ccec79 5
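
Note that devices 14 and 15 above are the OSCs for the removed targets OST000a and OST000b. A quick way to cross-check just those entries and look at their import status (target names are the ones from this example; adjust for yours):
<syntaxhighlight>
# Show only the OSC devices for the removed targets
lctl dl | grep -iE 'OST000a|OST000b'

# The import status of a single OSC gives more detail (look for the "state:" line)
lctl get_param osc.lustrefs-OST000a-*.import
</syntaxhighlight>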

ptlrpcd_rcv looping at 100% CPU

  • Seems related to: https://jira.hpdd.intel.com/browse/LU-5787
  • Drop caches to resolve. *NOTE* Only do this when no users are running jobs - something weird happened when we did this on a compute node while a user job was running. Clear the jobs first, then drop caches (a combined sketch follows the commands below):
<syntaxhighlight>
lctl set_param ldlm.namespaces.*.lru_size=clear
</syntaxhighlight>
or
<syntaxhighlight>
echo 1 > /proc/sys/vm/drop_caches
</syntaxhighlight>
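
A minimal combined sketch of the checks and the two commands above, in the order we used them; run it only on a node with no user jobs, as noted (the process checks here are illustrative, not a complete job-detection method):
<syntaxhighlight>
# Confirm that ptlrpcd_rcv is the thread pegging a CPU (top/htop show it as well)
top -b -n 1 | grep -i ptlrpcd

# Make sure no user jobs are still running on the node before touching caches
ps -eo user,pid,pcpu,comm --sort=-pcpu | head -20

# Flush the client's LDLM lock LRU first (the less disruptive option),
# then fall back to dropping the page cache if the thread keeps spinning
lctl set_param ldlm.namespaces.*.lru_size=clear
echo 1 > /proc/sys/vm/drop_caches
</syntaxhighlight>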