Lots of Rogue Process - ldlm bl FIX

From Define Wiki
Revision as of 10:00, 10 March 2016 by David

See: https://jira.hpdd.intel.com/browse/BOS-27

Approximately 15 of their nodes appear to have crashed. Most have since recovered and everything is running normally, but there is a persistent trend on all of the nodes that crashed: each one is running many "ldlm_bl_*" processes.
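To put a number on the symptom, the ldlm_bl threads on a node can be counted from the process list. This is a minimal sketch, not part of the original ticket: the function name count_ldlm_bl is hypothetical, and it assumes the kernel threads show up in ps with command names beginning "ldlm_bl_".

```shell
# Hypothetical helper: count process names on stdin that start with "ldlm_bl_".
# Reads one command name per line and prints the number of matches.
count_ldlm_bl() {
    # grep -c prints the match count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function from failing under "set -e".
    grep -c '^ldlm_bl_' || true
}

# Example usage on a client node (assumes procps-style ps):
#   ps -e -o comm= | count_ldlm_bl
```

Running this on a node that crashed and on one that did not gives a quick comparison of how many threads each is carrying.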

The ldlm_bl thread problem looks like LU-7330. The fix has already landed on a number of branches, but there does not appear to be an EE 2.3 release that contains the patch. There is a workaround, however. On the clients, add the following line to /etc/modprobe.d/lustre.conf:

 options ptlrpc ldlm_num_threads=16

This limits the number of ldlm_bl service threads started on the client, which does not need nearly as many of these threads as the servers do. This should resolve the crashing on the clients. If not, please provide further information about the problem being hit, such as the serial console logs with the stack trace from the client when it is crashing.
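When rolling the workaround out to many clients, it helps to add the option only if it is not already present, so the step can be repeated safely. The following is a minimal sketch under that assumption; the function name ensure_ldlm_option is hypothetical, and the sysfs path in the closing comment is an assumption about where the loaded value is exposed, not something stated in the ticket.

```shell
# Hypothetical helper: append the workaround option to a modprobe config file
# only when that exact line is not already there.
ensure_ldlm_option() {
    conf_file="$1"
    line='options ptlrpc ldlm_num_threads=16'
    # -q quiet, -x match the whole line, -F treat the pattern as a fixed string.
    # Errors are silenced so a missing file is treated as "line not present".
    if ! grep -qxF "$line" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf_file"
    fi
}

# Example usage (as root on a client):
#   ensure_ldlm_option /etc/modprobe.d/lustre.conf
#
# Assumption: once the module has been reloaded, the active value may be
# readable at /sys/module/ptlrpc/parameters/ldlm_num_threads
```

Options in /etc/modprobe.d are read when a module is loaded, so the new limit only applies after the ptlrpc module is next loaded, for example after a reboot of the client.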