Check disk failure and send alert

When no disks have failed, the output of storcli is:

[root@disk-test-node1 ~]# storcli64 /c0/vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
-------------------------------------------------------------
0/0   RAID6 Optl  RW     No      RWBD  -   ON  1.063 TB      
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
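
For scripting, only the DG/VD identifier and the State column of this table matter. A minimal sketch for pulling just those two fields out of the output (the field positions are taken from the sample above and may shift with other storcli versions):

# print the DG/VD identifier and its state for every virtual drive
storcli64 /c0/vall show | awk '/RAID[0-9]/ {print $1, $3}'
# prints: 0/0 Optl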


When a disk failure occurs, the output is:

[root@disk-test-node1 ~]# storcli64 /c0/vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
-------------------------------------------------------------
0/0   RAID6 Pdgd  RW     No      RWBD  -   ON  1.063 TB      
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
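
The degraded states can also be matched directly, using the abbreviations from the legend above (a sketch; -i is used because the capitalisation of the abbreviations can vary between storcli versions):

# list any virtual drive that is Partially Degraded, Degraded or OffLine
storcli64 /c0/vall show | grep -Ei ' (Pdgd|Dgrd|OfLn) '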

To check whether any disk failure has occurred, we can grep for virtual drives that are still in the Optl (Optimal) state; a degraded drive drops out of this output:

[root@disk-test-node1 ~]# storcli64 /c0/vall show |grep '\ Optl\ '
0/0   RAID6 Optl  RW     No      RWBD  -   ON  1.063 TB
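
For a one-off manual check, the exit status of grep can be used directly. This is a sketch that is only reliable with a single virtual drive; with several, compare counts as the script at the end of this page does:

# grep exits non-zero when no virtual drive is in the Optl state,
# so this prints DEGRADED for the Pdgd output shown earlier
storcli64 /c0/vall show | grep -q ' Optl ' || echo "DEGRADED"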


To add the health check in Bright:

1- Add the healthcheck:

# cmsh
% monitoring healthchecks
% add <healthcheck_name>
% set command <path_to_your_script>
% commit
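
A filled-in example of step 1, assuming the script at the bottom of this page has been saved as /cm/local/apps/cmd/scripts/healthchecks/raidcheck on the nodes to be checked (both the name raidcheck and the path are illustrative choices, not requirements):

# cmsh
% monitoring healthchecks
% add raidcheck
% set command /cm/local/apps/cmd/scripts/healthchecks/raidcheck
% commit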

2- Configure the healthcheck:

% monitoring setup healthconf <category_name>
% add <healthcheck_name>
% set checkinterval <interval>
% commit
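
And a filled-in example of step 2, assuming the nodes sit in a category called default and a 300-second check interval (again, both values are illustrative):

% monitoring setup healthconf default
% add raidcheck
% set checkinterval 300
% commit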

You can then add a fail action that runs when the healthcheck fails, such as sending an email alert or powering the node off. More information about metrics and monitoring in Bright can be found in chapter 9 of the Bright 7.0 administrator manual.

We can use this script to check the RAID:

#!/bin/sh

# Count all virtual drives and those that are still in the Optimal (Optl) state.
allVD=`storcli64 /call/vall show | grep 'RAID[0-9]' | wc -l`
optVD=`storcli64 /call/vall show | grep ' Optl ' | wc -l`
logPath="/tmp/raidCheck.log"
hostname=`hostname`

if [ "$optVD" -eq "$allVD" ]; then
	# Every virtual drive is Optimal: report PASS to the Bright healthcheck.
	echo PASS
else
	# At least one virtual drive is not Optimal: report FAIL, dump the
	# per-drive details and mail them as an attachment.
	echo FAIL
	storcli64 /call/eall/sall show > "$logPath"
	mail -a "$logPath" -s "RAID ERROR! at $hostname" pol.llovet@gmail.com,HPC@boston.co.uk < /dev/null
fi
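
To try the script by hand before wiring it into Bright, make it executable and run it on a node that has the controller (the path is the same illustrative one used in the cmsh example above):

chmod +x /cm/local/apps/cmd/scripts/healthchecks/raidcheck
/cm/local/apps/cmd/scripts/healthchecks/raidcheck
# prints PASS while every virtual drive is Optl; prints FAIL and sends the mail otherwise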