Check disk failure and send alert

When no disks have failed, the output of storcli is:

[root@disk-test-node1 ~]# storcli64 /c0/vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
-------------------------------------------------------------
0/0   RAID6 Optl  RW     No      RWBD  -   ON  1.063 TB      
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
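
For scripting, only the DG/VD identifier and the State column of this table matter. A minimal sketch for pulling just those two fields out of the output (the field positions are taken from the sample above and may shift with other storcli versions):

# print the DG/VD identifier and its state for every virtual drive
storcli64 /c0/vall show | awk '/RAID[0-9]/ {print $1, $3}'
# prints: 0/0 Optl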


When a disk failure occurs, the output is:

[root@disk-test-node1 ~]# storcli64 /c0/vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
-------------------------------------------------------------
0/0   RAID6 Pdgd  RW     No      RWBD  -   ON  1.063 TB      
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
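
The degraded states can also be matched directly, using the abbreviations from the legend above (a sketch; -i is used because the capitalisation of the abbreviations can vary between storcli versions):

# list any virtual drive that is Partially Degraded, Degraded or OffLine
storcli64 /c0/vall show | grep -Ei ' (Pdgd|Dgrd|OfLn) '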

To check whether any disk failure has occurred, we can grep for virtual drives that are still in the Optl (Optimal) state; a degraded drive drops out of this output:

[root@disk-test-node1 ~]# storcli64 /c0/vall show |grep '\ Optl\ '
0/0   RAID6 Optl  RW     No      RWBD  -   ON  1.063 TB
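
For a one-off manual check, the exit status of grep can be used directly. This is a sketch that is only reliable with a single virtual drive; with several, compare counts as the script at the end of this page does:

# grep exits non-zero when no virtual drive is in the Optl state,
# so this prints DEGRADED for the Pdgd output shown earlier
storcli64 /c0/vall show | grep -q ' Optl ' || echo "DEGRADED"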


To add the health check in Bright:

1- Add the healthcheck:

# cmsh
% monitoring healthchecks
% add <healthcheck_name>
% set command <path_to_your_script>
% commit
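
A filled-in example of step 1, assuming the script at the bottom of this page has been saved as /cm/local/apps/cmd/scripts/healthchecks/raidcheck on the nodes to be checked (both the name raidcheck and the path are illustrative choices, not requirements):

# cmsh
% monitoring healthchecks
% add raidcheck
% set command /cm/local/apps/cmd/scripts/healthchecks/raidcheck
% commit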

2- Configure the healthcheck:

% monitoring setup healthconf <category_name>
% add <healthcheck_name>
% set checkinterval <interval>
% commit
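
And a filled-in example of step 2, assuming the nodes sit in a category called default and a 300-second check interval (again, both values are illustrative):

% monitoring setup healthconf default
% add raidcheck
% set checkinterval 300
% commit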

You can then add a fail action that runs when the healthcheck fails, such as sending an email alert or powering the node off. More information about metrics and monitoring in Bright can be found in chapter 9 of the Bright 7.0 administrator manual.

We can use this script to check the RAID:

#!/bin/sh

# Count all virtual drives and those that are still in the Optimal (Optl) state.
allVD=`storcli64 /call/vall show | grep 'RAID[0-9]' | wc -l`
optVD=`storcli64 /call/vall show | grep ' Optl ' | wc -l`
logPath="/tmp/raidCheck.log"
hostname=`hostname`

if [ "$optVD" -eq "$allVD" ]; then
	# Every virtual drive is Optimal: report PASS to the Bright healthcheck.
	echo PASS
else
	# At least one virtual drive is not Optimal: report FAIL, dump the
	# per-drive details and mail them as an attachment.
	echo FAIL
	storcli64 /call/eall/sall show > "$logPath"
	mail -a "$logPath" -s "RAID ERROR! at $hostname" pol.llovet@gmail.com,HPC@boston.co.uk < /dev/null
fi
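
To try the script by hand before wiring it into Bright, make it executable and run it on a node that has the controller (the path is the same illustrative one used in the cmsh example above):

chmod +x /cm/local/apps/cmd/scripts/healthchecks/raidcheck
/cm/local/apps/cmd/scripts/healthchecks/raidcheck
# prints PASS while every virtual drive is Optl; prints FAIL and sends the mail otherwise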