Check disk failure and send alert


When no disks have failed, the output of storcli is:

[root@disk-test-node1 ~]# storcli64 /c0/vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
-------------------------------------------------------------
0/0   RAID6 Optl  RW     No      RWBD  -   ON  1.063 TB      
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency


When a disk failure occurs, the output is:

[root@disk-test-node1 ~]# storcli64 /c0/vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
-------------------------------------------------------------
0/0   RAID6 Pdgd  RW     No      RWBD  -   ON  1.063 TB      
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

To check whether any disk has failed, we can use this command:

[root@disk-test-node1 ~]# storcli64 /c0/vall show |grep '\ Optl\ '
0/0   RAID6 Optl  RW     No      RWBD  -   ON  1.063 TB
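
The same comparison can be made across every controller in the node by counting all virtual drives and counting only those in the Optl (Optimal) state; this is exactly what the healthcheck script at the end of this page does. Run by hand, the check looks like this:

# Count every virtual drive on every controller, then only the Optimal ones
allVD=`storcli64 /call/vall show | grep 'RAID[0-9]' | wc -l`
optVD=`storcli64 /call/vall show | grep '\ Optl\ ' | wc -l`
echo "$optVD of $allVD virtual drives are Optimal"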


To add the health check in Bright:

1- Add the healthcheck:

# cmsh
% monitoring healthchecks
% add <healthcheck_name>
% set command <path_to_your_script>
% commit

2- Configure the healthcheck:

% monitoring setup healthconf <category_name>
% add <healthcheck_name>
% set checkinterval <interval>
% commit
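
As a concrete walk-through of both steps, here is a hypothetical session; the healthcheck name, script path, category name and interval are placeholders chosen for illustration, not values required by Bright:

# cmsh
% monitoring healthchecks
% add raidcheck
% set command /path/to/raidcheck.sh
% commit
% monitoring setup healthconf default
% add raidcheck
% set checkinterval 300
% commit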

You can then attach a fail action that runs when the healthcheck fails, such as sending an email alert or powering the node off. More information about metrics and monitoring in Bright can be found in chapter 9 of the Bright 7.0 administrator manual.
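
As a sketch of how a fail action could be wired up in cmsh (this assumes the healthconf entry exposes a failactions parameter, as in Bright 7.x, and that a suitable e-mail or power-off action already exists under monitoring actions; check the admin manual for the exact names in your version):

# cmsh
% monitoring actions
% list
% monitoring setup healthconf <category_name>
% use <healthcheck_name>
% set failactions <action_name>
% commit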

We can use this script to check the RAID:

#!/bin/sh
# Healthcheck for Bright: compare the number of virtual drives reported by
# storcli with the number of virtual drives in the Optimal state.

allVD=`storcli64 /call/vall show | grep 'RAID[0-9]' | wc -l`
optVD=`storcli64 /call/vall show | grep '\ Optl\ ' | wc -l`
logPath="/tmp/raidCheck.log"
hostname=`hostname`

if [ "$optVD" -eq "$allVD" ]; then
    # Every virtual drive is Optimal: report PASS to Bright.
    echo PASS
else
    # At least one virtual drive is degraded or offline: report FAIL,
    # dump the per-drive status to a log file and mail it out.
    echo FAIL
    storcli64 /call/eall/sall show > "$logPath"
    mail -a "$logPath" -s "RAID ERROR! at $hostname" pol.llovet@gmail.com,HPC@boston.co.uk < /dev/null
fi
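
Before handing the script to Bright, it is worth running it once by hand on a node with the controller; it should print PASS on a healthy array and FAIL (and send the alert mail) on a degraded one. The path below is only an example location:

# Save the script, make it executable, and run it once manually
chmod +x /path/to/raidcheck.sh
/path/to/raidcheck.sh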