Check disk failure and send alert
When no disks have failed, the output of storcli is:
<syntaxhighlight>
[root@disk-test-node1 ~]# storcli64 /c0/vall show
Controller = 0
Status = Success
Description = None

Virtual Drives :
==============
-------------------------------------------------------------
DG/VD TYPE State Access Consist Cache Cac sCC Size Name
-------------------------------------------------------------
0/0 RAID6 Optl RW No RWBD - ON 1.063 TB
-------------------------------------------------------------
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
</syntaxhighlight>
When a disk failure occurs, the output is:
<syntaxhighlight>
[root@disk-test-node1 ~]# storcli64 /c0/vall show
Controller = 0
Status = Success
Description = None

Virtual Drives :
==============
-------------------------------------------------------------
DG/VD TYPE State Access Consist Cache Cac sCC Size Name
-------------------------------------------------------------
0/0 RAID6 Pdgd RW No RWBD - ON 1.063 TB
-------------------------------------------------------------
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
</syntaxhighlight>

To check if any disk failure occurs, we can use this command:
<syntaxhighlight>
[root@disk-test-node1 ~]# storcli64 /c0/vall show |grep '\ Optl\ '
0/0 RAID6 Optl RW No RWBD - ON 1.063 TB
</syntaxhighlight>
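For scripting, the same idea can be reduced to a count comparison: every virtual drive should show up as Optl. A minimal sketch, assuming a single controller at /c0 (the full healthcheck script further down does the same across all controllers):

<syntaxhighlight>
# Compare the number of Optimal virtual drives against all virtual drives on controller 0
all=$(storcli64 /c0/vall show | grep -c 'RAID[0-9]')
ok=$(storcli64 /c0/vall show | grep -c '\ Optl\ ')
[ "$ok" -eq "$all" ] && echo PASS || echo FAIL
</syntaxhighlight>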
To add the health check in Bright:
1- Adding the healthcheck:

<syntaxhighlight>
# cmsh
% monitoring healthchecks
% add <healthcheck_name>
% set command <path_to_your_script>
% commit
</syntaxhighlight>

2- Configuring the healthcheck:
<syntaxhighlight>
% monitoring setup healthconf <category_name>
% add <healthcheck_name>
% set checkinterval <interval>
% commit
</syntaxhighlight>

You can then add a fail action that runs when the healthcheck fails, such as sending an email alert or powering the node off. You can find more information about metrics and monitoring in Bright in chapter 9 of the Bright 7.0 admin manual.
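As a sketch of attaching such a fail action in cmsh: the failactions parameter and the action name are assumptions based on typical Bright 7 usage (the available actions can be listed under monitoring actions), so check chapter 9 of the manual for the exact syntax on your version:

<syntaxhighlight>
# cmsh
% monitoring setup healthconf <category_name>
% use <healthcheck_name>
% set failactions <action_name>
% commit
</syntaxhighlight>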
We can use this script to check the RAID:
<syntaxhighlight>
#!/bin/sh

# Count all virtual drives and those reported as Optimal, across all controllers
allVD=`storcli64 /call/vall show | grep 'RAID[0-9]' | wc -l`
optVD=`storcli64 /call/vall show | grep '\ Optl\ ' | wc -l`
logPath="/tmp/raidCheck.log"
hostname=`hostname`

if [ "$optVD" -eq "$allVD" ]; then
    # Every virtual drive is Optimal
    echo PASS
else
    # At least one virtual drive is not Optimal: dump drive details and send an alert
    echo FAIL
    storcli64 /call/eall/sall show > "$logPath"
    mail -a "$logPath" -s "RAID ERROR! at $hostname" pol.llovet@gmail.com,HPC@boston.co.uk < /dev/null
fi
</syntaxhighlight>
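To tie the script back to the cmsh steps above, it has to be executable on the nodes and its path set as the healthcheck command; Bright evaluates the PASS or FAIL printed on stdout. A minimal sketch; the filename and install path below are only examples, adjust them to wherever you keep healthcheck scripts:

<syntaxhighlight>
# Example name and path only -- place the script wherever your healthcheck scripts live
install -m 0755 raidCheck.sh /cm/local/apps/cmd/scripts/healthchecks/raidCheck

# Run it once by hand to confirm it prints PASS (or FAIL) before wiring it into cmsh
/cm/local/apps/cmd/scripts/healthchecks/raidCheck
</syntaxhighlight>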