Hardware Monitoring
Monitoring Concepts and Definitions
This section discusses the concepts of monitoring and defines the terms used, so that the features of the monitoring framework covered later in this chapter can be understood more clearly.
Metric
A metric is a property of a device that can be monitored. It has a numeric value and can have units, unless it is unknown, i.e. has a null value. Examples are temperature, load average, and free space. A metric can be a built-in, which means it is an integral part of the monitoring framework, or it can be a standalone script. The word metric is often used to mean the script or object associated with a metric, as well as a metric value. The context makes it clear which is meant.
Action
An action is a standalone script or a built-in command that is executed when a condition is met. This condition can be:
- health checking
- threshold checking
- state flapping
Threshold
A threshold is a particular value in a sampled metric. A sample can cross the threshold, thereby entering or leaving a zone that is demarcated by the threshold.
A threshold can be configured to launch an action according to threshold crossing conditions. The "New Threshold" dialog of cmgui has three action launch configuration options:
- Enter: if the sample has entered into the zone and the previous sample was not in the zone
- Leave: if the sample has left the zone and the previous sample was in the zone
- During: if the sample was in the zone, and the previous sample was also in the zone
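The three launch conditions above can be illustrated with a minimal Python sketch. It is not the monitoring framework's own code: the function names, the `in_zone` predicate, and the choice to treat the state before the first sample as "outside the zone" are assumptions made for illustration only.

```python
from typing import Callable, List, Optional, Sequence


def crossing_event(prev_in_zone: bool, curr_in_zone: bool) -> Optional[str]:
    """Classify a pair of consecutive samples as Enter, Leave, During, or None."""
    if curr_in_zone and not prev_in_zone:
        return "Enter"   # sample is in the zone, previous sample was not
    if prev_in_zone and not curr_in_zone:
        return "Leave"   # sample is out of the zone, previous sample was in it
    if prev_in_zone and curr_in_zone:
        return "During"  # both samples are in the zone
    return None          # both samples are outside the zone: nothing to launch


def classify_samples(
    samples: Sequence[float], in_zone: Callable[[float], bool]
) -> List[Optional[str]]:
    """Return the crossing event for each sample in order."""
    events: List[Optional[str]] = []
    prev = False  # assumption: before the first sample, the metric is outside the zone
    for value in samples:
        curr = in_zone(value)
        events.append(crossing_event(prev, curr))
        prev = curr
    return events


# Hypothetical zone: values above 80.0 are inside the zone.
events = classify_samples([70, 85, 90, 75], lambda v: v > 80.0)
print(events)
```

In this sketch an action configured for Enter would be launched on the second sample, one configured for During on the third, and one configured for Leave on the fourth.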
A threshold zone also has a settable severity associated with it. This value is processed for the AlertLevel metric when an action is triggered by a threshold event.
Health Check
A health check value is a state that is the response to running a health check script at a regular time interval. There are three possible response values: PASS, FAIL, or UNKNOWN.
A health check has a settable severity associated with a FAIL or UNKNOWN response. This value is processed for the AlertLevel metric when the health check runs.
A health check can also launch an action based on any of the response values, similar to the way that an action is launched by a metric with a threshold condition.
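The relationship between a health check response and its severity can be sketched as follows. This is only an illustration under stated assumptions: the `run_health_check` and `severity_for` helpers are hypothetical, mapping an exception to UNKNOWN is a choice made for this sketch, and the severity numbers are arbitrary examples rather than values from the framework.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class HealthStatus(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"


def run_health_check(check: Callable[[], bool]) -> HealthStatus:
    """Run one check and map its outcome to one of the three response values."""
    try:
        ok = check()
    except Exception:
        # Assumption for this sketch: a check that cannot complete is UNKNOWN.
        return HealthStatus.UNKNOWN
    return HealthStatus.PASS if ok else HealthStatus.FAIL


def severity_for(
    status: HealthStatus, severities: Dict[HealthStatus, int]
) -> Optional[int]:
    """Return the configured severity, or None if the status has none (e.g. PASS)."""
    return severities.get(status)


# Hypothetical severities configured for FAIL and UNKNOWN responses.
configured = {HealthStatus.FAIL: 30, HealthStatus.UNKNOWN: 10}

status = run_health_check(lambda: False)  # a check that reports failure
print(status.value, severity_for(status, configured))
```

A scheduler would call `run_health_check` at the configured interval, and could launch an action keyed on whichever status is returned.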
Severity
Severity is a positive integer value that the administrator assigns to a threshold-crossing event or to a health check status event. It takes one of these 5 suggested values:
AlertLevel
AlertLevel is a special metric. It is not sampled, but it is re-calculated when an event with an associated severity occurs. There are two types of AlertLevel metrics:
- AlertLevel (max), which is simply the maximum of the latest severity values of all the events. The aim of this metric is to alert the administrator to the severity of the most important issue.
- AlertLevel (sum), which is the sum of the latest severity values of all the events. The aim of this metric is to alert the administrator to the overall severity of issues.
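The two AlertLevel calculations can be sketched in a few lines of Python. The event names and severity numbers below are made up for illustration, and returning 0 when no event has a severity is an assumption of this sketch, not documented behavior.

```python
from typing import Dict


def alert_level_max(latest_severity: Dict[str, int]) -> int:
    """Maximum of the latest severity values over all events."""
    # Assumption: with no events recorded, the level is taken to be 0.
    return max(latest_severity.values(), default=0)


def alert_level_sum(latest_severity: Dict[str, int]) -> int:
    """Sum of the latest severity values over all events."""
    return sum(latest_severity.values())


# Hypothetical latest severity per event source (names and numbers are examples).
latest = {"cpu_threshold": 10, "disk_health": 30, "temp_threshold": 20}

print(alert_level_max(latest))  # severity of the most important issue
print(alert_level_sum(latest))  # overall severity of all issues
```

Because each event contributes only its latest severity, both values are re-calculated whenever a new event with a severity arrives, rather than being sampled on a schedule.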
InfoMessages
InfoMessages are messages that inform the administrator of the reason for a health status event change in the cluster. These show up in the Overview tab of nodes.
Flapping
Flapping, or State Flapping, is when a state transition occurs too many times over a number of samples. For example, if the CPUUser metric crossed the threshold zone 7 times within 12 samples (the default values for flap detection), then it would by default be detected as flapping. A flapping alert would then be recorded in the event viewer, and a flapping action could also be launched if configured to do so. Flapping configuration for cmgui is covered for threshold crossing events in section 10.4.2, where the metric configuration tab's Edit and Add dialogs are explained. It is also covered for health check state changes where the health check configuration tab's Edit and Add dialogs are explained.
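Flap detection of this kind can be sketched with a sliding window over the most recent samples. This is a simplified illustration, not the framework's algorithm: the `FlapDetector` class and its parameter names are hypothetical, and "7 times within 12 samples" is interpreted here as at least 7 state changes between consecutive samples in the last 12.

```python
from collections import deque
from typing import Deque, Iterable


def count_transitions(states: Iterable[bool]) -> int:
    """Count how many times consecutive states differ."""
    seq = list(states)
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


class FlapDetector:
    """Track recent in-zone states and report when they change too often."""

    def __init__(self, flap_count: int = 7, window: int = 12) -> None:
        # Defaults mirror the 7-in-12 figures given in the text.
        self.flap_count = flap_count
        self.states: Deque[bool] = deque(maxlen=window)  # keeps only the last `window` states

    def update(self, in_zone: bool) -> bool:
        """Record one sample's state and return True if the metric is flapping."""
        self.states.append(in_zone)
        return count_transitions(self.states) >= self.flap_count


# A metric that alternates in and out of the zone on every sample.
detector = FlapDetector()
results = [detector.update(i % 2 == 0) for i in range(12)]
print(results)
```

With a strictly alternating metric, the eighth sample produces the seventh transition, so the detector starts reporting flapping from that sample onward. A metric that stays in one state never triggers it.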