Bright:config
Monitoring Configuration With cmgui
This section covers the configuration of monitoring for health checks and metrics, and the setup of the actions that are triggered by a health check or a metric threshold check. Selecting Monitoring Configuration from the resources section of cmgui makes the following tabs available (Fig 1): Overview, Metric Configuration, Health Check Configuration, Metrics, Health Checks.
The Overview Tab
The Overview tab of figure 1 shows an overview of the custom threshold actions and custom health check actions that are active in the system. Each row in the list holds the conditions that decide whether an action is launched, and is called a rule.
The Add rule button runs a convenient wizard that guides an administrator in setting up a condition, and thereby avoids having to go through the other tabs separately. The Remove button removes a selected rule. The Edit button edits aspects of a selected rule. It opens a dialog that edits a metric threshold configuration or a health check configuration. These configuration dialog options are also accessible from within the Metric Configuration and Health Check Configuration tabs. The Revert button reverts a modified state of the tab to the last saved state. The Save button saves a modified state of the tab.
The Metric Configuration Tab
The Metric Configuration tab allows device categories to be selected for the sampling of metrics. Properties of metrics related to the taking of samples can then be configured from this tab for the selected device category. These properties are the configuration of the sampling parameters themselves (for example, frequency and length of logs), but also the configuration of related properties such as thresholds, consolidation, actions launched when a threshold is crossed, and actions launched when a metric state is flapping.
With the screen displaying a list of metrics as in figure 2, the metrics in the Metric Configuration tab can now be configured and manipulated. The buttons used to do this are: Edit, Add, Remove, Thresholds, Consolidators, Revert and Save. The Save button saves as-yet-uncommitted changes made via the Add or Edit buttons. The Revert button discards unsaved edits made via the Edit button. The reversion goes back to the last save. The Remove button removes a selected metric from the metrics listed. The remaining buttons, Edit, Add, Thresholds and Consolidators, open up options dialogs. These options are now discussed.
Metric Configuration Tab: Edit And Add Options
The Metric Configuration tab of figure 2 has Add and Edit buttons. The Add button opens up a dialog to add a new metric to the list, and the Edit button opens up a dialog to edit a selected metric from the list. The dialogs allow logging options for a metric to be set or adjusted. For example, a new metric could be set for sampling by adding it to the device category from the available list of all metrics, or the sampling frequency could be changed on an existing metric, or an action could be set for a metric that has a tendency to flap.
- Gap size: The number of missing samples allowed before a null value is stored as a sample value. 2 by default.
- Threshold duration: The number of samples in the threshold zone before a threshold event is decided to have occurred. 1 by default. The failbeforedown option to the open command is actually a special use of this option.
- State Flapping: The first selection box decides what action to launch if state flapping is detected. The next box is a plain text entry box that allows a parameter to be passed to the action. The third box is a selection box again, which decides when to launch the action, depending on which of the following conditions is met:
  - Enter: if the flapping has just started. That is, the current sample is in a flapping state, and the previous sample was not in a flapping state.
  - During: if the flapping is ongoing. That is, the current and previous samples are both in a flapping state.
  - Leave: if the flapping has just stopped. That is, the current sample is not in a flapping state, and the previous sample was in a flapping state.
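The Threshold duration option can be illustrated with a minimal Python sketch. It is not how cmgui or the cluster management daemon implements the option. The function name `threshold_events` and its parameters are hypothetical, and the sketch assumes that a sample is in the threshold zone when it exceeds an upper bound, and that the samples counted towards the duration must be consecutive.

```python
from typing import Iterable, List


def threshold_events(samples: Iterable[float], bound: float, duration: int = 1) -> List[int]:
    """Return the sample indices at which a threshold event is decided to have occurred.

    Assumptions (not taken from the product): a sample is "in the threshold
    zone" when it exceeds `bound`, and the samples counted towards
    `duration` must be consecutive.
    """
    if duration < 1:
        raise ValueError("duration must be at least 1")
    events = []
    run = 0  # consecutive samples currently in the threshold zone
    for index, value in enumerate(samples):
        if value > bound:
            run += 1
            if run == duration:
                # The zone has now been occupied for `duration` samples.
                events.append(index)
        else:
            run = 0  # leaving the zone resets the count
    return events
```

Under these assumptions, `threshold_events([1, 9, 9, 1, 9, 9, 9], bound=5, duration=2)` reports events at indices 2 and 5. A single high sample is never enough on its own, which is the kind of damping that a duration greater than 1 is meant to provide.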
Health Check Configuration Tab
The Health Check Configuration tab allows device categories to be selected for evaluating the states of health checks. Properties of health checks related to evaluating these states can then be configured from this tab for the selected device category. These properties are the configuration of the state evaluation parameters themselves (for example, frequency and length of logs), but also the configuration of related properties such as severity levels based on the evaluated state, the actions to launch based on the evaluated state, or the action to launch if the evaluated state is flapping.
The Health Check Configuration tab is initially blank until a device category is selected with the Health Check Configuration selection box. The selection box selects a device category from a list of built-in categories and user-defined node categories. On selection, the health checks of the selected device category are listed in the Health Check Configuration tab. Properties of the health checks related to the evaluation of states are only available for configuration and manipulation after the health checks list is displayed. Handling health checks in this manner via groups of devices is slightly awkward for just a few machines, but for larger clusters it keeps administration scalable and thus manageable.
With the screen displaying a list of health checks as in figure 3, the health checks in the Health Check Configuration tab can now be configured and manipulated. The buttons used to do this are: Edit, Add, Remove, Revert and Save. These Health Check Configuration tab buttons behave just like the corresponding Metric Configuration tab buttons. That is, the Save button saves as-yet-uncommitted changes made via the Add or Edit buttons. The Revert button discards unsaved edits made via the Edit button. The reversion goes back to the last save. The Remove button removes a selected health check from the health checks listed.
- Pass action, Fail action, Unknown action, State Flapping: These are all action launchers, which launch an action for a given health state (PASS, FAIL, UNKNOWN) or for a flapping state, depending on whether these states are true or false. Each action launcher is associated with three input boxes. The first selection box decides what action to launch if the state is true. The next box is a plain text-entry box that allows a parameter to be passed to the action. The third box is a selection box again, which decides when to launch the action, depending on which of the following conditions is met:
  - Enter: if the state has just started being true. That is, the current sample is in that state, and the previous sample was not in that state.
  - During: if the state is true, and ongoing. That is, the current and previous samples are both in that state.
  - Leave: if the state has just stopped being true. That is, the current sample is not in that state, and the previous sample was in that state.
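The Enter, During and Leave conditions compare only the current sample with the previous one, so they can be sketched as a small Python function. This illustrates the logic described above and is not product code. The names `launch_condition` and `conditions_for` are hypothetical, and the sketch assumes that the state counts as false before the first sample.

```python
from typing import List, Optional, Sequence


def launch_condition(previous: bool, current: bool) -> Optional[str]:
    """Classify a pair of samples for one state (for example FAIL, or flapping)."""
    if current and not previous:
        return "Enter"   # the state has just started being true
    if current and previous:
        return "During"  # the state is true, and ongoing
    if previous and not current:
        return "Leave"   # the state has just stopped being true
    return None          # the state is false in both samples


def conditions_for(states: Sequence[str], watched: str) -> List[Optional[str]]:
    """Return the condition met at each sample for the watched state.

    Assumption: before the first sample, the watched state is treated as false.
    """
    results = []
    previous = False
    for state in states:
        current = state == watched
        results.append(launch_condition(previous, current))
        previous = current
    return results
```

For the sample sequence PASS, FAIL, FAIL, PASS with FAIL as the watched state, the conditions met are: none, Enter, During, Leave. An action launcher set to Enter would therefore fire once, at the second sample.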
Metrics Tab
The Metrics tab displays the list of metrics that can be set in the cluster. Some of these metrics are built-ins. Other metrics are standalone scripts. New custom metrics can also be built and added as standalone commands or scripts. Metrics can be manipulated and configured. The Save button saves as-yet-uncommitted changes made via the Add or Edit buttons. The Revert button discards unsaved edits made via the Edit button. The reversion goes back to the last save. The Remove button removes a selected metric from the list. The remaining buttons, Edit and Add, open up options dialogs. These are now discussed.
Health Checks Tab
The Health Checks tab lists available health checks (Fig 5). These can be set to run from the system by configuring them from the Health Check Configuration tab.
Actions Tab
The Actions tab lists the available actions that can be set to run on the system from the metric threshold configuration.