...
No. | Attribute | Description/Purpose |
---|---|---|
1 | Sid | Unique ID |
2 | Name | Name of the Rule |
3 | Description | Description and purpose of Rule |
4 | Enabled | Identifies if a rule must be executed (Values: true/false) |
5 | LastRun | Stores the last time the rule was executed |
6 | LastStatus | Stores the result of the last run (Values: Success/ ERRCON1/ ERRCON2...). In case of error, the time between subsequent executions can be increased and the Rule disabled with a fatal message at some point. If multiple conditions fail, the value is the most severe failure. |
7 | CheckCondition | 1..* Conditions to be executed for this rule (e.g. the presence of a facet and an HTTP connection check for a task group) |
8 | isRecoverable | Indicates if the rule has a recovery action |
9 | RecoveryAction | Recovery Action |
10 | Type | The type of condition to be tested |
11 | ComponentName | This is a logical grouping of Alarm conditions based on the target application needs |
12 | AlertCount | Count of successive Alerts raised for this Condition (Reset when Condition is successful) |
...
No. | Attribute | Description/Purpose |
---|---|---|
1 | Sid | Unique Id |
2 | Protocol | http or https |
3 | URLHost | HostName or IP Address |
4 | URLPort | Port Number |
5 | URLEndpoint | Target endpoint (e.g. fid-TopicFacet) |
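
As an illustration of how these attributes could combine into a single test URL, here is a minimal Python sketch. The `CheckEndpoint` class, its field names, the host `localhost`, and the port `9000` are hypothetical and not part of the product; the sketch also assumes the endpoint is appended as the URL path.

```python
from dataclasses import dataclass
from urllib.parse import urlunsplit


@dataclass
class CheckEndpoint:
    """Hypothetical holder for the URL attributes in the table above."""
    protocol: str      # Protocol: "http" or "https"
    url_host: str      # URLHost: host name or IP address
    url_port: int      # URLPort: port number
    url_endpoint: str  # URLEndpoint: target endpoint, e.g. "fid-TopicFacet"

    def to_url(self) -> str:
        # Reject anything other than the two documented protocols.
        if self.protocol not in ("http", "https"):
            raise ValueError(f"unsupported protocol: {self.protocol!r}")
        netloc = f"{self.url_host}:{self.url_port}"
        # Assumption: the endpoint is used as the path of the URL.
        path = "/" + self.url_endpoint.lstrip("/")
        return urlunsplit((self.protocol, netloc, path, "", ""))


if __name__ == "__main__":
    # Host and port are made-up example values.
    print(CheckEndpoint("http", "localhost", 9000, "fid-TopicFacet").to_url())
```
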
Types and
...
attributes:
The ConditionDef has each of these attributes, but they are populated based on the type of the Condition.
...
No. | Attribute | Description/Purpose |
---|---|---|
1 | Sid | Unique Id |
2 | RuleId | Identifies the Rule this condition belongs to |
3 | Type | The type of recovery action to be taken |
4 | Active | (True/False) A recovery action becomes active when the Condition is successful, and inactive when the Condition fails. This prevents the recovery from running before the Condition is up and running. It also prevents multiple attempts at recovery, especially when a previous recovery is still in progress. |
5 | AttemptAfterCount | The number of alerts after which recovery must be attempted. If value is 1, recovery will be attempted instantly. |
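
The interaction between Active, AttemptAfterCount, and the AlertCount of a condition can be sketched as follows. This is one possible reading of the table, not the product's implementation: it assumes the action is deactivated at the moment recovery is attempted and re-armed on the next success. The names `RecoveryAction` (as a Python class) and `on_check_result` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RecoveryAction:
    active: bool              # Active attribute
    attempt_after_count: int  # AttemptAfterCount attribute


def on_check_result(action: RecoveryAction, alert_count: int,
                    succeeded: bool) -> Tuple[int, bool]:
    """Return (new AlertCount, whether recovery should be attempted now)."""
    if succeeded:
        # The condition is up again: re-arm recovery and reset the counter.
        action.active = True
        return 0, False
    alert_count += 1
    attempt = action.active and alert_count >= action.attempt_after_count
    if attempt:
        # Assumption: disarm here so a recovery in progress is not repeated.
        action.active = False
    return alert_count, attempt
```

With `attempt_after_count=1`, the first failure triggers recovery immediately, which matches the description of AttemptAfterCount above.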
...
No. | Attribute | Description/Purpose |
---|---|---|
1 | Id | Unique Identifier for Alert |
2 | AlarmId | Id of the Alarm whose failure caused an alert to be raised |
3 | ClusterId | Id of the daemon on which the Alert was created (Only on the Mgmt backend, absent on the daemon) |
4 | InstanceName | Name of the node for which the Alert was raised |
5 | RaisedDate | The timestamp when the alert was raised |
6 | Cause | Reason for the failure |
7 | DetailedMessage | Detailed message if there is one. |
8 | Level | Level of the error (Error/ Fatal) |
9 | HasRead | Indicates whether the Alert is new or has been read. |
...
The following steps will be performed on the daemon:
- Rules are added by the user through the API or CSV file upload
- A job is executed every minute that does the following:
  - Picks up an AlarmCondition
  - Executes the condition
  - On Success
    - Updates the LastRun with current time and LastStatus to 'Success'
  - On Failure
    - Updates the LastRun with current time and LastStatus to 'Failure'
    - Generates an alert with Condition, ClusterId, InstanceName, and Error Reason
    - If a Recovery Action is present, tries to execute the action. Recovery could be:
      - Try and restart the failing component
      - Try and restart the A-Stack engine on the failing node
  - Finds new alerts (newer than the last run) and sends an email notification for each alert
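
The steps above can be sketched as a single pass of the job. This is a simplified illustration under stated assumptions, not the daemon's actual code: the Python classes, the `run_once` function, and the `notify` callback (standing in for the email step) are all hypothetical, and a failing check is modeled as a raised exception.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class AlarmCondition:
    name: str
    check: Callable[[], None]                     # assumed to raise on failure
    recover: Optional[Callable[[], None]] = None  # optional Recovery Action
    last_run: Optional[datetime] = None
    last_status: Optional[str] = None


@dataclass
class Alert:
    condition: str
    cluster_id: str
    instance_name: str
    cause: str
    raised_date: datetime


def run_once(conditions: List[AlarmCondition], cluster_id: str,
             instance_name: str, notify: Callable[[Alert], None],
             now: Optional[datetime] = None) -> List[Alert]:
    """Execute every condition once and return the alerts raised in this pass."""
    now = now or datetime.now(timezone.utc)
    new_alerts: List[Alert] = []
    for cond in conditions:
        try:
            cond.check()
        except Exception as exc:
            cond.last_status = "Failure"
            new_alerts.append(
                Alert(cond.name, cluster_id, instance_name, str(exc), now))
            if cond.recover is not None:
                try:
                    cond.recover()  # e.g. restart the failing component
                except Exception:
                    pass  # recovery is best effort in this sketch
        else:
            cond.last_status = "Success"
        cond.last_run = now
    # Stand-in for "find new alerts and send email notification".
    for alert in new_alerts:
        notify(alert)
    return new_alerts
```

A scheduler would call `run_once` every minute; the sketch leaves scheduling, persistence, and AlertCount handling out.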
...
Column Name | Mandatory | Description |
---|---|---|
ClusterId | Yes | This column defines the cluster the AlarmCondition belongs to; the AlarmCondition is deployed on the corresponding daemon. It is the only positional field and must always be the first column. All other columns can be shuffled, as long as the name is correct. An alarm with a non-existent ClusterId will be ignored. |
InstanceName | Yes | Defines the instance this AlarmCondition belongs to in the Cluster. |
ComponentName | Yes | This is a logical grouping of AlarmConditions within an Instance. It can be any name that is meaningful to the application. All AlarmConditions with the same component will be grouped together on the dashboard. |
Name | Yes | This is a meaningful name for an AlarmCondition. It must be unique within an instance in a Cluster. A duplicate entry will simply be ignored. |
Description | No | This is a meaningful description for an AlarmCondition. |
Enabled | Yes | This is a boolean field that indicates if this AlarmCondition is enabled. Only Conditions that are enabled will be executed to check for success or failures. An alarm condition can be enabled or disabled using the update AlarmCondition mechanism. |
Type | Yes | This column defines the type of condition being tested. The currently supported types (all case sensitive) are "Facet", "HTTP", "Sequence", "WS", "Log", and "Info". Facet: tests if a facet is present and active. HTTP: tests if an endpoint is up and responding. Sequence: tests if a scheduled job is present. WS: tests if a websocket is up and responsive. Log: monitors the log files for error and fatal conditions. Info: monitors critical parameters like NullChannel, FreeChannel. |
CheckCondition.Protocol | Yes | The protocol for the test. |
CheckCondition.URLHost | Yes | The IP Address of the host to be tested. |
CheckCondition.URLPort | Yes | The port on which the service is running. |
CheckCondition.URLEndpoint | Conditional | The endpoint at which the test is to be performed. Mandatory for types: HTTP, Sequence, Log, WS. |
CheckCondition.FacetName | Conditional | The Facet whose presence is to be tested. Mandatory for type: Facet. |
CheckCondition.SequenceName | Conditional | The scheduled job whose presence is to be tested. Mandatory for type: Sequence. |
CheckCondition.Query | Conditional | The query that must be executed on the endpoint. Mandatory for type: HTTP. |
CheckCondition.Timeout | Conditional | The amount of time the query will wait for the server to respond before declaring it a failure. Mandatory for types: Facet, HTTP, Sequence, Info, WS. |
CheckCondition.HeadersData | No | Any header information that is required for a query to execute successfully. Applicable to HTTP type. |
IsRecoverable | Yes | This boolean field defines if the failure of an AlarmCondition will trigger a recovery action. |
RecoveryAction.Active | Yes | When a recovery action is present, this field defines whether the recovery action is active when it is created. If it is set to false, the system will change it to true once the Alarm Condition is active. |
RecoveryAction.Type | Conditional | Mandatory if AlarmCondition is recoverable. There are two types of recovery actions, "HTTP" and "RESTART". HTTP: tries to execute a query on an endpoint in an attempt to recover. RESTART: restarts the target application's A-Stack engine in an attempt to recover. |
RecoveryAction.AttemptAfterCount | Conditional | Mandatory if AlarmCondition is recoverable. Number of failures after which recovery must be attempted. (Even for multiple failures, Alerts will only be raised on the first transition from "Success" to "Failure".) |
RecoveryAction.Protocol | Conditional | The protocol for the recovery. Mandatory if recovery type is HTTP. |
RecoveryAction.URLHost | Conditional | The IP Address of the host to be recovered. Mandatory if recovery type is HTTP. |
RecoveryAction.URLPort | Conditional | The port on which the service is running. Mandatory if recovery type is HTTP. |
RecoveryAction.URLEndpoint | Conditional | The endpoint at which the recovery is to be performed. Mandatory if recovery type is HTTP. |
RecoveryAction.Query | Conditional | Query to be executed at the endpoint. Mandatory if recovery type is HTTP. |
RecoveryAction.Timeout | Conditional | Amount of time the system waits for the server to respond. Mandatory if recovery type is HTTP. |
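
The mandatory and conditional rules in this table can be checked before a file is uploaded. The following Python sketch encodes them as plain data; it is an illustration only, not the product's validator. The function names `validate_row` and `validate_csv` are hypothetical, and the sketch assumes booleans are written as `true`/`false` and that the first line of the file is the header row.

```python
import csv
import io

ALWAYS_REQUIRED = [
    "ClusterId", "InstanceName", "ComponentName", "Name", "Enabled", "Type",
    "CheckCondition.Protocol", "CheckCondition.URLHost",
    "CheckCondition.URLPort", "IsRecoverable", "RecoveryAction.Active",
]
SUPPORTED_TYPES = {"Facet", "HTTP", "Sequence", "WS", "Log", "Info"}
REQUIRED_BY_TYPE = {
    "CheckCondition.URLEndpoint": {"HTTP", "Sequence", "Log", "WS"},
    "CheckCondition.FacetName": {"Facet"},
    "CheckCondition.SequenceName": {"Sequence"},
    "CheckCondition.Query": {"HTTP"},
    "CheckCondition.Timeout": {"Facet", "HTTP", "Sequence", "Info", "WS"},
}
RECOVERY_REQUIRED = ["RecoveryAction.Type", "RecoveryAction.AttemptAfterCount"]
RECOVERY_HTTP_REQUIRED = [
    "RecoveryAction.Protocol", "RecoveryAction.URLHost",
    "RecoveryAction.URLPort", "RecoveryAction.URLEndpoint",
    "RecoveryAction.Query", "RecoveryAction.Timeout",
]


def validate_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    def missing(col):
        return not (row.get(col) or "").strip()

    errors = [f"{col} is mandatory" for col in ALWAYS_REQUIRED if missing(col)]
    ctype = (row.get("Type") or "").strip()
    if ctype and ctype not in SUPPORTED_TYPES:  # type names are case sensitive
        errors.append(f"unsupported Type: {ctype}")
    for col, types in REQUIRED_BY_TYPE.items():
        if ctype in types and missing(col):
            errors.append(f"{col} is mandatory for type {ctype}")
    # Assumption: booleans are written as "true" / "false".
    if (row.get("IsRecoverable") or "").strip().lower() == "true":
        errors += [f"{col} is mandatory when IsRecoverable is true"
                   for col in RECOVERY_REQUIRED if missing(col)]
        if (row.get("RecoveryAction.Type") or "").strip() == "HTTP":
            errors += [f"{col} is mandatory for HTTP recovery"
                       for col in RECOVERY_HTTP_REQUIRED if missing(col)]
    return errors


def validate_csv(text):
    """Map CSV line numbers to their problems; ClusterId must be column one."""
    reader = csv.DictReader(io.StringIO(text))
    if not reader.fieldnames or reader.fieldnames[0] != "ClusterId":
        raise ValueError("ClusterId must be the first column")
    problems = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        errors = validate_row(row)
        if errors:
            problems[line_no] = errors
    return problems
```

Keeping the rules in dictionaries rather than in branching code makes it easier to update the sketch if a new condition type is added.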
...
- FACET_FREQ: Frequency with which Facet type AlarmConditions are executed.
- HTTP_FREQ: Frequency with which HTTP type AlarmConditions are executed.
- SEQ_FREQ: Frequency with which Sequence type AlarmConditions are executed.
- LOG_FREQ: Frequency with which Log type AlarmConditions are executed.
- INFO_FREQ: Frequency with which Info type AlarmConditions are executed.
- WS_FREQ: Frequency with which WS type AlarmConditions are executed.
- ALERT_PURGE_FREQ: Frequency with which stale alerts are purged.
- ALERT_PURGE_LIMIT_DAYS: How old an alert must be, in days, before it is considered stale.
- EMAIL_FREQ: Frequency with which Alert emails are sent.
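
To illustrate how such settings might be applied, the Python sketch below decides whether a condition type is due to run and which alerts are stale. The helper names `is_due`, `is_stale`, and `purge_stale` are hypothetical, alerts are modeled as plain dictionaries, and the frequency unit is assumed to be seconds because the list above does not state it; only ALERT_PURGE_LIMIT_DAYS names its unit.

```python
from datetime import datetime, timedelta, timezone


def is_due(last_run, now, freq_seconds):
    """True if a check has never run or its frequency interval has elapsed.

    Assumption: the *_FREQ settings are expressed in seconds.
    """
    return last_run is None or now - last_run >= timedelta(seconds=freq_seconds)


def is_stale(raised_date, now, limit_days):
    """True if an alert is older than ALERT_PURGE_LIMIT_DAYS."""
    return now - raised_date > timedelta(days=limit_days)


def purge_stale(alerts, now, limit_days):
    """Return only the alerts that are not yet stale."""
    return [a for a in alerts
            if not is_stale(a["RaisedDate"], now, limit_days)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    alerts = [
        {"Id": 1, "RaisedDate": now - timedelta(days=40)},
        {"Id": 2, "RaisedDate": now - timedelta(days=1)},
    ]
    # 30 is a made-up example value for ALERT_PURGE_LIMIT_DAYS.
    print([a["Id"] for a in purge_stale(alerts, now, limit_days=30)])
```
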
...
- Purgefreq: Frequency with which stale alerts are purged.
- Purgelimit: How old an alert must be before it is considered stale.
...