[Diagram: Alarm Hierarchy]
Cluster Monitoring Model Description
AlarmCondition Model
| Attribute | Description/ Purpose |
---|
1 | Sid | Unique ID |
2 | Name | Name of the Rule |
3 | Description | Description and purpose of Rule |
4 | Enabled | Identifies if a rule must be executed (Values: true/false) |
5 | LastRun | Stores the last time the rule was executed |
6 | LastStatus | Stores the result of the last run (Values: Success/ ERRCON1/ ERRCON2..). In case of error, the time between subsequent executions can be increased and the Rule disabled with a fatal message at some point. If the rule has multiple conditions, the value is the most severe failure |
7 | CheckCondition | 1..* Conditions to be executed for this rule (e.g. a facet presence check and an HTTP connection check for a task group) |
8 | isRecoverable | Indicates if the rule has a recovery action |
9 | RecoveryAction | Recovery Action |
10 | Type | The type of condition to be tested |
11 | ComponentName | This is a logical grouping of Alarm conditions based on the target application needs |
12 | AlertCount | Count of successive Alerts raised for this Condition (Reset when Condition is successful) |
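As an illustration only, the attributes above could be held in a record like the following Python sketch. The field names mirror the table; the Python types, the use of Sids to reference conditions and recovery actions, and the `record_result` helper are assumptions, not part of the model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class AlarmCondition:
    # Field names follow the attribute table; the types are assumed.
    sid: str
    name: str
    description: str = ""
    enabled: bool = True
    last_run: Optional[datetime] = None
    last_status: Optional[str] = None      # "Success" or an error code
    check_conditions: List[str] = field(default_factory=list)  # assumed: ConditionDef Sids
    is_recoverable: bool = False
    recovery_action: Optional[str] = None  # assumed: Sid of a recovery action
    type: str = ""
    component_name: str = ""
    alert_count: int = 0

    def record_result(self, status: str, when: datetime) -> None:
        """Hypothetical helper: store a run result and maintain AlertCount."""
        self.last_run = when
        self.last_status = status
        if status == "Success":
            self.alert_count = 0   # reset when the Condition is successful
        else:
            self.alert_count += 1  # one more successive alert
```

Calling `record_result` with an error code twice and then with "Success" leaves `alert_count` at 0, which matches the reset rule described for AlertCount.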
ConditionDef
| Attribute | Description/ Purpose |
---|
1 | Sid | Unique Id |
2 | Protocol | http or https |
3 | URLHost | HostName or IP Address |
4 | URLPort | Port Number |
5 | URLEndpoint | Target endpoint (e.g. fid-TopicFacet) |
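A minimal sketch of how the URL parts above might be combined into one target URL. The composition rule (`protocol://host:port/endpoint`) is an assumption, as are the Python names; the host and port in the usage note are made-up values.

```python
from dataclasses import dataclass


@dataclass
class ConditionDef:
    sid: str
    protocol: str      # "http" or "https"
    url_host: str      # host name or IP address
    url_port: int
    url_endpoint: str  # e.g. "fid-TopicFacet"

    def url(self) -> str:
        """Assemble the target URL (this composition rule is assumed)."""
        if self.protocol not in ("http", "https"):
            raise ValueError(f"unsupported protocol: {self.protocol}")
        endpoint = self.url_endpoint.lstrip("/")
        return f"{self.protocol}://{self.url_host}:{self.url_port}/{endpoint}"
```

For example, `ConditionDef("1", "http", "localhost", 8080, "fid-TopicFacet").url()` yields `http://localhost:8080/fid-TopicFacet`.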
Types and attributes:
The ConditionDef has each of these attributes, but only those relevant to the type of the Condition are populated.
Type: Facet
Test: Check if the Facet is running
| Attribute | Description/ Purpose |
---|
1 | FacetName | Name of the facet to be checked |
2 | URL | The URL to be called |
Type: HTTP
Test: Check if the HTTP endpoint is UP
| Attribute | Description/ Purpose |
---|
1 | URL | The URL to be called |
2 | Query | The query to be executed |
3 | TimeOut | AutoConnectionTimeout for the request |
4 | Headers? | 0..* headers as key-value pairs |
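The HTTP check could look roughly like the sketch below, written with the Python standard library. Sending the Query as a POST body, treating any 2xx status as "UP", and the shape of the returned status string are all assumptions; the `opener` parameter is a hypothetical hook that lets the call be replaced in tests.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def check_http(url: str, query: str, timeout: float,
               headers: Optional[Dict[str, str]] = None,
               opener: Callable = urllib.request.urlopen) -> str:
    """Return "Success" for a 2xx answer, otherwise a "Failure: ..." string."""
    # Assumption: the Query is sent as the body of a POST request.
    request = urllib.request.Request(
        url, data=query.encode("utf-8"), headers=headers or {}, method="POST")
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, DNS failures, timeouts and HTTP errors.
        return f"Failure: {exc}"
    return "Success" if 200 <= status < 300 else f"Failure: HTTP {status}"
```

The TimeOut attribute maps onto the `timeout` argument, and the 0..* Headers map onto the `headers` dictionary.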
Type: WS
Test: Check if the websocket endpoint is UP
| Attribute | Description/ Purpose |
---|
1 | URL | The URL to be called |
2 | Query | The query to be executed |
3 | TimeOut | AutoConnectionTimeout for the request |
Type: Sequence
Test: Check if the sequence is executing
| Attribute | Description/ Purpose |
---|
1 | URL | The URL to be called |
2 | SequenceName | The sequence to be tested |
Type: Log
Test: Check if a matching message is present in the log
| Attribute | Description/ Purpose |
---|
1 | Message | Message to be checked |
2 | Source | Source of the message |
3 | Level | Level of the message |
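One possible reading of the Log condition is a match on Source and Level plus a substring match on Message, as sketched below. The `LogRecord` type and the matching rule are assumptions; whether a match means the condition passed or failed is left to the caller.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LogRecord:
    # Hypothetical shape of one log entry.
    message: str
    source: str
    level: str


def log_condition_matches(records: Iterable[LogRecord],
                          message: str, source: str, level: str) -> bool:
    """True if any record has the given source and level and contains the message text."""
    return any(
        r.source == source and r.level == level and message in r.message
        for r in records
    )
```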
RecoveryActionModel
| Attribute | Description/ Purpose |
---|
1 | Sid | Unique Id |
2 | RuleId | Identifies the Rule this recovery action belongs to |
3 | Type | The type of recovery action to be taken |
4 | Active | (True/False) A recovery action becomes active when the Condition is successful, and inactive when a condition fails. This prevents the recovery from running before the condition is up and running. It also prevents multiple attempts at recovery, especially when a previous recovery is still in progress |
5 | AttemptAfterCount | The number of alerts after which recovery must be attempted. If the value is 1, recovery is attempted immediately. |
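The interplay of Active and AttemptAfterCount can be sketched as a small gate. This is one reading of the table, not a confirmed design: the action is armed when the condition succeeds, and it is disarmed at the moment a recovery is started (rather than on the first failure) so that an AttemptAfterCount greater than 1 can still be reached. All method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryAction:
    sid: str
    rule_id: str
    type: str                     # e.g. "ExecuteQuery" or "Script"
    active: bool = False
    attempt_after_count: int = 1

    def on_condition_success(self) -> None:
        self.active = True        # armed once the condition has been seen healthy

    def should_attempt(self, alert_count: int) -> bool:
        """Assumed gate: armed, and enough successive alerts have been raised."""
        return self.active and alert_count >= self.attempt_after_count

    def on_recovery_started(self) -> None:
        self.active = False       # blocks further attempts until the next success
```

With `attempt_after_count=2`, the action does nothing on the first alert, allows one attempt on the second, and then stays inactive until the condition succeeds again.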
Type: ExecuteQuery
Action: Execute the query against the URL
| Attribute | Description/ Purpose |
---|
1 | URL | The URL to be called |
2 | Query | Query to be executed |
Type: Script
Action: Execute the script at the given location
| Attribute | Description/ Purpose |
---|
1 | ScriptLocation | Location of the script to be executed |
Alert Model
| Attribute | Description/ Purpose |
---|
1 | Id | Unique Identifier for Alert |
2 | AlarmId | Id of the Alarm whose failure caused the alert to be raised |
3 | ClusterId | Id of the daemon on which the Alert was created (Only on the Mgmt backend, absent on the daemon) |
4 | InstanceName | Name of the node for which the Alert was raised |
5 | RaisedDate | The timestamp when the alert was raised |
6 | Cause | Reason for the failure |
7 | DetailedMessage | Detailed message if there is one. |
8 | Level | Level of the error (Error/ Fatal) |
9 | HasRead | Indicates whether the Alert is new or has been read. |
...
- Rules are added by the user through the API or a CSV file upload
- A job is executed every minute that does the following:
  - Picks up an AlarmCondition
  - Executes the condition
  - On Success
    - Updates LastRun with the current time and LastStatus to 'Success'
  - On Failure
    - Updates LastRun with the current time and LastStatus to 'Failure'
    - Generates an alert with Condition, ClusterId, InstanceName, and Error Reason
    - If a Recovery Action is present, tries to execute it. The recovery could be:
      - Try to restart the failing component
      - Try to restart the A-Stack engine on the failing node
  - Finds new alerts (newer than the last run) and sends an email notification for each alert
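The job steps above can be sketched as a single pass over the conditions. This is a simplified illustration under stated assumptions: conditions are plain dictionaries keyed by the attribute names, `execute`, `recover` and `notify` are hypothetical callbacks supplied by the caller, "new alerts" are taken to be the alerts raised in this pass, and the once-a-minute scheduling is left out.

```python
from datetime import datetime
from typing import Callable, Dict, List, Optional


def run_alarm_job(conditions: List[Dict],
                  execute: Callable[[Dict], Optional[str]],
                  recover: Callable[[Dict], None],
                  notify: Callable[[Dict], None],
                  cluster_id: str, instance_name: str,
                  now: Optional[datetime] = None) -> List[Dict]:
    """Run one pass of the job and return the alerts raised during it."""
    now = now or datetime.now()
    alerts: List[Dict] = []
    for cond in conditions:
        if not cond.get("Enabled", True):
            continue                        # disabled rules are skipped
        error = execute(cond)               # None means the condition passed
        cond["LastRun"] = now
        if error is None:
            cond["LastStatus"] = "Success"
            continue
        cond["LastStatus"] = "Failure"
        alerts.append({"AlarmId": cond["Sid"], "ClusterId": cluster_id,
                       "InstanceName": instance_name, "RaisedDate": now,
                       "Cause": error})
        if cond.get("RecoveryAction") is not None:
            recover(cond)                   # e.g. restart the failing component
    for alert in alerts:
        notify(alert)                       # e.g. send an email notification
    return alerts
```

Passing the side effects in as callbacks keeps the pass itself free of network and mail dependencies, so the flow can be exercised with simple stand-in functions.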
Alarm State Transition Diagram
[Diagram: Alarm State Transition Copy]
Command Line API
Load Alarm Conditions
...