Cluster Monitoring Model Description
AlarmCondition Model
No. | Attribute | Description / Purpose
---|---|---
1 | Sid | Unique ID |
2 | Name | Name of the Rule |
3 | Description | Description and purpose of Rule |
4 | Enabled | Identifies if a rule must be executed (Values: true/false) |
5 | LastRun | Stores the last time the rule was executed |
6 | LastStatus | Stores the result of the last run (Values: Success / ERRCON1 / ERRCON2 ...). In case of error, the time between subsequent executions can be increased and the Rule disabled with a fatal message at some point. If multiple conditions fail, the value reflects the most severe failure
7 | CheckCondition | 1..* Conditions to be executed for this rule (e.g. a facet presence check and an HTTP connection check for a task group)
8 | isRecoverable | Indicates if the rule has a recovery action |
9 | RecoveryAction | Recovery Action |
10 | Type | The type of condition to be tested |
11 | ComponentName | A logical grouping of Alarm conditions based on the needs of the target application
12 | AlertCount | Count of successive Alerts raised for this Condition (reset when the Condition succeeds)
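For illustration, a single AlarmCondition serialized as JSON might look like the sketch below. The attribute names come from the table above; the JSON shape, the idea that CheckCondition holds references to ConditionDef records, and all sample values are assumptions for illustration only.

```json
{
  "AlarmCondition": {
    "Sid": "AC-001",
    "Name": "Test_Endpoint_Federation",
    "Description": "Checks that the federation endpoint is reachable",
    "Enabled": "true",
    "LastRun": "2024-01-01T10:00:00Z",
    "LastStatus": "Success",
    "CheckCondition": ["CD-001", "CD-002"],
    "isRecoverable": "true",
    "RecoveryAction": "RA-001",
    "Type": "HTTP",
    "ComponentName": "Federation",
    "AlertCount": "0"
  }
}
```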
ConditionDef
No. | Attribute | Description / Purpose
---|---|---
1 | Sid | Unique Id |
2 | Protocol | http or https |
3 | URLHost | HostName or IP Address |
4 | URLPort | Port Number |
5 | URLEndpoint | Target endpoint (e.g. fid-TopicFacet) |
Types and attributes:
A ConditionDef has all of the attributes listed below, but only the ones relevant to the Condition's type are populated.
Type: Facet
Test: Check if the Facet is running
No. | Attribute | Description / Purpose
---|---|---
1 | FacetName | Name of the facet to be checked |
2 | URL | The URL to be called |
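As a hypothetical example, a Facet-type ConditionDef might populate only the base and facet-related attributes; the JSON shape and all values below are illustrative assumptions.

```json
{
  "ConditionDef": {
    "Sid": "CD-001",
    "Protocol": "http",
    "URLHost": "10.0.0.12",
    "URLPort": "8080",
    "URLEndpoint": "fid-TopicFacet",
    "FacetName": "TopicFacet",
    "URL": "http://10.0.0.12:8080/fid-TopicFacet"
  }
}
```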
Type: HTTP
Test: Check if the HTTP endpoint is UP
No. | Attribute | Description / Purpose
---|---|---
1 | URL | The URL to be called |
2 | Query | The query to be executed |
3 | TimeOut | AutoConnectionTimeout for the request |
4 | Headers (optional) | 0..* headers as key-value pairs
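By contrast, an HTTP-type ConditionDef might populate the HTTP attributes. The sketch below is again an assumption; the TimeOut unit, the Query payload, and the header values are placeholders, not defined here.

```json
{
  "ConditionDef": {
    "Sid": "CD-002",
    "Protocol": "https",
    "URLHost": "api.example.com",
    "URLPort": "443",
    "URLEndpoint": "fid-HealthFacet",
    "URL": "https://api.example.com:443/fid-HealthFacet",
    "Query": "QUERY_TO_EXECUTE",
    "TimeOut": "5000",
    "Headers": { "Authorization": "AUTH_TOKEN" }
  }
}
```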
Type: WS
Test: Check if the websocket endpoint is UP
No. | Attribute | Description / Purpose
---|---|---
1 | URL | The URL to be called |
2 | Query | The query to be executed |
3 | TimeOut | AutoConnectionTimeout for the request |
Type: Sequence
Test: Check if the sequence is executing
No. | Attribute | Description / Purpose
---|---|---
1 | URL | The URL to be called |
2 | SequenceName | The sequence to be tested |
Type: Log
Test: Check the log for a matching message
No. | Attribute | Description / Purpose
---|---|---
1 | Message | Message to be checked |
2 | Source | Source of the message |
3 | Level | Level of the message |
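A Log-type ConditionDef would need only the message-matching attributes, which illustrates how population varies by type; the values below are placeholders.

```json
{
  "ConditionDef": {
    "Sid": "CD-003",
    "Message": "Connection pool exhausted",
    "Source": "fid-TopicFacet",
    "Level": "Error"
  }
}
```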
RecoveryActionModel
No. | Attribute | Description / Purpose
---|---|---
1 | Sid | Unique Id |
2 | RuleId | Identifies the Rule this condition belongs to |
3 | Type | The type of recovery action to be taken |
4 | Active | (True/False) A recovery action becomes active when the Condition is successful and inactive when the Condition fails. This prevents the recovery from running before the monitored condition is up and running, and it prevents multiple recovery attempts, especially when a previous recovery is still in progress
5 | AttemptAfterCount | The number of alerts after which recovery must be attempted. If the value is 1, recovery is attempted immediately.
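A hypothetical RecoveryAction record with only the common attributes populated might look as follows; the JSON shape and values are assumptions, and the type-specific attributes are illustrated after the type tables below.

```json
{
  "RecoveryAction": {
    "Sid": "RA-001",
    "RuleId": "AC-001",
    "Type": "ExecuteQuery",
    "Active": "true",
    "AttemptAfterCount": "3"
  }
}
```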
Type: ExecuteQuery
Action: Execute a query against the target URL
No. | Attribute | Description / Purpose
---|---|---
1 | URL | The URL to be called |
2 | Query | Query to be executed |
Type: Script
Action: Execute the script at the specified location
No. | Attribute | Description / Purpose
---|---|---
1 | ScriptLocation | Location of the script to be executed |
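The type-specific attributes are then populated on the same record: URL and Query for an ExecuteQuery action, ScriptLocation for a Script action. A hypothetical Script example follows (the path and values are placeholders, not a prescribed layout).

```json
{
  "RecoveryAction": {
    "Sid": "RA-002",
    "RuleId": "AC-001",
    "Type": "Script",
    "Active": "true",
    "AttemptAfterCount": "1",
    "ScriptLocation": "E:/Atomiton/scripts/restart-engine.bat"
  }
}
```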
Alert Model
No. | Attribute | Description / Purpose
---|---|---
1 | Id | Unique Identifier for Alert |
2 | AlarmId | Id of the Alarm whose failure caused the alert to be raised
3 | ClusterId | Id of the cluster/daemon on which the Alert was created (present only on the management backend; absent on the daemon)
4 | InstanceName | Name of the node for which the Alert was raised
5 | RaisedDate | The timestamp at which the alert was raised
6 | Cause | Reason for the failure
7 | DetailedMessage | Detailed message if there is one. |
8 | Level | Level of the error (Error/ Fatal) |
9 | HasRead | Indicates whether the Alert is new or has been read. |
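A hypothetical Alert as stored on the management backend (so ClusterId is present) might look like the sketch below; the field names follow the table, while the JSON shape and values are illustrative assumptions.

```json
{
  "Alert": {
    "Id": "AL-1001",
    "AlarmId": "AC-001",
    "ClusterId": "Cluster-1",
    "InstanceName": "Instance-1",
    "RaisedDate": "2024-01-01T10:05:00Z",
    "Cause": "HTTP endpoint did not respond within the timeout",
    "DetailedMessage": "GET https://api.example.com:443/fid-HealthFacet timed out after 5000 ms",
    "Level": "Error",
    "HasRead": "false"
  }
}
```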
Implementation
Daemon Configuration
The following steps will be performed on the daemon:
- Rules are added by the user through the API or CSV file upload
- A job is executed every minute that does the following:
  - Picks up an AlarmCondition
  - Executes the condition
  - On Success
    - Updates the LastRun with the current time and LastStatus to 'Success'
  - On Failure
    - Updates the LastRun with the current time and LastStatus to 'Failure'
    - Generates an alert with Condition, ClusterId, InstanceName, and Error Reason
    - If a Recovery Action is present, tries to execute it; the recovery could be:
      - Restarting the failing component
      - Restarting the A-Stack engine on the failing node
- Finds new alerts (newer than the last run) and sends email notifications for them
Alarm State Transition Diagram
Command Line API
Load Alarm Conditions
tql -monitoringconfig -load Path=E:/Atomiton/Builds/TQLEngineConfigurator/resources/atomiton/configurator/spaces/AlarmConditions.csv
Delete Alarm Conditions
tql -monitoringconfig -delete ClusterId=Cluster-1,Name=Test_Endpoint_Federation
tql -monitoringconfig -delete ClusterId=Cluster-1,Instance=Instance-1
Get Alarm Conditions
tql -monitoringconfig -get ClusterId=Cluster-1,Instance=Instance-1,Name=Test_Endpoint_Federation,Path=E:\Atomiton\Downloads\Result.json
tql -monitoringconfig -get ClusterId=Cluster-1,Instance=Instance-1,Path=E:\Atomiton\Downloads\Result.json
Update Alarm Conditions
tql -monitoringconfig -update ClusterId=Cluster-1,Path=E:\Atomiton\Downloads\Result.json
Update Email Configuration
tql -monitoringconfig -config Path=E:\Atomiton\Builds\TQLEngineConfiguratorNode1\resources\atomiton\configurator\spaces\EmailConfig.json,Type=setemail,ClusterId=Cluster-1
Sample Configuration:
{ "NotificationConfig":{ "Host":"HOST_NAME", "Port":"PORT_NUMBER", "Username":"UNAME", "Password":"PWD", "From":"from@domain.com", "To":"recepient@domain.com,recepient2@domain1.com", "Subject":"Alert Generated Notifications" } }
Get Email Configuration
tql -monitoringconfig -config Type=getemail,ClusterId=Cluster-1
Update Schedule Configuration
tql -monitoringconfig -config Path=E:\Atomiton\Builds\TQLEngineConfiguratorNode1\resources\atomiton\configurator\spaces\SysConfig.json,Type=setschedule,ClusterId=Cluster-1
{ "SysConfig":[ {"EMAIL_FREQ":"1min"}, {"FACET_FREQ":"1min"}, {"HTTP_FREQ":"1min"}, {"SEQ_FREQ":"1min"}, {"LOG_FREQ":"1min"}, {"INFO_FREQ":"1min"}, {"WS_FREQ":"1min"}, {"ALERT_PURGE_FREQ":"1min"}, {"ALERT_PURGE_LIMIT_DAYS":"25"} ] }
Get Schedule Configuration
tql -monitoringconfig -config Type=getschedule,ClusterId=Cluster-1
Stop Monitoring
tql -monitoringconfig -stopmonitoring ClusterId=Cluster-1
Start Monitoring
tql -monitoringconfig -startmonitoring ClusterId=Cluster-1
Update purge configuration for management dashboard
tql -monitoringconfig -alertpurgedashboard Purgefreq=1min,Purgelimit=20
Management Dashboard
A scheduled job runs on the dashboard back-end that periodically (every minute) pulls the AlarmConditions, Alerts, and sent Notifications, and stores them locally for display on the UI.
FAQs
Common
Q: Monitoring Page you had shared earlier was running on port 9000. Is this port configurable?
A: No, the port cannot be changed from 9000 to any other port.
Q: What are the possible states of the alert? Is there an Alert Lifecycle?
A: An alert is just a notification of a failure; it only has 'New' and 'Read' states. This is set using the 'HasRead' flag in the Alert model.
Q: How does an alert correspond to email notifications? Where is this configured?
A: Emails consolidate alerts and are sent on a fixed, configurable schedule. Each email includes only the alerts generated since the last email.
For Fatal messages, an email is sent instantly.
Configuration CLI command:
tql -dashboard -configure <Config file Path>
This is the design level config file:
{ "DashBoardConfig":{ "NotificationConfig":{ "EmailTo":"abc@xyz.com", "Frequency":"1min" }, "ExecutionConfig":[ { "ClusterId":"Cluster-1", "ExecConfig":{ "FacetExecFrequency":"30sec", "HttpExecFrequency":"1min" } }, { "ClusterId":"Cluster-2", "ExecConfig":{ "FacetExecFrequency":"45sec", "HttpExecFrequency":"2min" } } ] } }
Q: How are we ensuring that we are not sending too many emails when an alert condition arises (e.g. are we going to have one email per minute)?
A: Emails consolidate alerts and are sent on a fixed, configurable schedule ("NotificationConfig"."Frequency" in the JSON above). Each email includes only the alerts generated since the last email.
For Fatal messages, an email is sent instantly.
Q: How is the pause/unpause of the monitoring/alerting aspect going to be handled? This is in relation to the query sent earlier on maintenance windows.
A: This will be a CLI option. On the daemons:
tql -cluster -monitoring -start
tql -cluster -monitoring -stop
This basically stops the scheduled jobs that execute the alarm conditions.