Cluster Monitoring Model Description

AlarmCondition Model

	Attribute	Description/ Purpose
1	Sid	Unique ID
2	Name	Name of the Rule
3	Description	Description and purpose of Rule
4	Enabled	Identifies if a rule must be executed (Values: true/false)
5	LastRun	Stores the last time the rule was executed
6	LastStatus	Stores the result of the last run (Values: Success/ ERRCON1/ ERRCON2 ...). If there are multiple conditions, the value is the most severe failure. On error, the time between subsequent executions can be increased, and the Rule can be disabled with a fatal message at that point.
7	CheckCondition	1..* Conditions to be executed for this rule (e.g. a facet presence check and an HTTP connection check for a task group)
8	isRecoverable	Indicates if the rule has a recovery action
9	RecoveryAction	Recovery Action
10	Type	The type of condition to be tested
11	ComponentName	This is a logical grouping of Alarm conditions based on the target application needs
12	AlertCount	Count of successive Alerts raised for this Condition (Reset when Condition is successful)
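
As a rough illustration, the attributes above could be mirrored in a Python data class. This is only a sketch: the snake_case field names, the types, the defaults, and the record_result helper are assumptions made for the example, not the product's actual schema. It also assumes AlertCount goes up by one on every failed run.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class AlarmCondition:
    """Illustrative mirror of the AlarmCondition table (types are assumed)."""
    sid: str
    name: str
    type: str                            # type of condition to be tested
    component_name: str                  # logical grouping of alarm conditions
    description: str = ""
    enabled: bool = True
    last_run: Optional[datetime] = None
    last_status: Optional[str] = None    # "Success" or an error code
    check_conditions: List[dict] = field(default_factory=list)  # 1..* conditions
    is_recoverable: bool = False
    recovery_action: Optional[dict] = None
    alert_count: int = 0                 # successive alerts, reset on success

    def record_result(self, success: bool, status: str, now: datetime) -> None:
        """Store the outcome of a run; the alert count resets on success."""
        self.last_run = now
        self.last_status = status
        self.alert_count = 0 if success else self.alert_count + 1
```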

ConditionDef

	Attribute	Description/ Purpose
1	Sid	Unique Id
2	Protocol	http or https
3	URLHost	HostName or IP Address
4	URLPort	Port Number
5	URLEndpoint	Target endpoint (e.g. fid-TopicFacet)
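
The four URL attributes presumably combine into the address that a check calls. The helper below is a minimal sketch of that composition. The function name build_url and the exact joining rule (a single slash before the endpoint) are assumptions, as are the host and port in the usage line.

```python
def build_url(protocol: str, host: str, port: int, endpoint: str = "") -> str:
    """Join the ConditionDef URL fields into one URL string."""
    if protocol not in ("http", "https"):
        raise ValueError(f"unsupported protocol: {protocol!r}")
    # Avoid a double slash if the endpoint already starts with one.
    path = endpoint.lstrip("/")
    return f"{protocol}://{host}:{port}/{path}"
```

For example, build_url("http", "10.0.0.5", 8080, "fid-TopicFacet") returns "http://10.0.0.5:8080/fid-TopicFacet".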

Types and attributes:

The ConditionDef will have each of these attributes but will be populated based on the type of the Condition

Type: Facet

Test: Check if the Facet is running

	Attribute	Description/ Purpose
1	FacetName	Name of the facet to be checked
2	URL	The URL to be called

Type: HTTP

Test: Check if the HTTP endpoint is UP

	Attribute	Description/ Purpose
1	URL	The URL to be called
2	Query	The query to be executed
3	TimeOut	AutoConnectionTimeout for the request
4	Headers?	0..* header as key value pairs
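
A check of this kind could look roughly like the sketch below, which uses only the Python standard library. It is not the daemon's implementation. It assumes that a 2xx response means the endpoint is UP, that the query is sent as the request body, and that the timeout is in seconds. The opener parameter is a hypothetical hook that makes the function testable without a network.

```python
import urllib.error
import urllib.request


def http_endpoint_is_up(url, query=None, timeout=5.0, headers=None,
                        opener=urllib.request.urlopen):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    data = query.encode("utf-8") if query is not None else None
    request = urllib.request.Request(url, data=data, headers=headers or {})
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```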

Type: WS

Test: Check if the websocket endpoint is UP

	Attribute	Description/ Purpose
1	URL	The URL to be called
2	Query	The query to be executed
3	TimeOut	AutoConnectionTimeout for the request

Type: Sequence

Test: Check if the sequence is executing

	Attribute	Description/ Purpose
1	URL	The URL to be called
2	SequenceName	The sequence to be tested

Type: Log

Test: Check the log files for error and fatal messages

	Attribute	Description/ Purpose
1	Message	Message to be checked
2	Source	Source of the message
3	Level	Level of the message

RecoveryActionModel

	Attribute	Description/ Purpose
1	Sid	Unique Id
2	RuleId	Identifies the Rule this recovery action belongs to
3	Type	The type of recovery action to be taken
4	Active	(True/False) A recovery action becomes active when the Condition is successful and inactive when the Condition fails. This prevents the recovery from running before the condition is up and running. It also prevents multiple recovery attempts, especially while a previous recovery is still in progress.
5	AttemptAfterCount	The number of alerts after which recovery must be attempted. If value is 1, recovery will be attempted instantly.
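
The interplay of Active and AttemptAfterCount can be sketched as a small state holder. This is one possible reading of the table, not a confirmed algorithm. In particular, deactivating the action at the moment recovery is attempted is an assumption, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    attempt_after_count: int
    active: bool = False
    alert_count: int = 0

    def on_success(self) -> None:
        """Condition passed: reset the counter and arm the recovery action."""
        self.alert_count = 0
        self.active = True

    def on_failure(self) -> bool:
        """Condition failed: return True if recovery should be attempted now."""
        self.alert_count += 1
        if self.active and self.alert_count >= self.attempt_after_count:
            # Disarm so a second attempt is not started while one is in progress.
            self.active = False
            return True
        return False
```

With attempt_after_count set to 1, the first failure after a success triggers recovery, which matches the "attempted instantly" case in the table.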

Type: ExecuteQuery

Action: Execute a query on an endpoint in an attempt to recover

	Attribute	Description/ Purpose
1	URL	The URL to be called
2	Query	Query to be executed

Type: Script

Action: Execute a script in an attempt to recover

	Attribute	Description/ Purpose
1	ScriptLocation	Location of the script to be executed

Alert Model

	Attribute	Description/ Purpose
1	Id	Unique Identifier for Alert
2	AlarmId	Id of the Alarm that failed and caused the alert to be raised
3	ClusterId	Id of the daemon on which the Alert was created (Only on the Mgmt backend, absent on the daemon)
4	InstanceName	Name of the node for which the Alert was raised
5	RaisedDate	The timestamp when the alert was raised
6	Cause	Reason for the failure
7	DetailedMessage	Detailed message if there is one.
8	Level	Level of the error (Error/ Fatal)
9	HasRead	Indicates whether the Alert is new or has been read.

Implementation

Daemon Configuration

The following steps will be performed on the daemon:

Rules are added by the user through the API or by CSV file upload.
A job is executed every minute that does the following:
1. Picks up an AlarmCondition
2. Executes the condition
3. On Success
  1. Updates the LastRun with current time and LastStatus to 'Success'
4. On Failure
  1. Updates the LastRun with current time and LastStatus to 'Failure'
  2. Generates an alert with Condition, ClusterId, InstanceName, and Error Reason
  3. If a Recovery Action is present, tries to execute it. The recovery could be:
    1. Try to restart the failing component
    2. Try to restart the A-Stack engine on the failing node
Finds new alerts (newer than the last run) and sends an email notification for them.
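
The steps above can be sketched as a single function that runs one AlarmCondition. It is a simplified illustration: the check and recover callables are hypothetical hooks, the condition is a plain dictionary, and the sketch does not gate recovery on AttemptAfterCount.

```python
from datetime import datetime, timezone


def run_alarm_condition(condition, check, alerts, cluster_id, instance_name,
                        recover=None, now=None):
    """Run one AlarmCondition once and record the outcome.

    `check` returns (ok, reason); `recover` is an optional callable.
    """
    now = now or datetime.now(timezone.utc)
    try:
        ok, reason = check(condition)
    except Exception as exc:  # a crashing check is treated as a failure
        ok, reason = False, str(exc)

    condition["LastRun"] = now
    condition["LastStatus"] = "Success" if ok else "Failure"
    if ok:
        return True

    # On failure: raise an alert, then try the recovery action if present.
    alerts.append({
        "AlarmId": condition["Sid"],
        "ClusterId": cluster_id,
        "InstanceName": instance_name,
        "RaisedDate": now,
        "Cause": reason,
    })
    if condition.get("IsRecoverable") and recover is not None:
        recover(condition)
    return False
```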

Alarm State Transition Diagram

Provisioning Alarms Using CSV File

Column Name	Mandatory	Description
ClusterId	Yes	This column defines the cluster the AlarmCondition belongs to; the AlarmCondition is deployed on the corresponding daemon. It is the only positional field and must always be the first column. All other columns can be in any order, as long as the column names are correct. An alarm with a non-existent ClusterId will be ignored.
InstanceName	Yes	Defines the instance this AlarmCondition belongs to in the Cluster.
ComponentName	Yes	This is a logical grouping of AlarmConditions within an Instance. It can be any name that is meaningful to the application. All AlarmConditions with the same component will be grouped together on the dashboard.
Name	Yes	This is a meaningful name for an AlarmCondition. It must be unique within an instance in a Cluster. A duplicate entry will simply be ignored.
Description	No	This is a meaningful description for an AlarmCondition.
Enabled	Yes	This is a boolean field that indicates if this AlarmCondition is enabled. Only Conditions that are enabled will be executed to check for success or failures. An alarm condition can be enabled or disabled using the update AlarmCondition mechanism.
Type	Yes	This column defines the type of condition being tested. The currently supported types (all case sensitive) are "Facet", "HTTP", "Sequence", "WS", "Log" and "Info". Facet: tests if a facet is present and active. HTTP: tests if an endpoint is up and responding. Sequence: tests if a scheduled job is present. WS: tests if a websocket is up and responsive. Log: monitors the log files for error and fatal conditions. Info: monitors critical parameters like NullChannel and FreeChannel.
CheckCondition.Protocol	Yes	The protocol for the test.
CheckCondition.URLHost	Yes	The IP Address of the host to be tested.
CheckCondition.URLPort	Yes	The port on which the service is running.
CheckCondition.URLEndpoint	Conditional	The endpoint at which the test is to be performed. Mandatory for types: HTTP, Sequence, Log, WS.
CheckCondition.FacetName	Conditional	The Facet whose presence is to be tested. Mandatory for type: Facet.
CheckCondition.SequenceName	Conditional	The scheduled job whose presence is to be tested. Mandatory for type: Sequence.
CheckCondition.Query	Conditional	The query that must be executed on the endpoint. Mandatory for type: HTTP.
CheckCondition.Timeout	Conditional	The amount of time the query will wait for the server to respond before declaring a failure. Mandatory for types: Facet, HTTP, Sequence, Info, WS.
CheckCondition.HeadersData	No	Any header information that is required for a query to execute successfully. Applicable to HTTP type.
IsRecoverable	Yes	This boolean field defines if the failure of an AlarmCondition will trigger a recovery action.
RecoveryAction.Active	Yes	When a recovery action is present, this field defines whether the recovery action is active when it is created. If it is set to false, the system will change it to true once the AlarmCondition is active.
RecoveryAction.Type	Conditional	Mandatory if the AlarmCondition is recoverable. There are two types of recovery actions, "HTTP" and "RESTART". HTTP: tries to execute a query on an endpoint in an attempt to recover. RESTART: restarts the target application's A-Stack engine in an attempt to recover.
RecoveryAction.AttemptAfterCount	Conditional	Mandatory if the AlarmCondition is recoverable. Number of failures after which recovery must be attempted. (Even for multiple failures, Alerts will only be raised on the first transition from "Success" to "Failure".)
RecoveryAction.Protocol	Conditional	The protocol for the recovery. Mandatory if recovery type is HTTP.
RecoveryAction.URLHost	Conditional	The IP Address of the host to be recovered. Mandatory if recovery type is HTTP.
RecoveryAction.URLPort	Conditional	The port on which the service is running. Mandatory if recovery type is HTTP.
RecoveryAction.URLEndpoint	Conditional	The endpoint at which the recovery is to be performed. Mandatory if recovery type is HTTP.
RecoveryAction.Query	Conditional	Query to be executed at the endpoint. Mandatory if recovery type is HTTP.
RecoveryAction.Timeout	Conditional	Amount of time the system waits for the server to respond. Mandatory if recovery type is HTTP.
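
Before loading a file, the column rules in the table can be checked with a short script. The sketch below assumes a comma-delimited file with a header row and covers only the CheckCondition rules, not the RecoveryAction ones. The function name validate_alarm_csv and the message wording are made up for this example.

```python
import csv
import io

ALWAYS_REQUIRED = [
    "ClusterId", "InstanceName", "ComponentName", "Name", "Enabled", "Type",
    "CheckCondition.Protocol", "CheckCondition.URLHost",
    "CheckCondition.URLPort", "IsRecoverable", "RecoveryAction.Active",
]
SUPPORTED_TYPES = {"Facet", "HTTP", "Sequence", "WS", "Log", "Info"}
# Column -> condition types for which the column is mandatory.
REQUIRED_BY_TYPE = {
    "CheckCondition.URLEndpoint": {"HTTP", "Sequence", "Log", "WS"},
    "CheckCondition.FacetName": {"Facet"},
    "CheckCondition.SequenceName": {"Sequence"},
    "CheckCondition.Query": {"HTTP"},
    "CheckCondition.Timeout": {"Facet", "HTTP", "Sequence", "Info", "WS"},
}


def validate_alarm_csv(text):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    if not reader.fieldnames or reader.fieldnames[0] != "ClusterId":
        return ["ClusterId must be the first column"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        def empty(column):
            return not (row.get(column) or "").strip()
        row_type = (row.get("Type") or "").strip()
        if row_type and row_type not in SUPPORTED_TYPES:
            problems.append(f"line {line_no}: unsupported Type {row_type!r}")
        missing = [c for c in ALWAYS_REQUIRED if empty(c)]
        missing += [c for c, types in REQUIRED_BY_TYPE.items()
                    if row_type in types and empty(c)]
        if missing:
            problems.append(f"line {line_no}: missing {', '.join(missing)}")
    return problems
```

An empty list means every row satisfies the mandatory and conditional columns listed in the table. Whether the loader itself accepts the file is a separate matter.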

Command Line API

Load Alarm Conditions

This Command loads the AlarmConditions into the monitoring system.

Command: tql -monitoringconfig -load Path=<Path of the File>/AlarmConditions.csv

Delete Alarm Conditions

This command lets you delete one or all AlarmConditions from the system based on the inputs provided.

Command: tql -monitoringconfig -delete ClusterId=Cluster-1,Instance=Instance-1,Name=Test_Endpoint_Facet

Deletes the AlarmCondition called Test_Endpoint_Facet from instance Instance-1 on cluster Cluster-1

Command: tql -monitoringconfig -delete ClusterId=Cluster-1,Instance=Instance-1

Deletes all AlarmConditions from instance Instance-1 on cluster Cluster-1

Get Alarm Conditions

This command gets AlarmConditions from the system based on the inputs provided and writes them to the file given in the Path variable.

Command: tql -monitoringconfig -get ClusterId=Cluster-1,Instance=Instance-1,Name=Test_Endpoint_Facet,Path=<File Path>\Result.json

Gets the AlarmCondition called Test_Endpoint_Facet from instance Instance-1 on cluster Cluster-1

Command: tql -monitoringconfig -get ClusterId=Cluster-1,Instance=Instance-1,Path=<File Path>\Result.json

Gets all AlarmConditions from instance Instance-1 on cluster Cluster-1
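
When these commands are scripted, the comma-separated key=value argument can be assembled programmatically. The helper below is a hypothetical convenience wrapper that only builds the argument list in the form shown above. It does not run tql, and the name monitoring_command is an assumption.

```python
def monitoring_command(action, **params):
    """Build the argv list for a `tql -monitoringconfig -<action>` call."""
    if not params:
        raise ValueError("at least one key=value parameter is required")
    # Parameters are joined without spaces, as in the documented examples.
    joined = ",".join(f"{key}={value}" for key, value in params.items())
    return ["tql", "-monitoringconfig", f"-{action}", joined]
```

The resulting list can be passed to subprocess.run on a machine where tql is installed.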

Update Alarm Conditions

Updates the values of an AlarmCondition. First run the get command to retrieve the AlarmCondition, make the necessary changes, and then provide the path of the edited file in the "Path" variable.

Command: tql -monitoringconfig -update ClusterId=Cluster-1,Path=<File Path>\Result.json

Update Email Configuration

Updates the Email configuration. ClusterId is optional; if it is not specified, the update is applied to all clusters.

Command: tql -monitoringconfig -config Path=<File Path>\EmailConfig.json,Type=setemail,ClusterId=Cluster-1

Updates configuration on cluster Cluster-1

Command: tql -monitoringconfig -config Path=<File Path>\EmailConfig.json,Type=setemail

Updates configuration on all clusters

Sample Configuration:

{
   "NotificationConfig":{
      "Host":"HOST_NAME",
      "Port":"PORT_NUMBER",
      "Username":"UNAME",
      "Password":"PWD",
      "From":"from@domain.com",
      "To":"recipient@domain.com,recipient2@domain1.com",
      "Subject":"Alert Generated Notifications"
   }
}
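
The "To" value in the sample holds several addresses separated by commas. A consumer of this file might split it as in the sketch below. The function name is hypothetical, and the sketch assumes the file contains exactly the keys shown in the sample.

```python
import json


def parse_notification_config(text):
    """Load the NotificationConfig block and split "To" into a list."""
    config = json.loads(text)["NotificationConfig"]
    recipients = [addr.strip() for addr in config["To"].split(",") if addr.strip()]
    return {**config, "To": recipients}
```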

Get Email Configuration

Gets the Email configuration and prints it on the screen. ClusterId is mandatory.

Command: tql -monitoringconfig -config Type=getemail,ClusterId=Cluster-1

Update Schedule Configuration

Updates the schedule configuration. ClusterId is optional; if it is not specified, the update is applied to all clusters.

Command: tql -monitoringconfig -config Path=E:\Atomiton\Builds\TQLEngineConfiguratorNode1\resources\atomiton\configurator\spaces\SysConfig.json,Type=setschedule,ClusterId=Cluster-1

{
   "SysConfig":[
      {"EMAIL_FREQ":"1min"},
      {"FACET_FREQ":"1min"},
      {"HTTP_FREQ":"1min"},
      {"SEQ_FREQ":"1min"},
      {"LOG_FREQ":"1min"},
      {"INFO_FREQ":"1min"},
      {"WS_FREQ":"1min"},
      {"ALERT_PURGE_FREQ":"1min"},
      {"ALERT_PURGE_LIMIT_DAYS":"25"}
   ]
}

FACET_FREQ: Frequency with which Facet type AlarmConditions are executed.
HTTP_FREQ: Frequency with which HTTP type AlarmConditions are executed.
SEQ_FREQ: Frequency with which Sequence type AlarmConditions are executed.
LOG_FREQ: Frequency with which Log type AlarmConditions are executed.
INFO_FREQ: Frequency with which Info type AlarmConditions are executed.
WS_FREQ: Frequency with which WS type AlarmConditions are executed.
ALERT_PURGE_FREQ: Frequency with which stale alerts are purged.
ALERT_PURGE_LIMIT_DAYS: Age in days after which an alert is considered stale.
EMAIL_FREQ: Frequency with which Alert emails are sent.
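
The frequency values are strings such as "1min". The sketch below shows one way a script could turn the schedule configuration into numbers. It assumes that only the "min" and "sec" suffixes exist, since those are the only units that appear in this document, and the function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"sec": 1, "min": 60}


def frequency_to_seconds(value):
    """Convert a string like "1min" or "30sec" to a number of seconds."""
    match = re.fullmatch(r"(\d+)(sec|min)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized frequency: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def load_schedule(config):
    """Flatten the SysConfig list; frequencies become seconds, the limit an int."""
    merged = {}
    for entry in config["SysConfig"]:
        merged.update(entry)
    return {
        key: int(value) if key == "ALERT_PURGE_LIMIT_DAYS" else frequency_to_seconds(value)
        for key, value in merged.items()
    }
```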

Get Schedule Configuration

Gets the schedule configuration and prints it on the screen. ClusterId is mandatory.

Command: tql -monitoringconfig -config Type=getschedule,ClusterId=Cluster-1

Stop Monitoring

Pause a monitoring service. ClusterId is mandatory.

Command: tql -monitoringconfig -stopmonitoring ClusterId=Cluster-1

Start Monitoring

Resume a monitoring service. ClusterId is mandatory.

Command: tql -monitoringconfig -startmonitoring ClusterId=Cluster-1

Update purge configuration for management dashboard

Update Alert purging configuration on the Dashboard.

Command: tql -monitoringconfig -alertpurgedashboard Purgefreq=1min,Purgelimit=20

Purgefreq: Frequency with which stale alerts are purged.

Purgelimit: Age after which an alert is considered stale.

Management Dashboard

A scheduled job runs on the dashboard back-end. Periodically (every minute) it pulls the AlarmConditions, Alerts, and Notifications sent, and stores them locally for display on the UI.

FAQs

Common

Q: The Monitoring page shared earlier was running on port 9000. Is this port configurable?

A: No, the port cannot be changed from 9000.

Q: What are possible states of the alert – is there an Alert Lifecycle?

A: An alert is just a notification of a failure, so it only has ‘New’ and ‘Read’ states. This is set using the ‘HasRead’ flag in the Alert model.

Q: How do alerts correspond to email notifications? Where is this configured?

A: Alerts are consolidated and emailed on a fixed, configurable schedule. Each email only includes Alerts generated since the last email.

For Fatal messages, an email is sent instantly.

Configuration CLI command:

tql -dashboard -configure <Config file Path>

This is the design level config file:

{  
   "DashBoardConfig":{  
      "NotificationConfig":{  
         "EmailTo":"abc@xyz.com",
         "Frequency":"1min"
      },
      "ExecutionConfig":[  
         {  
            "ClusterId":"Cluster-1",
            "ExecConfig":{  
               "FacetExecFrequency":"30sec",
               "HttpExecFrequency":"1min"
            }
         },
         {  
            "ClusterId":"Cluster-2",
            "ExecConfig":{  
               "FacetExecFrequency":"45sec",
               "HttpExecFrequency":"2min"
            }
         }
      ]
   }
}

Q: How do we ensure that we do not send too many emails when an alert condition arises (e.g. one email per minute)?

A: Alerts are consolidated and emailed on a fixed schedule, configurable with "NotificationConfig"."Frequency" in the above JSON. Each email only includes Alerts generated since the last email.

For Fatal messages, an email is sent instantly.
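
The consolidation rule described in this answer can be sketched as a filter over the Alert model fields RaisedDate and Level. This is an illustration of the rule as stated, not the shipped logic, and the function name is hypothetical.

```python
from datetime import datetime


def split_alerts_for_email(alerts, last_email_at):
    """Return (send_now, digest) for alerts raised after last_email_at.

    Fatal alerts go out immediately; the rest wait for the next scheduled email.
    """
    fresh = [a for a in alerts if a["RaisedDate"] > last_email_at]
    send_now = [a for a in fresh if a["Level"] == "Fatal"]
    digest = [a for a in fresh if a["Level"] != "Fatal"]
    return send_now, digest
```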

Q: How will pausing and resuming of monitoring and alerting be handled? This relates to the earlier query on maintenance windows.

A: This will be a CLI option. On the daemons:

tql -cluster -monitoring -start

tql -cluster -monitoring -stop

This stops the scheduled jobs that execute the alarms.
