Cluster Monitoring Model Description

AlarmCondition Model


1. Sid: Unique ID
2. Name: Name of the rule
3. Description: Description and purpose of the rule
4. Enabled: Identifies whether the rule must be executed (values: true/false)
5. LastRun: Stores the last time the rule was executed
6. LastStatus: Stores the result of the last run (values: Success/ERRCON1/ERRCON2...). In case of error, the time between subsequent executions can be increased and the rule disabled with a fatal message at that point. When multiple conditions fail, the value will be the most severe failure.
7. CheckCondition: 1..* conditions to be executed for this rule (e.g. a facet presence check and an HTTP connection check for a task group)
8. isRecoverable: Indicates whether the rule has a recovery action
9. RecoveryAction: The recovery action to execute
10. Type: The type of condition to be tested
11. ComponentName: A logical grouping of alarm conditions based on the target application's needs
12. AlertCount: Count of successive alerts raised for this condition (reset when the condition succeeds)
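
A purely illustrative AlarmCondition record using the attribute names above (all values are hypothetical and the actual serialization may differ):

{
   "AlarmCondition":{
      "Sid":"AC-001",
      "Name":"Test_Endpoint_Federation",
      "Description":"Checks that the federation endpoint is reachable",
      "Enabled":"true",
      "LastRun":"2017-01-01T10:00:00Z",
      "LastStatus":"Success",
      "Type":"HTTP",
      "ComponentName":"Federation",
      "isRecoverable":"true",
      "AlertCount":"0"
   }
}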

ConditionDef


1. Sid: Unique ID
2. Protocol: http or https
3. URLHost: Hostname or IP address
4. URLPort: Port number
5. URLEndpoint: Target endpoint (e.g. fid-TopicFacet)
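
A hypothetical example of the base ConditionDef attributes (host, port, and endpoint values are invented for illustration):

{
   "ConditionDef":{
      "Sid":"CD-001",
      "Protocol":"http",
      "URLHost":"192.168.1.10",
      "URLPort":"8080",
      "URLEndpoint":"fid-TopicFacet"
   }
}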

Types and attributes:

The ConditionDef carries all of the attributes listed below, but only the attributes relevant to the Condition's type are populated (an illustrative example follows the type definitions below).

Type: Facet

Test: Check if the Facet is running


1. FacetName: Name of the facet to be checked
2. URL: The URL to be called

Type: HTTP

Test: Check if the HTTP endpoint is UP


1. URL: The URL to be called
2. Query: The query to be executed
3. TimeOut: Auto connection timeout for the request
4. Headers: 0..* headers as key/value pairs (optional)
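
A hypothetical HTTP condition, showing how the type-specific attributes might be populated on top of the base ConditionDef (URL, query, timeout, and header values are illustrative only):

{
   "ConditionDef":{
      "Sid":"CD-002",
      "Type":"HTTP",
      "URL":"http://192.168.1.10:8080/fid-TopicFacet",
      "Query":"...",
      "TimeOut":"30sec",
      "Headers":{
         "Content-Type":"application/xml"
      }
   }
}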

Type: WS

Test: Check if the websocket endpoint is UP


1. URL: The URL to be called
2. Query: The query to be executed
3. TimeOut: Auto connection timeout for the request

Type: Sequence

Test: Check if the sequence is executing


1. URL: The URL to be called
2. SequenceName: The sequence to be tested

Type: Log

Test: Check if the specified message appears in the log


1. Message: Message to be checked
2. Source: Source of the message
3. Level: Level of the message
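
To illustrate how the populated attributes depend on the Condition type, here are two hypothetical ConditionDef instances, one of type Facet and one of type Log (all values invented for illustration):

{
   "ConditionDef":{
      "Sid":"CD-003",
      "Type":"Facet",
      "FacetName":"TopicFacet",
      "URL":"http://192.168.1.10:8080/fid-TopicFacet"
   }
}

{
   "ConditionDef":{
      "Sid":"CD-004",
      "Type":"Log",
      "Message":"OutOfMemoryError",
      "Source":"EngineLog",
      "Level":"Fatal"
   }
}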

RecoveryAction Model


1. Sid: Unique ID
2. RuleId: Identifies the rule this condition belongs to
3. Type: The type of recovery action to be taken
4. Active: (true/false) A recovery action becomes active when the condition succeeds and inactive when the condition fails. This prevents the recovery from running before the condition is up and running, and also prevents multiple recovery attempts, especially when a previous recovery is still in progress.
5. AttemptAfterCount: The number of alerts after which recovery must be attempted. If the value is 1, recovery is attempted immediately.

Type: ExecuteQuery

Action: Execute the specified query against the target URL


1. URL: The URL to be called
2. Query: Query to be executed

Type: Script

Action: Execute the script at the specified location


1. ScriptLocation: Location of the script to be executed
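
A hypothetical RecoveryAction of type ExecuteQuery, combining the model attributes above with the type-specific ones (all values illustrative); a Script-type action would instead populate ScriptLocation:

{
   "RecoveryAction":{
      "Sid":"RA-001",
      "RuleId":"AC-001",
      "Type":"ExecuteQuery",
      "Active":"true",
      "AttemptAfterCount":"3",
      "URL":"http://192.168.1.10:8080/fid-TopicFacet",
      "Query":"..."
   }
}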

Alert Model


1. Id: Unique identifier for the alert
2. AlarmId: Id of the alarm whose failure caused the alert to be raised
3. ClusterId: Id of the daemon on which the alert was created (only on the management backend; absent on the daemon)
4. InstanceName: Name of the node for which the alert was raised
5. RaisedDate: The timestamp of the alert
6. Cause: Reason for the failure
7. DetailedMessage: Detailed message, if there is one
8. Level: Level of the error (Error/Fatal)
9. HasRead: Indicates whether the alert is new or has been read
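
An illustrative Alert instance (all values are hypothetical; ClusterId and InstanceName follow the naming used in the CLI examples below):

{
   "Alert":{
      "Id":"AL-1001",
      "AlarmId":"AC-001",
      "ClusterId":"Cluster-1",
      "InstanceName":"Instance-1",
      "RaisedDate":"2017-01-01T10:05:00Z",
      "Cause":"HTTP endpoint did not respond",
      "DetailedMessage":"Connection timed out after 30sec",
      "Level":"Error",
      "HasRead":"false"
   }
}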

Implementation

Daemon Configuration

The following steps will be performed on the daemon:

  1. Rules are added by the user through the API or a CSV file upload
  2. A job is executed every minute that does the following:
    1. Picks up an AlarmCondition
    2. Executes the condition
    3. On success
      1. Updates LastRun with the current time and LastStatus to 'Success'
    4. On failure
      1. Updates LastRun with the current time and LastStatus to 'Failure'
      2. Generates an alert with the Condition, ClusterId, InstanceName, and error reason
      3. If a recovery action is present, tries to execute it; the recovery could be to:
        1. Try to restart the failing component
        2. Try to restart the A-Stack engine on the failing node
  3. Finds new alerts (newer than the last run) and sends an email notification for each alert

Alarm State Transition Diagram

Command Line API

Load Alarm Conditions

tql -monitoringconfig -load Path=E:/Atomiton/Builds/TQLEngineConfigurator/resources/atomiton/configurator/spaces/AlarmConditions.csv

Delete Alarm Conditions

tql -monitoringconfig -delete ClusterId=Cluster-1,Name=Test_Endpoint_Federation
tql -monitoringconfig -delete ClusterId=Cluster-1,Instance=Instance-1

Get Alarm Conditions

tql -monitoringconfig -get ClusterId=Cluster-1,Instance=Instance-1,Name=Test_Endpoint_Federation,Path=E:\Atomiton\Downloads\Result.json
tql -monitoringconfig -get ClusterId=Cluster-1,Instance=Instance-1,Path=E:\Atomiton\Downloads\Result.json

Update Alarm Conditions

tql -monitoringconfig -update ClusterId=Cluster-1,Path=E:\Atomiton\Downloads\Result.json

Update Email Configuration

tql -monitoringconfig -config Path=E:\Atomiton\Builds\TQLEngineConfiguratorNode1\resources\atomiton\configurator\spaces\EmailConfig.json,Type=setemail,ClusterId=Cluster-1

Sample Configuration:


{
   "NotificationConfig":{
      "Host":"HOST_NAME",
      "Port":"PORT_NUMBER",
      "Username":"UNAME",
      "Password":"PWD",
      "From":"from@domain.com",
      "To":"recipient@domain.com,recipient2@domain1.com",
      "Subject":"Alert Generated Notifications"
   }
}


Get Email Configuration

tql -monitoringconfig -config Type=getemail,ClusterId=Cluster-1

Update Schedule Configuration

tql -monitoringconfig -config Path=E:\Atomiton\Builds\TQLEngineConfiguratorNode1\resources\atomiton\configurator\spaces\SysConfig.json,Type=setschedule,ClusterId=Cluster-1


{  
   "SysConfig":[
		{"EMAIL_FREQ":"1min"},
		{"FACET_FREQ":"1min"},
		{"HTTP_FREQ":"1min"},
		{"SEQ_FREQ":"1min"},
		{"LOG_FREQ":"1min"},
		{"INFO_FREQ":"1min"},
		{"WS_FREQ":"1min"},
		{"ALERT_PURGE_FREQ":"1min"},
		{"ALERT_PURGE_LIMIT_DAYS":"25"}
   ]
}


Get Schedule Configuration

tql -monitoringconfig -config Type=getschedule,ClusterId=Cluster-1

Stop Monitoring

tql -monitoringconfig -stopmonitoring ClusterId=Cluster-1

Start Monitoring

tql -monitoringconfig -startmonitoring ClusterId=Cluster-1

Update Purge Configuration for the Management Dashboard

tql -monitoringconfig -alertpurgedashboard Purgefreq=1min,Purgelimit=20

Management Dashboard

A scheduled job will run on the dashboard back-end that periodically (every minute) pulls the AlarmConditions, Alerts, and Notifications sent, and stores them locally for display on the UI.


FAQs

Common


Q: The Monitoring Page shared earlier was running on port 9000. Is this port configurable?

A: No, the port cannot be changed from 9000 to any other port.


Q: What are possible states of the alert – is there an Alert Lifecycle?

A: An alert is just a notification of a failure; it has only 'New' and 'Read' states. This is set using the 'HasRead' flag in the Alert model.


Q: How does an alert correspond to email notifications? Where is this configured?

A: Emails consolidate Alerts and are sent on a fixed, configurable schedule. Each email includes only the Alerts generated since the last email.

For Fatal messages, an email will be sent instantly.

Configuration CLI command:

tql -dashboard -configure <Config file Path>

This is the design-level config file:

{  
   "DashBoardConfig":{  
      "NotificationConfig":{  
         "EmailTo":"abc@xyz.com",
         "Frequency":"1min"
      },
      "ExecutionConfig":[  
         {  
            "ClusterId":"Cluster-1",
            "ExecConfig":{  
               "FacetExecFrequency":"30sec",
               "HttpExecFrequency":"1min"
            }
         },
         {  
            "ClusterId":"Cluster-2",
            "ExecConfig":{  
               "FacetExecFrequency":"45sec",
               "HttpExecFrequency":"2min"
            }
         }
      ]
   }
}



Q: How are we ensuring that we are not sending too many emails when an alert condition arises (e.g. are we going to have one email per minute)?

A: Emails consolidate Alerts and are sent on a fixed, configurable schedule ("NotificationConfig"."Frequency" in the JSON above). Each email includes only the Alerts generated since the last email.

For Fatal messages, an email will be sent instantly.


Q: How is the pause/unpause of the monitoring/alerting aspect going to be handled? This is in relation to the query sent earlier on maintenance windows.

A: This will be a CLI option. On the daemons:


tql -cluster -monitoring -start

tql -cluster -monitoring -stop


This basically stops the scheduled jobs that execute the alarms.
