Monitoring Services

Introduction

TQLConsole lets users turn TQLEngine monitoring services on and off. As of Release 1.0.5, two types of monitoring services are provided:

a) Network Change Detection

b) TQLEngine Memory and other key parameters

The services menu is accessible from the profile dropdown link.

Network Change Detection

The purpose of the Network Change Detection service is to monitor changes to the network (IP address) of a running TQLEngine. When you deploy projects on TQLEngine, the generated TQL endpoints contain the IP address detected when the TQLEngine started. If the network changes and the underlying IP address of the host running the TQLEngine changes, project endpoints that still carry the old IP address may no longer work. The Network Change Detection service detects this change and automatically updates all project endpoints with the new IP address.

You can disable the Network Change Detection service by toggling the button on the menu.

Memory Monitoring and Recovery in A-Stack

  1. The main purpose of Memory Monitoring and Recovery is to ensure the survival of the A-Stack runtime.

  2. Two new config parameters are introduced (default values are shown):

    Monitoring Parameters
    <sff.monitor.memory.levels>0.8,0.9</sff.monitor.memory.levels>
    <sff.monitor.memory.interval>PT1S</sff.monitor.memory.interval>
    

     This means that the memory monitor establishes two memory watermarks: a low watermark at 80% (i.e. factor 0.8) and a high watermark at 90% (i.e. factor 0.9) of your largest memory pool (normally the old generation). For example, with a 2 GB old-generation pool, the low watermark falls at roughly 1.6 GB and the high watermark at 1.8 GB. The monitor also uses a 1-second interval for memory recovery monitoring. (Surprisingly, JVM support for memory monitoring is very poor, so some things can only be done in a dedicated thread that sleeps for the given time between attempts to recover some memory.)

    The 10% reserved memory (i.e. 1.0 - 0.9) was chosen based on research into the minimum amount of memory required for the GC to run.

  3. It is NOT recommended to set this level above 90%, as it may diminish the GC's ability to run efficiently, if at all. The 80% low watermark seems like a reasonable number; you can set it lower if you want a better safety margin, or slightly higher if you are adventurous and willing to take the risk. In no event can it be greater than the high watermark. You can also make both watermarks equal, but the system will then operate at only two alarm levels, 0.0 and 1.0, with nothing in between: it will go from "everything is OK" directly to "kill everyone right now". A more conservative configuration is sketched below.
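
    For instance, if you want a wider safety margin you might lower both watermarks. The values below are illustrative, not defaults:

    Conservative Monitoring Parameters (illustrative)
    <sff.monitor.memory.levels>0.7,0.85</sff.monitor.memory.levels>
    <sff.monitor.memory.interval>PT1S</sff.monitor.memory.interval>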

  4. Once the monitor detects that the low watermark has been crossed (i.e. your memory consumption goes above 80%), it will first nicely suggest that the JVM run GC and then recheck the memory level without raising the alarm. If you are lucky, GC will collect all the garbage and you will see a "Memory recovery" message in the log. No harm done to your application and no hard feelings. This is the happy path.

  5. Should your memory consumption remain above the low watermark, bad things start to happen. The monitor raises a memory alarm, expressed as a floating-point number between 0.0 and 1.0. The alarm value is linearly interpolated between the low and high watermarks: if your memory consumption is below 80% the alarm value is 0.0 (i.e. no alarm), at 85% it is 0.5 (i.e. half-way between 0.8 and 0.9), and at 90% it reaches 1.0. The alarm level never goes below 0.0 and never exceeds 1.0.
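
    The interpolation described above can be summarized as follows (a restatement of the behavior, where clamp limits the result to the 0.0-1.0 range):

    Alarm Level Interpolation (summary)
    alarm = clamp((usage - low) / (high - low), 0.0, 1.0)
    e.g. usage = 0.85, low = 0.8, high = 0.9  =>  alarm = (0.85 - 0.8) / (0.9 - 0.8) = 0.5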

    o   The alarm value is then compared with the importance value of each engine resource or activity. Should the importance value be less than the alarm value, the resource is discarded and the activity denied. The following importance values are currently hardcoded:

    • New server pipelines: 0.0
    • New client pipelines: 0.1
    • New background (i.e. action/wf) pipelines: 0.2

    o   For example, suppose your alarm level goes to 0.05. At this moment the server will stop accepting incoming connections (i.e. creating new server pipelines, since 0.0 < 0.05) and will try to close any existing server pipelines with a lower level of importance. You will still be able to initiate new client connections and run background activities. At alarm level 0.15 you will be denied new client connections (since 0.1 < 0.15; it will appear as if the client connection failed), and at alarm level 0.25 the engine will start to skip your queued actions such as TQL report processing and topic publishing. It will also skip any <DoRequest target="…"> (or any other event) to a facet with importance less than the alarm level (it will appear as if you had an optional target facet that was not deployed). Clearly, the consequences for your application can be quite severe. It is more than likely that it will be broken for good and will need to be re-deployed, unless you are prepared to cope with memory alarms to begin with.

    o   Now, here is how you can interact with the system:

    • You can assign importance values to facets, pipelines and processes. That is, you can specify an importance="<value from 0.0 to 1.0>" parameter on your facets and in Create/ModifyPipeline, and use the new <ProcessImportance>value from 0.0 to 1.0</ProcessImportance> FS instruction to set the importance of your process. Copied contexts inherit this setting. (A sketch of these settings appears after this list.)
    • Nothing prevents you from assigning an importance greater than 1.0 (in fact, this is what the engine does for things that cannot be killed, such as Manager access or inter-cluster communications). However, assigning high importance to everything will essentially disable the system, which you can do much more easily by configuring

              <sff.monitor.memory.levels>disabled</sff.monitor.memory.levels>

    in your config. Use your best judgment and keep only the things you really cannot live without, such as a component that would try to redeploy your dead application when the alarm is cleared.
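
    As a minimal sketch of these settings (the facet and pipeline element names below are illustrative placeholders; only the importance attribute and the <ProcessImportance> instruction are described above):

    Importance Settings (illustrative sketch)
    <!-- Hypothetical facet marked as essential; importance above 1.0 cannot be killed -->
    <MyRecoveryFacet importance="1.1">
      ...
    </MyRecoveryFacet>
    <!-- Hypothetical pipeline creation; the importance parameter is described for Create/ModifyPipeline -->
    <CreatePipeline importance="0.5">
      ...
    </CreatePipeline>
    <!-- FS instruction that sets the importance of the current process -->
    <ProcessImportance>0.7</ProcessImportance>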


    You also have access to $AlarmCount, $AlarmLevel and $ProcessImportance via TP, so you can check them and decide what to do. The alarm count is incremented every time the alarm level increases. It never goes down, so if you deal with periodic alarms you can compare your last stored alarm counter with the current one to determine whether a new alarm has been raised since your last recovery and you need to recover from its consequences again.
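
    The sketch below illustrates that counter comparison; the control-flow markup and the $LastStoredAlarmCount name are hypothetical placeholders, while $AlarmCount is the variable described above:

    Alarm Counter Check (hypothetical sketch)
    <!-- Illustrative control flow only; $LastStoredAlarmCount is a placeholder counter you maintain yourself -->
    <If condition="$AlarmCount != $LastStoredAlarmCount">
      <!-- a new alarm was raised since the last recovery: run recovery again -->
      <!-- then store the current $AlarmCount for the next comparison -->
    </If>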
  6. The monitoring service parameters need to be finalized after extensive testing of the application under the anticipated load scenarios.

Simple Monitoring Service

Enable Monitoring Service with A-Stack Prime

  • Monitoring Service is not packaged with A-Stack Prime.
  • The user can download the regular A-Stack distribution and create a zip archive of the contents of the resources/atomiton/mntrsrvc folder.
  • Extract the zip file on the deployed A-Stack Prime and add resources/atomiton/mntrsrvc/deploy to sff.local.config.xml (a sketch of such an entry follows).
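
  The sketch below shows what such an entry might look like in sff.local.config.xml; the element name is an assumption, not confirmed by this document, so reuse whatever element your existing config already uses for deploy paths:

  sff.local.config.xml (hypothetical entry)
  <!-- Element name is an assumption; adjust to match your existing deploy entries -->
  <sff.deploy.path>resources/atomiton/mntrsrvc/deploy</sff.deploy.path>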

Monitoring Service Cost

  • HsqlDB Storage: stores a single instance of the last monitored parameters. The size of the DB does not grow beyond 5 KB.
  • Scheduled Job: pulls data from the TQLEngine management layers. The default interval is 30 s.

Monitoring Service Usage

  • Monitoring Service exposes HTTP / WS endpoints that deliver the values of the monitored parameters.
  • Monitoring Service as a heartbeat in a peer: it is recommended to embed the service with each peer in the cluster.

Extending Monitoring Service

  • Add a TQLPolicy to the Monitoring Service and write Actions to log messages.

TQLEngine Key Parameters (Memory, Queue) Monitoring

TQLEngine key runtime parameters can be monitored by enabling the monitoring service.

Some of the parameters that are monitored are:

  • AlarmCount: The TQL engine has an internal memory management program that raises an alarm when memory utilization is found to be continuously crossing the threshold. This value counts how many times the alarm has been triggered.
  • AlarmLevel: The level of memory consumption reached by the engine at which the alarm was triggered. An AlarmLevel of 0.6 means the alarm was triggered when memory consumption had reached 60%.
  • FreeChannel: The number of free / available sockets or channels in the system.
  • FreeMemory: The available or free memory that the JVM can still use without any issues.
  • JarsIncluded: The list of all jars that are on the CLASSPATH or included or leveraged by the TQL engine.
  • JavaVersion: The Java version used by the TQL engine.
  • MaxMemory: The maximum memory currently allocated to the TQL engine (its JVM).
  • MaxQueueCheck: The action taken by the engine should any internal queue exceed the maximum allowed size. The default action logs a warning including the queue name and size. The queue name corresponds to the subscriber ID, so you can determine overloaded subscribers (i.e. subscribers that process jobs more slowly than new jobs arrive).
  • MaxQueueSize: The maximum internal queue size. Internal queues are used to serialize background jobs such as TQL reporting and topic publishing (i.e. each subscriber has its own queue and processes all publishing requests in order).
  • NioAcceptedSocketChannel: The number of accepted SocketChannels. The higher this number, the higher the chance of a memory leak, so monitoring this parameter can be crucial.
  • NioServerSocketChannel: The number of channels listening for incoming TCP connections.
  • NullChannel: The number of socket channels that are not yet instantiated.
  • OS: The operating system, its version and architecture, on which the TQL engine is hosted and running.
  • ProcessImportance: Maintained by the TQL engine's internal memory management program. It denotes the importance of processes in the TQL engine and ranges from 0.0 to 1.0. If this value is 0.4, all processes below 0.4 are considered for garbage collection when it is triggered by an alarm.
  • Processors: The number of logical processors or CPUs available to the engine.
  • ServerIP: The IP address of the server on which the TQL engine is hosted and running.
  • ServerName: The name of the server or box hosting the engine.
  • TCPclientInfo: Information on TCP connections formed on the client side: bytes read and written, total reads and writes, and the average time per read.
  • TCPserverInfo: Information on TCP connections formed on the server side: bytes read and written, total reads and writes, and the average time per read.
  • TotalChannels: The total number of sockets or channels in the system.
  • TotalMemory: The total memory from which the JVM allocates heap memory.
  • UsedMemory: The memory currently used by the JVM / TQL engine.
  • UserCountry: The country or location where the TQL engine is running.
  • UserDirectory: The TQL_HOME directory of the current user, from which the TQL engine is running.
  • UserLanguage: The language set by the user on the system where the TQL engine is running.
  • UserTimeZone: The time zone where the TQL engine is running.

You can receive notifications about key parameter changes by subscribing to the Monitoring Service over WebSocket.

Monitoring Service Websocket End Point

Topic Name: Atomiton.MonitorService.ProjectMonitor

Monitoring Service Request
<Query Storage='TqlSubscription'>
  <Save>
    <TqlSubscription Label='TQEMonitor' sid='20'>
      <Topic>
        *Atomiton.MonitorService.ProjectMonitor*
      </Topic>
    </TqlSubscription>
  </Save>
</Query>

The response is a deletion/creation of a new monitoring record:

Monitoring Service Response
<TqlNotification>
  <Create>
    <K7KVTOVXAAAH6AAAAHMC7TXN>
      <Atomiton.MonitorService.ProjectMonitor.projectSysId Value="OnChange" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.AlarmLevel Value="0.0" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.AlarmCount Value="0" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.ProcessImportance Value="0.0" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.OS Value="Mac OS X 10.1 x86_64" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.Processors Value="8" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.MaxMemory Value="3,817,865,216" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.UsedMemory Value="387,791,880" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.FreeMemory Value="64,144,376" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.TotalMemory Value="451,936,256" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.ServerName Value="bkhan" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.ServerIP Value="127.0.0.1" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.TCPserverInfo Value="Total(connections/reads/writes): (1,173/1,191/5,789); Time(total/per read): (3,770/3.165) ms; read: 546,077 bytes; written: 7,071,973 bytes" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.TCPclientInfo Value="Total(connections/reads/writes): (419/419/418); Time(total/per read): (372/0.888) ms; read: 5,378,577 bytes; written: 115,024 bytes" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.NioServerSocketChannel Value="1" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.NioAcceptedSocketChannel Value="5" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.NullChannel Value="2" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.FreeChannel Value="" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.TotalChannels Value="9" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.JavaVersion Value="1.8.0_91-b14" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.JarsIncluded Value="jar/OdaLib-2.0.0.jar,jar/com.atomiton.sff.api.jar,jar/com.atomiton.sff.dataflow.jar,jar/com.atomiton.sff.imp.base.jar,jar/com.atomiton.sff.imp.facet.jar,jar/com.atomiton.sff.imp.netty.jar,jar/com.atomiton.sff.storage.mongo.jar,jar/com.google.guava-12.0.1.jar,jar/org.apache.commons.jexl-2.1.1.jar,jar/org.apache.commons.lang-2.6.0.jar,jar/org.apache.felix.configadmin-1.8.10.jar,jar/org.apache.felix.log-1.0.1.jar,jar/org.apache.felix.metatype-1.1.2.jar,jar/org.apache.felix.scr-2.0.6.jar,jar/org.hsqldb.hsqldb-2.3.4.jar,jar/org.jboss.netty-3.10.6.jar,jar/org.ops4j.pax.logging.pax-logging-api-1.8.3.jar,jar/org.ops4j.pax.logging.pax-logging-service-1.8.3.jar" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.UserCountry Value="US" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.UserTimeZone Value="America/Los_Angeles" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.UserLanguage Value="en" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.UserDirectory Value="/Users/baseerkhan/iot/atomiton/production/rel-105" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.MaxQueueSize Value="10000" Version="1" Timestamp="1476753210039"/>
      <Atomiton.MonitorService.ProjectMonitor.MaxQueueCheck Value="warning" Version="1" Timestamp="1476753210039"/>
    </K7KVTOVXAAAH6AAAAHMC7TXN>
  </Create>
</TqlNotification>

Taking Monitoring Actions 

Monitoring the A-Stack runtime and logging alarm conditions in log files is the simplest action. Actions can be taken at multiple levels. Below are some use cases for when actions can be taken.


Restart Parameter
<sff.monitor.memory.restart>0.8</sff.monitor.memory.restart>

Infrastructure Monitoring Action