...
You can disable monitoring of the Network Change detection service by toggling the button on the menu.
Memory Monitoring and Recovery in A-Stack
The main purpose of Memory Monitoring and Recovery is to ensure the survival of the A-Stack runtime.
Two new config parameters are introduced (default values are shown):
<sff.monitor.memory.levels>0.8,0.9</sff.monitor.memory.levels>
<sff.monitor.memory.interval>PT1S</sff.monitor.memory.interval>
This means the memory monitor will establish two memory watermarks: low at 80% (i.e. factor 0.8) and high at 90% (i.e. factor 0.9) of your biggest memory pool (which is normally the old generation). It will also use a 1-second interval for memory recovery monitoring. (Surprisingly, JVM support for memory monitoring is very poor, so some things can only be done in a dedicated thread which sleeps for the given time between attempts to recover some memory.)
The 10% reserved memory (i.e. 1.0 – 0.9) is chosen based on research into the minimum amount of memory the GC requires to run.
It is NOT recommended to set this level above 90%, as doing so may diminish the GC's ability to run efficiently, if at all. The 80% low watermark seems like a reasonable number: you can set it lower if you want a better safety margin, or slightly higher if you are adventurous and willing to take the risk. In no event can it be greater than the high watermark. You can also make both watermarks equal, but then the system will operate at only two alarm levels, 0.0 and 1.0, with nothing in between: it will go straight from "everything is OK" to "kill everyone right now".
Once the monitor detects that the low watermark has been crossed (i.e. your memory consumption goes above 80%), it will first nicely suggest that the JVM run GC and then recheck the memory level without raising the alarm. If you're lucky, GC will collect all the garbage and you'll see a "Memory recovery" message in the log. No harm done to your application and no hard feelings. This is the happy path.
Should your memory consumption remain above the low watermark, bad things will start to happen. The monitor will raise a memory alarm, expressed as a floating-point number between 0.0 and 1.0. The alarm value is linearly interpolated between the low and high watermarks. That is, if your memory consumption is below 80% the alarm value is 0.0 (i.e. no alarm); if it is 85%, the alarm value will be 0.5 (i.e. half-way between 0.8 and 0.9); and at 90% the alarm goes to 1.0. The alarm level never goes below 0.0 and never exceeds 1.0.
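The interpolation described above can be written compactly. Here u is the fraction of the pool in use, and w_low, w_high are the configured watermarks (0.8 and 0.9 by default):

```latex
\mathrm{alarm} \;=\; \min\!\Bigl(1,\; \max\!\Bigl(0,\; \frac{u - w_{\mathrm{low}}}{w_{\mathrm{high}} - w_{\mathrm{low}}}\Bigr)\Bigr)
```

For example, u = 0.85 gives (0.85 − 0.8)/(0.9 − 0.8) = 0.5, the half-way value quoted above; any u ≥ 0.9 clamps to 1.0.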
Then the alarm value will be compared with the importance value of each engine resource or activity. Should the importance value be less than the alarm value, the resource will be discarded and the activity denied. The following importance values are currently hardcoded:
- New server pipelines: 0.0
- New client pipelines: 0.1
- New background (i.e. action/wf) pipelines: 0.2
For example, suppose your alarm level goes to 0.05. At this moment the server will stop accepting incoming connections (i.e. creating new server pipelines, since 0.0 < 0.05) and will try to close any existing server pipelines with a lower level of importance. You will still be able to initiate new client connections and run background activities. At alarm level 0.15 you'll be denied new client connections (since 0.1 < 0.15; it will appear as if the client connection failed), and at alarm level 0.25 the engine will start to skip your queued actions, such as TQL report processing and topic publishing. It will also skip any <DoRequest target="…"> (or any other event) to a facet with importance less than the alarm level. (It will appear as if you had an optional target facet which was not deployed.) Clearly, the consequences for your application can be quite severe. It is more than likely that it will be broken for good and will need to be re-deployed, unless you are prepared to cope with memory alarms to begin with.
Now, about how you can interact with the system:
- You can assign importance values to facets, pipelines and processes. That is, you can specify an importance="<value from 0.0 to 1.0>" parameter on your facets and in Create/ModifyPipeline, as well as use the new <ProcessImportance>value from 0.0 to 1.0</ProcessImportance> FS instruction to set the importance of your process. Copied contexts will inherit this setting.
- Nothing prevents you from assigning an importance greater than 1.0 (in fact, this is what the engine does for things which cannot be killed, like Manager access or inter-cluster communications). However, assigning high importance to everything will essentially disable the system, which you can do much more easily by configuring
<sff.monitor.memory.levels>disabled</sff.monitor.memory.levels>
in your config. Use your best judgment and keep high importance only on the things you really can't live without, such as a component which would try to redeploy your dead application when the alarm is cleared.
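As an illustration only: the importance attribute and the <ProcessImportance> instruction are documented above, but the facet element name and other attributes in this sketch are placeholders, not taken from this document.

```xml
<!-- Hypothetical facet declaration: "Facet" and "Name" are placeholders;
     importance > 1.0 keeps the facet alive through any memory alarm -->
<Facet Name="RedeployWatcher" importance="1.1"> ... </Facet>

<!-- FS instruction from the text: background work in this process will be
     skipped once the alarm level exceeds 0.3 -->
<ProcessImportance>0.3</ProcessImportance>
```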
- You also have access to $AlarmCount, $AlarmLevel and $ProcessImportance via TP, so you can check those and decide what to do. The alarm count is incremented every time the alarm level increases. It never goes down, so if you deal with periodic alarms you can compare your last stored alarm counter with the current one to determine whether a new alarm has been raised since your last recovery, in which case you need to recover from its consequences again.
Note: the monitoring service parameters need to be finalized after extensive testing of the application under anticipated load scenarios.
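A sketch of the recovery check described above. Only the $AlarmCount variable comes from this document; the control-flow and assignment instruction names below are assumptions for illustration, not real FS syntax.

```xml
<!-- Hypothetical recovery logic: compare the stored counter with the live one.
     <If>/<Then>/<SetLocal> are placeholder instruction names. -->
<If Condition="$AlarmCount > $LastSeenAlarmCount">
  <Then>
    <!-- A new alarm was raised since the last recovery: re-run recovery,
         then remember the counter so periodic alarms are not re-handled -->
    <SetLocal Name="LastSeenAlarmCount" Value="$AlarmCount"/>
  </Then>
</If>
```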
Simple Monitoring Service
Enable Monitoring Service with A-Stack Prime
- Monitoring Service is not packaged with A-Stack Prime.
- Download the regular A-Stack distribution and create a zip file by archiving the contents of the resources/atomiton/mntrsrvc folder.
- Extract the zip file on the deployed A-Stack Prime and add resources/atomiton/mntrsrvc/deploy to sff.local.config.xml.
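The last step presumably amounts to a deploy entry in sff.local.config.xml. The element name below is an assumption modeled on the other sff.* parameters in this document; check your distribution's config for the actual name.

```xml
<!-- Assumed form: points the engine at the extracted monitoring service bundle -->
<sff.auto.deploy>resources/atomiton/mntrsrvc/deploy</sff.auto.deploy>
```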
Monitoring Service Cost
- HsqlDB Storage: stores a single instance of the last monitored parameters. The size of the DB does not grow beyond 5KB.
- Scheduled Job: pulls data from the TQLEngine management layers. The default interval is 30s.
Monitoring Service Usage
- Monitoring Service exposes HTTP / WS endpoints that serve the values of the monitored parameters.
- Monitoring Service as a heartbeat in a peer: it is recommended to embed the service with each peer in the cluster.
Extending Monitoring Service
- Add TQLPolicy to Monitoring Service and write Actions to Log messages
TQLEngine Key Parameters (Memory, Queue) Monitoring
...
Some of the parameters that are monitored are:
Parameter | Description |
---|---|
AlarmCount | The TQL engine has an internal memory management program which raises an alarm when memory utilization is found to be continuously crossing the threshold. This variable displays the count of how many times the alarm has been triggered |
AlarmLevel | The level of memory consumption reached by the engine at which the alarm was triggered. For example, if AlarmLevel is 0.6, the alarm was triggered when memory consumption had reached 60% |
FreeChannel | This is the number of free / available Sockets or Channels in the system |
FreeMemory | This is the available or free memory which JVM can still use without any issues |
JarsIncluded | List of all the jars that are in CLASSPATH or included or leveraged by TQL engine |
JavaVersion | Java version that is being used by TQL engine |
MaxMemory | This is the size of maximum memory that is currently allocated to TQL engine or its JVM |
MaxQueueCheck | This is an action which will be taken by the engine should any internal queue exceed maximum allowed size. Default action will give you a warning including queue name and size. Queue name corresponds to subscriber ID so you should be able to determine overloaded subscribers (i.e. subscribers which process jobs slower than new jobs are coming) |
MaxQueueSize | This is maximum internal queue size. Internal queues are used to serialize background jobs like TQL reporting and topic publishing processing (i.e. each subscriber would have its own queue and process all publishing requests in order) |
NioAcceptedSocketChannel | Number of accepted SocketChannels. The higher this number, the higher the chances of a memory leak, so monitoring this parameter can be crucial |
NioServerSocketChannel | Number of channels that are listening for incoming TCP connections |
NullChannel | Number of Socket Channels that are not yet instantiated |
OS | Operating System, its version and architecture, on which TQL engine is hosted and running |
ProcessImportance | This field is maintained by the internal memory management program of the TQL engine. It denotes the importance of processes involved in the TQL engine, ranging from 0.0 to 1.0. If this value is 0.4, then all processes with importance below 0.4 will be considered for cleanup when the alarm goes off |
Processors | Number of logical Processors or CPUs with which engine is running |
ServerIP | IP address of server on which TQL engine is hosted or running |
ServerName | Name of the server or box hosting the engine |
TCPclientInfo | This field provides information on TCP connections formed at client side, read and written bytes, total reads and writes and average time taken by the connection for reads |
TCPserverInfo | This field provides information on TCP connections formed at server side, read and written bytes, total reads and writes and average time taken by the connection for reads |
TotalChannels | This is total number of Sockets or Channels in the system |
TotalMemory | Total memory currently allocated to the JVM, from which heap memory is drawn |
UsedMemory | Memory currently used by JVM or TQL engine |
UserCountry | Country or location where TQL engine is running |
UserDirectory | TQL_HOME directory for current user, from where TQL engine is running |
UserLanguage | Language preferred and set by user on the system TQL is running |
UserTimeZone | Time Zone where TQL engine is running |
You can receive notifications about key parameter changes by subscribing to the Monitoring Service over WebSocket.
Monitoring Service Websocket End Point:
Topic Name: Atomiton.MonitorService.ProjectMonitor
<Query Storage='TqlSubscription'>
<Save>
<TqlSubscription Label='TQEMonitor' sid='20'>
<Topic>
Atomiton.MonitorService.ProjectMonitor
</Topic>
</TqlSubscription>
</Save>
</Query>
The response is a delete/creation of a new monitoring record:
<TqlNotification>
<Create>
<K7KVTOVXAAAH6AAAAHMC7TXN>
<Atomiton.MonitorService.ProjectMonitor.projectSysId Value="OnChange" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.AlarmLevel Value="0.0" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.AlarmCount Value="0" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.ProcessImportance Value="0.0" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.OS Value="Mac OS X 10.1 x86_64" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.Processors Value="8" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.MaxMemory Value="3,817,865,216" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.UsedMemory Value="387,791,880" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.FreeMemory Value="64,144,376" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.TotalMemory Value="451,936,256" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.ServerName Value="bkhan" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.ServerIP Value="127.0.0.1" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.TCPserverInfo Value="Total(connections/reads/writes): (1,173/1,191/5,789); Time(total/per read): (3,770/3.165) ms; read: 546,077 bytes; written: 7,071,973 bytes" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.TCPclientInfo Value="Total(connections/reads/writes): (419/419/418); Time(total/per read): (372/0.888) ms; read: 5,378,577 bytes; written: 115,024 bytes" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.NioServerSocketChannel Value="1" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.NioAcceptedSocketChannel Value="5" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.NullChannel Value="2" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.FreeChannel Value="" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.TotalChannels Value="9" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.JavaVersion Value="1.8.0_91-b14" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.JarsIncluded Value="jar/OdaLib-2.0.0.jar,jar/com.atomiton.sff.api.jar,jar/com.atomiton.sff.dataflow.jar,jar/com.atomiton.sff.imp.base.jar,jar/com.atomiton.sff.imp.facet.jar,jar/com.atomiton.sff.imp.netty.jar,jar/com.atomiton.sff.storage.mongo.jar,jar/com.google.guava-12.0.1.jar,jar/org.apache.commons.jexl-2.1.1.jar,jar/org.apache.commons.lang-2.6.0.jar,jar/org.apache.felix.configadmin-1.8.10.jar,jar/org.apache.felix.log-1.0.1.jar,jar/org.apache.felix.metatype-1.1.2.jar,jar/org.apache.felix.scr-2.0.6.jar,jar/org.hsqldb.hsqldb-2.3.4.jar,jar/org.jboss.netty-3.10.6.jar,jar/org.ops4j.pax.logging.pax-logging-api-1.8.3.jar,jar/org.ops4j.pax.logging.pax-logging-service-1.8.3.jar" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.UserCountry Value="US" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.UserTimeZone Value="America/Los_Angeles" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.UserLanguage Value="en" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.UserDirectory Value="/Users/baseerkhan/iot/atomiton/production/rel-105" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.MaxQueueSize Value="10000" Version="1" Timestamp="1476753210039"/>
<Atomiton.MonitorService.ProjectMonitor.MaxQueueCheck Value="warning" Version="1" Timestamp="1476753210039"/>
</K7KVTOVXAAAH6AAAAHMC7TXN>
</Create>
</TqlNotification>
Taking Monitoring Actions
Monitoring the A-Stack runtime and logging the alarm conditions in log files is the simplest action. Actions can be taken at multiple levels. Below are some use cases of when actions can be taken. One example is the parameter below, which by its name appears to configure a memory level at which the engine restarts itself (shown here with a 0.8 threshold):
<sff.monitor.memory.restart>0.8</sff.monitor.memory.restart>