...
The main purpose of Memory Monitoring and Recovery is to ensure the survival of the A-Stack Runtime.
Two new config parameters are introduced (default values are shown):
<sff.monitor.memory.levels>0.8,0.9</sff.monitor.memory.levels>
<sff.monitor.memory.interval>PT1S</sff.monitor.memory.interval>
This means that the memory monitor will establish two memory watermarks: low at 80% (i.e. a 0.8 factor) and high at 90% (i.e. a 0.9 factor) of your biggest memory pool (which is normally the old generation). It will also use a 1-second interval for memory recovery monitoring (the interval is an ISO-8601 duration, so PT1S means one second). Surprisingly, the JVM offers very poor support for memory monitoring, so some things can only be done in a dedicated thread which sleeps for the given interval between attempts to recover some memory.
The 10% memory reserve (i.e. 1.0 - 0.9) is chosen based on research into the minimum amount of free memory required for the GC to run.
It is NOT recommended to set this level above 90%, as it may diminish the GC's ability to run efficiently, if at all. The 80% low watermark seems like a reasonable number. You can set it lower if you want a better safety margin, or slightly higher if you are adventurous and willing to take the risk. In no event can it be greater than the high watermark. You can also make both watermarks equal, but then the system will operate at only two alarm levels, 0.0 and 1.0, with nothing in between: it will go from "everything is OK" directly to "kill everyone right now".
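For example, to get a wider safety margin you could lower both watermarks. The values below are illustrative only; the format is the same comma-separated low,high pair shown above:

<sff.monitor.memory.levels>0.7,0.85</sff.monitor.memory.levels>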
Once the monitor detects that the low watermark is crossed (i.e. your memory consumption goes above 80%), it will first nicely suggest that the JVM run GC and then recheck the memory level without raising the alarm. If you are lucky, GC will collect all the garbage and you will see a "Memory recovery" message in the log. No harm done to your application and no hard feelings. This is the happy path.
Should your memory consumption remain above the low watermark, bad things will start to happen. The monitor will raise a memory alarm, expressed as a floating-point number between 0.0 and 1.0. The alarm value is linearly interpolated between the low and high watermarks. That is, if your memory consumption is below 80% the alarm value is 0.0 (i.e. no alarm); if it is 85% the alarm value will be set to 0.5 (i.e. halfway between 0.8 and 0.9); and at 90% the alarm will go to 1.0. The alarm level will never go below 0.0 and will never exceed 1.0.
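Expressed as a formula (a restatement of the rule above, where used is the current usage fraction of the monitored pool and low/high are the configured watermarks):

alarm = min(1.0, max(0.0, (used - low) / (high - low)))

For example, with the default 0.8,0.9 levels and 85% usage: (0.85 - 0.8) / (0.9 - 0.8) = 0.5.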
o The alarm value will then be compared with the importance value of each engine resource or activity. Should the importance value be less than the alarm value, the resource will be discarded and the activity denied. The following importance values are currently hardcoded:
- New server pipelines: 0.0
- New client pipelines: 0.1
- New background (i.e. action/wf) pipelines: 0.2
o For example, suppose your alarm level goes to 0.05. At this moment the server will stop accepting incoming connections (i.e. creating new server pipelines, since 0.0 < 0.05) and will try to close any existing server pipelines with a lower level of importance. You will still be able to initiate new client connections and run background activities. At alarm level 0.15 you will be denied new client connections (since 0.1 < 0.15; it will appear as if the client connection failed), and at alarm level 0.25 the engine will start to skip your queued actions such as TQL report processing and topic publishing. It will also skip any <DoRequest target="…"> (or any other event) to a facet with importance less than the alarm level (it will appear as if you had an optional target facet which was not deployed). As you can see, the consequences for your application can be quite severe. It is more than likely that it will be broken for good and will need to be re-deployed, unless you are prepared to cope with memory alarms to begin with.
o Now, about how you can interact with the system:
- You can assign importance values to facets, pipelines and processes. That is, you can specify an importance="<value from 0.0 to 1.0>" parameter on your facets and in Create/ModifyPipeline, as well as use the new <ProcessImportance>value from 0.0 to 1.0</ProcessImportance> FS instruction to set the importance of your process (see the sketch below). Copied contexts will inherit this setting.
- Nothing prevents you from assigning an importance greater than 1.0 (in fact, this is what the engine does for things which cannot be killed, like Manager access or inter-cluster communications). However, assigning high importance to everything will essentially disable the system, which you can do much more easily by configuring
<sff.monitor.memory.levels>disabled</sff.monitor.memory.levels>
in your config. Use your best judgment and keep only the things you really cannot live without, like a component which would try to redeploy your dead application when the alarm is cleared.
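Below is a minimal sketch of assigning importance using the constructs named above. Only the importance parameter, the <ProcessImportance> instruction and the 0.0-1.0 range come from this description; the element layout and names such as RecoveryWatcher are hypothetical placeholders, so adapt them to your actual facet and pipeline definitions.

<!-- Hypothetical: keep a recovery facet alive even at full alarm by giving it importance above 1.0 -->
<Facet name="RecoveryWatcher" importance="1.1"/>

<!-- Hypothetical: a client pipeline that should survive moderate alarms (other Create/ModifyPipeline parameters omitted) -->
<CreatePipeline importance="0.6"/>

<!-- FS instruction from the text: set the importance of the current process; copied contexts inherit it -->
<ProcessImportance>0.7</ProcessImportance>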
o This is totally new territory for all of us, so consider this first implementation to be of beta quality. Give me your feedback on how useful this is and how to make it more useful.
You also have access to $AlarmCount, $AlarmLevel and $ProcessImportance via TP, so you can check those and decide what to do. The alarm count will be incremented every time the alarm level is increased. It will never go down, so if you deal with periodic alarms you can compare your last stored alarm counter with the current one to determine whether a new alarm has been raised since your last recovery and whether you need to recover from its consequences again.
- The monitoring service parameters need to be finalized after extensive testing of the applications under anticipated load scenarios.
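For the TP variables above, here is a hypothetical sketch of the "compare your last stored alarm counter" idea. Only $AlarmCount, $AlarmLevel and $ProcessImportance are real variables named in this description; the conditional element and $LastHandledAlarmCount are placeholders to be adapted to your FacetScript conventions.

<!-- Placeholder conditional: a new alarm has been raised since our last recovery -->
<If condition="$AlarmCount &gt; $LastHandledAlarmCount">
    <!-- re-run the recovery logic here, then store the current $AlarmCount as $LastHandledAlarmCount -->
</If>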
Simple Monitoring Service
...