Hi. The symptoms you're describing seem odd. Firstly, you shouldn't need to increase the opentsdb memory allocation (what has been making you do that?). You should also never need to touch RegionServer, and with MetricShipper you should only ever need to increase the number of instances at the collector level if metric processing is falling behind (i.e. if you see the queue maxing out and not coming back down in the collectorredis service graphs). We never increase the allocation beyond the base for the opentsdb services. The only thing I've ever had to do is increase the number of reader instances to handle more load, but that was on systems monitoring thousands of devices. For your small workload it should handle things with the default settings without even breathing hard.

The next time you hit that 504 nginx error, go look at your Zope services and see what they are doing (i.e. are they currently passing healthchecks). Do the same for the Zenoss.core service instance listed towards the bottom of the main Zenoss.core application page, since that represents zproxy, the nginx-based service that routes web connections to the appropriate services, including the Zope instances.

The only significant difference between 5.x and 6.x is that we moved from a Lucene-based catalog service to a Solr-based one, which does use a bit more system resources but also offers significantly better performance.

For starters, make sure that your memory usage and your load averages look OK on your server. Your load average should never exceed the number of CPU cores on that host, and your server should never be digging deep into swap space. If either of those things is happening, you need to look at increasing the host resources before you continue. Hopefully that will give you something to go on.
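As a rough illustration of the host sanity check above (load average below the core count, no swap in use), here is a minimal Python sketch. It only reads standard Linux interfaces (`os.getloadavg()` and `/proc/meminfo`); the pass/fail rules are just the rules of thumb from this post, not anything Zenoss ships:

```python
# Host sanity check following the rules of thumb above:
# 1) the 1-minute load average should stay below the CPU core count, and
# 2) the host should not be dipping into swap.
# Linux-only (reads /proc/meminfo); thresholds are illustrative.
import os

def load_vs_cores():
    """Return (1-minute load, core count, True if load is under the core count)."""
    load1 = os.getloadavg()[0]
    cores = os.cpu_count()
    return load1, cores, load1 < cores

def swap_used_kb():
    """Return swap in use (kB) as SwapTotal - SwapFree from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # values are reported in kB
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)

if __name__ == "__main__":
    load1, cores, ok = load_vs_cores()
    print(f"load1={load1} cores={cores} {'OK' if ok else 'load exceeds core count'}")
    swap = swap_used_kb()
    print(f"swap_used_kb={swap} {'OK' if swap == 0 else 'host is using swap'}")
```

If either check fails consistently, that points at the host being undersized rather than at any individual Zenoss service.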
Subject: |
RE: ZenOSS Core 6.2.1 runaway resources |
Author: |
J K |
Posted: |
2020-06-09 00:22 |
Indeed it is odd, but it happened when I needed ZenOSS data the most. The reason I specifically increased opentsdb's memory was that it was failing to start (red dot) and refused to come up until I raised the allocation, since its memory usage was hitting the red line. Right now it seems to be stable, but for a simple environment like mine it doesn't feel like it will hold up once I add more devices.
Whenever this happens, the host gauges look just fine. Are there any health checks I can run besides the ones for the DB?
By the way, it's great to see a dev on the boards after so long - haven't really had much engagement from anyone from ZenOSS since the forums changed. Usually seems to be Jane holding her own.
------------------------------
J K
------------------------------
Subject: |
RE: ZenOSS Core 6.2.1 runaway resources |
Author: |
Ryan Matte |
Posted: |
2020-06-09 17:39 |
I want to make sure you're not confusing the dots in Control Center. If the healthcheck icon is red with an exclamation point (and stays that way for an extended period of time), it means the service is not starting. If it is a blue checkmark, the service is running. The little red dot that shows up when memory usage has exceeded the memory allocation is just a warning and doesn't in any way affect the operation of the service. It's basically saying "hey, you expected this service to use 2GB and it's using more than that". That's more or less just there to make it easier to identify potential memory leaks.

Keep in mind that increasing the memory allocation only changes the memory configured for certain services, such as mariadb and JVM/Java-based services, where we're explicitly using that value as part of the configuration generated for those services. Any Python-based service, for instance, will just use as much memory as it needs regardless of what's allocated, because there's no configurable way of controlling or limiting that. In those cases the allocation setting is basically just a soft threshold that generates that red dot when exceeded, but doesn't really do anything else.

I doubt that the problems you are seeing have anything specifically to do with your device monitoring. Even Zenoss Core should be able to handle monitoring a few hundred devices with the stock settings (that's without changing anything out of the box), and you're not monitoring anywhere near that. I'd just say be careful what you change: you may change something thinking it's helping while it's actually making things worse. Since you're stable at the moment, you can probably leave it as is.

Keep in mind that with Zenoss, memory performance is everything. You need enough memory on your host (ideally 32GB minimum) and you should make sure you're never digging into swap space. You should also have a minimum of 4 CPU cores available (though 8 is preferred).
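To make the "soft threshold" idea concrete, here's a minimal Python sketch of the same behavior: compare a process's resident memory against an allocation figure and only warn, never enforce, when it's exceeded. This is not Control Center's actual implementation; the 2GB figure and the use of the current process are purely illustrative, and it reads Linux's `/proc/<pid>/status`:

```python
# Sketch of a soft memory threshold, similar in spirit to Control Center's
# red-dot warning: exceeding the allocation prints a warning but takes no
# action against the process. Linux-only; figures are illustrative.
import os

def rss_bytes(pid):
    """Resident set size for a PID, read from /proc/<pid>/status (VmRSS is in kB)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024
    return 0

def check_allocation(pid, allocation_bytes):
    """Warn (red-dot style) if the process exceeds its allocation; return actual usage."""
    used = rss_bytes(pid)
    if used > allocation_bytes:
        print(f"warning: pid {pid} is using {used} bytes, "
              f"allocation is {allocation_bytes} bytes")
    return used

if __name__ == "__main__":
    # Check this script's own process against an illustrative 2GB allocation.
    check_allocation(os.getpid(), 2 * 1024**3)
```

The point of the sketch is the asymmetry: for JVM and mariadb services the allocation feeds real configuration (heap/buffer sizes), while for Python services it can only ever be this kind of advisory check.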
------------------------------
Ryan Matte
------------------------------