TECHZEN Zenoss User Community ARCHIVE  

Change zenhub resource in Zenoss 5

Subject: Change zenhub resource in Zenoss 5
Author: Brian Schimmoller
Posted: 2017-05-16 11:22

When I change a Windows monitoring template, zenhub's CPU usage spikes, causing zenhub to stop communicating with other services (zenpython in particular). The logs appear to show that it is checking every Windows server against the monitoring template. During the spike, "docker stats" shows the container running at 300-400% CPU, and it stays there for 45 minutes to an hour and a half.

During this time, I don't get any new events. The CPU on the server is acceptable (16 cores, load of ~3-5). I tried to add CPU to the zenhub container, but it didn't appear to accept the change. I did a "serviced service edit zenhub" and changed "CPUCommitment" to 4 and then restarted zenhub. I triggered the CPU spike and the container was still using 300-400% CPU.

Is this the correct way to adjust the CPU allocation to zenhub? Is there additional work I need to do to let zenhub use the additional CPUs? Thanks.


Subject: RE: Change zenhub resource in Zenoss 5
Author: Andrew Kirch
Posted: 2017-05-16 13:56

I'd look at what you added to see if it's functioning correctly (check the zenhub and zenpython logs in Kibana).

How many zenhubs are you running?  Does adding more resolve the issue in such a way that the higher load can be ignored?  

Usually something chewing up CPU for 45 minutes is bad though, and I'd look to fix the underlying issue.



------------------------------
Andrew Kirch
Senior Solutions Engineer
GoVanguard
------------------------------


Subject: RE: Change zenhub resource in Zenoss 5
Author: Brian Schimmoller
Posted: 2017-05-16 15:46

Thanks for the reply. I'm only running a single instance of zenhub. Do you think adding another instance would help balance the load?

What I'm adding isn't causing the issue. To trigger it, I just add a datasource called "test" to the Active Directory monitoring template. The zenhub debug log shows that it's checking that monitoring template against all of my servers. At this point I start to see a backlog of events in zenpython, which eventually grows to 5000 before it starts purging events. The zenhub CPU jumps up and stays there until zenhub eventually restarts itself after 45-90 minutes, and then I'm back to normal for a while.

I have a notification set up to restart zenhub when the localhost zenhub process CPU breaches 90%, and this has kept me running well for a while, but I would definitely prefer to solve the problem.

Edit: A couple more things: zenhub.config has "workers 15"; zenpython.config has "twistedthreadpoolsize 16"
Here's a snippet of the zenhub log at the tail end of scanning all of my Windows devices and then all workers showing "busy":

2017-05-16 19:58:41,911 DEBUG zen.hub: /zport/dmd/Devices/Server/Microsoft/Windows/<DEVICE> not bound to template /zport/dmd/Devices/Server/Microsoft/rrdTemplates/Active Directory
2017-05-16 19:58:42,007 DEBUG zen.hub: /zport/dmd/Devices/Server/Microsoft/Windows/<DEVICE> not bound to template /zport/dmd/Devices/Server/Microsoft/rrdTemplates/Active Directory
2017-05-16 19:58:42,125 DEBUG zen.hub: /zport/dmd/Devices/Server/Microsoft/Windows/<DEVICE> not bound to template /zport/dmd/Devices/Server/Microsoft/rrdTemplates/Active Directory
2017-05-16 19:58:42,228 DEBUG zen.hub: /zport/dmd/Devices/Server/Microsoft/Windows/<DEVICE> not bound to template /zport/dmd/Devices/Server/Microsoft/rrdTemplates/Active Directory
2017-05-16 19:58:42,232 DEBUG zen.hub.notify: BatchNotifier._callback: no more devices, 19 in queue
2017-05-16 19:58:42,233 DEBUG zen.publisher: metric flush to redis in progress, skipping _put
2017-05-16 19:58:42,233 DEBUG zen.publisher: trying to publish 6 metrics
2017-05-16 19:58:42,234 DEBUG zen.ZenHub: CommandPerformanceConfig.notifyAffectedDevices is interested in <Products.ZenHub.zodb.UpdateEvent object at 0x58df0190> for <RRDTemplate at /zport/dmd/Devices/Server/Microsoft/rrdTemplates/Active Directory>
2017-05-16 19:58:42,235 DEBUG zen.hub.notify: BatchNotifier.notify_subdevices: <DeviceClass at /zport/dmd/Devices/Server/Microsoft>, ('CommandPerformanceConfig', 'localhost')
2017-05-16 19:58:44,197 INFO zenUtils.AutoGCObjectReader: GC: reduced cache to 6122/1000 (total/active) objects
2017-05-16 19:58:44,205 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,205 DEBUG zen.ZenHub: Giving getDevicePingIssues to worker 10, (localhost:Products.ZenHub.services.EventService.getDevicePingIssues)
2017-05-16 19:58:44,207 DEBUG zen.ZenHub: worklist has 2 items
2017-05-16 19:58:44,208 DEBUG zen.ZenHub: Giving getConfigProperties to worker 11, (localhost:Products.ZenHub.services.PingPerformanceConfig.getConfigProperties)
2017-05-16 19:58:44,209 DEBUG zen.ZenHub: worklist has 1 items
2017-05-16 19:58:44,215 DEBUG zen.ZenHub: Giving getConfigProperties to worker 12, (localhost:Products.ZenHub.services.CommandPerformanceConfig.getConfigProperties)
2017-05-16 19:58:44,507 DEBUG zen.ZenHub: worklist has 5 items
2017-05-16 19:58:44,507 DEBUG zen.ZenHub: Giving sendEvents to worker 13, (localhost:Products.ZenHub.services.EventService.sendEvents)
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: worklist has 4 items
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: Giving getDevicePingIssues to worker 14, (localhost:Products.ZenHub.services.EventService.getDevicePingIssues)
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: all workers are busy
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: all workers are busy
2017-05-16 19:58:44,508 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: all workers are busy
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: all workers are busy
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: all workers are busy
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: worklist has 3 items
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: all workers are busy
2017-05-16 19:58:44,509 DEBUG zen.ZenHub: worklist has 3 items
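
For anyone wanting to quantify this from a saved zenhub debug log, here is a minimal Python sketch that counts "all workers are busy" messages per second and tracks the largest reported worklist. It assumes only the line layout visible in the excerpt above; the function name summarize and the way the log is fed in are illustrative and not part of Zenoss.

```python
import re
from collections import Counter

# Matches the line layout seen in the excerpt, for example:
#   2017-05-16 19:58:44,508 DEBUG zen.ZenHub: worklist has 3 items
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"(?P<level>[A-Z]+) (?P<logger>\S+): (?P<msg>.*)$"
)
WORKLIST_RE = re.compile(r"worklist has (\d+) items?")


def summarize(lines):
    """Return (busy messages per second, peak worklist length)."""
    busy_per_second = Counter()
    peak_worklist = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log line
        msg = match.group("msg")
        if msg == "all workers are busy":
            busy_per_second[match.group("ts")] += 1
            continue
        worklist = WORKLIST_RE.search(msg)
        if worklist:
            peak_worklist = max(peak_worklist, int(worklist.group(1)))
    return busy_per_second, peak_worklist


# Three lines copied from the excerpt above.
sample = [
    "2017-05-16 19:58:44,507 DEBUG zen.ZenHub: worklist has 5 items",
    "2017-05-16 19:58:44,508 DEBUG zen.ZenHub: all workers are busy",
    "2017-05-16 19:58:44,509 DEBUG zen.ZenHub: all workers are busy",
]

busy, peak = summarize(sample)
print(dict(busy), peak)
```

Running it over the whole log (for example by passing an open file object to summarize) gives a rough picture of how long the workers stay saturated.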


Subject: RE: Change zenhub resource in Zenoss 5
Author: Jane Curry
Posted: 2017-06-22 08:02

I have exactly the same issue.  I am monitoring about 160 Windows servers with the Microsoft Windows ZenPack.  I have just upgraded this to 2.7.7 with PythonCollector ZenPack at 1.9.0 on a Zenoss 5.2.1.

Restarting Zenoss.core or zenhub or zenpython takes hours.  Modifying a template that is used by the Windows boxes, and so needs pushing, takes hours.  Looking at a debug log of zenhub, it looks like most of the work is going into zenhub pushing config changes and processing invalidations (i.e. config changes).

A config change should only be a communication between zenhub and the relevant daemons on the relevant collectors - it should not need to go out to target devices.  I only have a single localhost collector so there shouldn't be any comms or target delays in there.

I have set up performance graphs for zenhub and zenpython as documented here - zenhub and zenpython performance graphs.  Scary amounts of zenhub workListLength - often between 10 and 20 - and zenpython stats show that Missed Runs and Queued Tasks just get bigger whilst zenhub consumes all the CPU.

I note that zenpython.conf (with ZP version 1.9.0) now has parameters for configsipsize (default 25 - max number of device configs to load at once) and configsipdelay (default 1 - delay in seconds between device configs loading).  I see no documentation at all for these.
I also note that the PythonCollector ZenPack 1.8.0 added a twistedconcurrenthttp parameter which is also undocumented - hacking through the code into web/semaphores.py provides a default for DEFAULT_TWISTEDCONCURRENTHTTP of 512.
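
As a rough illustration of what the two sip defaults would imply, here is a small Python sketch. It assumes configs are fetched in batches of at most configsipsize with a pause of configsipdelay seconds between batches; that is one reading of the parameter descriptions above and is not confirmed by any documentation. The helper names are hypothetical.

```python
import math


def sip_batches(device_count, configsipsize=25):
    """Number of batches needed if at most configsipsize configs load at once."""
    if configsipsize <= 0:
        raise ValueError("configsipsize must be positive")
    return math.ceil(device_count / configsipsize)


def min_sip_delay_seconds(device_count, configsipsize=25, configsipdelay=1.0):
    """Total pause time, assuming one pause between consecutive batches."""
    batches = sip_batches(device_count, configsipsize)
    return max(batches - 1, 0) * configsipdelay


# Roughly 160 Windows servers with the default values quoted above.
print(sip_batches(160))            # 7
print(min_sip_delay_seconds(160))  # 6.0
```

If that reading is right, the sip pauses themselves add only seconds for 160 devices, so the defaults alone would not explain reloads that take hours.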

I have no idea how to tweak parameters in zenhub and zenpython to alleviate the problems - can anyone help??

Further - looking at the release notes for both Core and Resource Manager, 5.2.3 documents a new feature:
Collection start
Getting device configurations from ZenHub on startup delays the start of collection. Implementation of a
configuration caching layer shortens this delay. This enhancement enables collectors to start collecting (or
continue collecting during restart) with the cached configurations while the new configurations are being
loaded.

along with fixed bug:
ZEN-26936      Device configuration loading speed is slow.

So is there any point in trying to tweak parameters if sizeable architectural changes have been made around changing device configs?

Cheers,
Jane

------------------------------
Jane Curry
Skills 1st United Kingdom
jane.curry@skills-1st.co.uk
------------------------------


Subject: RE: Change zenhub resource in Zenoss 5
Author: Jane Curry
Posted: 2017-07-04 12:02

Something else I noticed today.  Changed one element of an existing template for Windows devices (the event class generated).  I immediately see zenhub and regionserver going to 100 - 200% of CPU (I have 8 cpus).  Settles down after about 10 minutes (this is "normal" having disabled all the IIS monitors; before that it took an hour or more).

IGNORE THIS COMMENT
====================
The weird thing is that I also have a bunch of events in the format "Modeler plugin zenoss.winrm.Software returned no results.".  All the modelers are complaining for the few devices that are not contactable.  I have my zenmodeler scheduled to only run at 01:00 and zenmodeler.log confirms that there has been no activity since about 02:30. 

So what's with the whinges about modeler plugins?  Is this another spurious manifestation of a "Config Change" for a template?
================ Someone else was doing things and I thought I was the only one working on this on the fourth of July! ============

Cheers,
Jane

------------------------------
Jane Curry
Skills 1st United Kingdom
jane.curry@skills-1st.co.uk
------------------------------


Subject: RE: Change zenhub resource in Zenoss 5
Author: nandha K
Posted: 2017-08-01 01:31

Hi Jane,

A memory leak in the zenpython service for Windows devices is fixed in version 2.7.8 of the Microsoft Windows ZenPack. Please try upgrading and share the result.

Regards
Nandha

------------------------------
nandha K
------------------------------

