TECHZEN Zenoss User Community ARCHIVE  

Events stop every 80 minutes

Subject: Events stop every 80 minutes
Author: [Not Specified]
Posted: 2017-04-24 10:10

Every 80 minutes SOMETHING is happening causing communication to stop. The biggest tip is the event queue length for zenpython. Like clockwork, it will climb for 10 minutes and then drop back down to 0.

The cause of this issue is the "Config Cycle Interval" setting for the collector. I have it set at 20 minutes, so every 20 minutes device configs are pulled back by zenhub, and I see a huge increase in network traffic and CPU load for a few minutes during this time. Every 80 minutes Windows device configs are pulled back, and it takes the worker 3-4 minutes to pull the data. Zenpython is doing that work while also handling the regular data and events that are coming in, so it takes 10 minutes for the zenpython event queue to clear out. I've reattached my zenpython event queue graph.

What effect would bumping the Config Cycle Interval up to a few hours have on Zenoss? If I made a change to a device would it take a few hours for Zenoss to recognize the change as opposed to 20 minutes as it's set now?

I'm running Zenoss Core 5.2.1, so no collectors.

Attachments:

zenpython_eventqueue.png



Subject: RE: Events stop every 80 minutes
Author: Jane Curry
Posted: 2017-05-30 09:32

I would up the Config Cycle. 

My understanding is that the Config Cycle is more to do with zenhub updating collectors with changed information in the Zenoss ZODB database. For example, if you have had a modeler cycle and some new interfaces have been found on a device, then collectors need to be told to monitor those interfaces. You may have changed template definitions; again, those changes are in the Zenoss central database and the details need pushing to collectors. Another example is if you change zProperties - like SNMP community or Windows user/password/WinRM details.

If you go back a long way - and I'm probably talking early Zenoss 3.x here - then "Config Changes" needed to be specifically "pushed" to collectors; in newer Zenoss implementations I think much of this "push" work is done automatically if something changes. You also have the "Push Config Changes" option from the Action icon for a specific device if you know something has changed recently.

I would try changing the Config Cycle to a few hours as you suggest and then run some tests changing various details like those given above and see if they get automatically pushed.

Cheers,
Jane

------------------------------
Jane Curry
Skills 1st United Kingdom
jane.curry@skills-1st.co.uk
------------------------------


Subject: RE: Events stop every 80 minutes
Author: Brian Schimmoller
Posted: 2017-06-07 11:16

Jane,

Thanks for your response. I went ahead and changed the Config Cycle Interval from 20 minutes to 360 minutes. I am now seeing the same results, but 6 hours apart now. I also see zenhub CPU spike to 100% for ~70 minutes, which corresponds with this log entry:


2017-06-07 14:26:34,314 DEBUG zen.ZenHub: worker 7, work localhost:ZenPacks.zenoss.PythonCollector.services.PythonConfig.getDeviceConfigs finished in 3962.51283693

Is it normal for PythonCollector to take 66 minutes to pull Windows device configs for ~400 systems?
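As a quick sanity check on that figure, the duration at the end of the zenhub DEBUG line can be pulled out and converted to minutes. This is only a minimal sketch: the regular expression and the helper name `finished_minutes` are assumptions based on the single log line quoted in this post, not a documented zenhub log format.

```python
import re

# Pattern is an assumption derived from the one DEBUG line quoted in this post.
FINISHED_RE = re.compile(r"worker (\d+), work (\S+) finished in ([\d.]+)")


def finished_minutes(line):
    """Return (worker_id, method, minutes) for a 'finished in' line, else None."""
    match = FINISHED_RE.search(line)
    if match is None:
        return None
    worker, method, seconds = match.groups()
    # The trailing number is assumed to be a duration in seconds.
    return int(worker), method, float(seconds) / 60.0


line = ("2017-06-07 14:26:34,314 DEBUG zen.ZenHub: worker 7, work "
        "localhost:ZenPacks.zenoss.PythonCollector.services.PythonConfig."
        "getDeviceConfigs finished in 3962.51283693")

worker, method, minutes = finished_minutes(line)
print("worker %d spent %.1f minutes in %s" % (worker, minutes, method))
```

Run over a whole zenhub log, the same helper could list every long-running call, which might show whether getDeviceConfigs is the only slow one.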


Subject: RE: Events stop every 80 minutes
Author: Jane Curry
Posted: 2017-06-12 08:17

I have to say "surely not"??  Your original post showed that there were sometimes lots of timeouts from zenpython when collecting Windows data - could it be a long timeout on each zenpython request when lots of devices are down?

That said, I think your latest message is saying it is taking 66 minutes to pull device CONFIGS from zenhub to the zenpython daemon, I think on the localhost collector??  No remote collector here?  That shouldn't involve the target devices at all!

Cheers,
Jane

------------------------------
Jane Curry
Skills 1st United Kingdom
jane.curry@skills-1st.co.uk
------------------------------

Subject: RE: Events stop every 80 minutes
Author: Jane Curry
Posted: 2017-06-19 14:01

How did you change the Config Cycle in Zenoss 5.x??
Cheers,
Jane

------------------------------
Jane Curry
Skills 1st United Kingdom
jane.curry@skills-1st.co.uk
------------------------------


Subject: RE: Events stop every 80 minutes
Author: Brian Schimmoller
Posted: 2017-06-20 08:43

Go to Advanced -> Control Center -> localhost. This will bring up the "Details" of the localhost in the bottom pane.


Subject: RE: Events stop every 80 minutes
Author: Jane Curry
Posted: 2017-06-20 14:02

Thanks Brian - just twigged that! Must have been particularly thick that day ;(

That DOES give access to the ConfigCycle.  However, with 5.2.1 and 5.2.4 I still cannot expand the localhost menu to see the individual daemons and their configs, logs and graphs.  Get a yellow flashy saying:
"AttributeError: 'MonitorFacade' object has no attribute 'getMonitor'"

I have just spun up a 5.2.4 in the hope that this one was cured - logged as https://jira.zenoss.com/browse/ZEN-27452 Do you get the same thing?

Have just opened this as an item in its own right - Cannot expand localhost in Zenoss GUI -> Control Centre

Cheers,
Jane

------------------------------
Jane Curry
Skills 1st United Kingdom
jane.curry@skills-1st.co.uk
------------------------------

Every 80 minutes SOMETHING is happening causing communication to stop. The biggest tip is the event queue length for zenpython. Like clockwork, it will climb for 10 minutes and then drop back down to 0.

I've been troubleshooting this for about a week and I can't track down the root cause. I'm thinking it's something to do with docker communication, but I can't prove it.

Occasionally during this time, I'll get the following zenpython logs:

April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,278 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,280 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,281 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,282 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,287 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,289 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,298 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,319 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,324 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,324 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,325 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,326 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:33:01.847 0 2017-04-24 07:33:01,327 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:32:55.846 0 2017-04-24 07:32:55,599 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:32:55.846 0 2017-04-24 07:32:55,608 WARNING zen.MicrosoftWindows: network error on : timeout
April 24th 2017, 03:32:55.846 0 2017-04-24 07:32:55,609 WARNING zen.MicrosoftWindows: network error on : timeout

But I don't get ping timeouts, and the graph data doesn't seem to skip a beat; just the event counts stop for 10-15 minutes. Once the 10 minutes are up, event counts catch up and I receive new events.

After the 10 minutes are up, I get this Elasticsearch error mentioning a 10-minute timeout. I'm just not sure if this is the cause or the effect of my problem, and I'm not sure what Elasticsearch provides.
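To see whether the timeout warnings quoted above arrive in bursts that line up with the 80-minute cycle, one option is to count them per second of log time. The sketch below assumes the zenpython lines keep the layout shown in this post; the regular expression and the function name `timeouts_per_second` are illustrative, not part of Zenoss.

```python
import re
from collections import Counter

# Assumed layout, copied from the quoted lines:
# "YYYY-MM-DD HH:MM:SS,mmm WARNING zen.MicrosoftWindows: network error on <host>: timeout"
WARNING_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARNING "
    r"zen\.MicrosoftWindows: network error on [^:]*: timeout"
)


def timeouts_per_second(log_text):
    """Count timeout warnings grouped by their timestamp, truncated to the second."""
    return Counter(WARNING_RE.findall(log_text))


sample = (
    "2017-04-24 07:33:01,278 WARNING zen.MicrosoftWindows: network error on : timeout\n"
    "2017-04-24 07:33:01,280 WARNING zen.MicrosoftWindows: network error on : timeout\n"
    "2017-04-24 07:32:55,599 WARNING zen.MicrosoftWindows: network error on : timeout\n"
)

for second, count in sorted(timeouts_per_second(sample).items()):
    print("%s %d" % (second, count))
```

Because the pattern is searched rather than anchored to line starts, it should also cope with log exports where several entries have been run together on one line.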

 

Has anyone seen this before? Anyone have any idea what might be happening every 80 minutes?

I can provide more logs, if needed, but I don't want to just dump everything out here.





