TECHZEN Zenoss User Community ARCHIVE  

Datasource Command Timeout Clears my NRPE events! [SOLVED]

Subject: Datasource Command Timeout Clears my NRPE events! [SOLVED]
Author: [Not Specified]
Posted: 2015-02-13 08:55

Hi,

I've installed Nagios NRPE on Zenoss 4.2.3 in order to check some processes on a remote host. If I execute the command

/usr/lib64/nagios/plugins/check_nrpe -H remote_ip -c check_procs_processname

from the command line, it works fine. I have created a local template with a Data Source of type Command, where I've set the cycle time to 60 seconds. The problem is that a Datasource Command Timeout event is generated every minute and it clears my NRPE events. I've changed zCommandCommandTimeout to 60000 and have also changed command_timeout and connection_timeout in the nrpe.cfg file to several thousand seconds, but the problem persists.
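
(For reference, a quick way to double-check the timeout the collector actually sees is a zendmd session along these lines; the device id is a placeholder and the zProperty is read like any other attribute.)

    # Hedged sketch: run inside zendmd as the zenoss user.
    dev = dmd.Devices.findDevice('remote_host')             # placeholder device id
    print(dev.zCommandCommandTimeout)                       # effective timeout, in seconds
    print(dev.getProperty('zCommandCommandTimeout'))        # same value via the property API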
I enabled the zencommand Debug mode and I got the following error:

DEBUG zen.pbclientfactory: Lost connection to ::1:8789 - [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.

I want to state that the firewall is off, and the traffic passes through NAT before reaching the remote host, which has a public IP.

I have seen that lots of other people have faced the same issue, but unfortunately none of the older posts was helpful to me.
I would really appreciate it if someone could help me resolve this issue.

Thnx,
Irba.



Subject: > DEBUG zen.pbclientfactory:
Author: Jan Garaj
Posted: 2015-02-13 11:31

> DEBUG zen.pbclientfactory: Lost connection to ::1:8789 - [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.

It's not an error, it's a feature :-)

Actually, this is not a problem with your device but on the Zenoss side. ::1:8789 is the connection to localhost port 8789 (zenhub), and it's not an error because the first word is "DEBUG": the connection to zenhub was not closed cleanly, so the Twisted library raised this "error". IMHO it's normal and it's not affecting any Zenoss functionality.

Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.

DevOps stack: Docker / Kubernetes / Mesos / Zabbix / Zenoss / Grafana / Puppet / Ansible / Vagrant / Terraform / Elasticsearch



Subject: I understand. But still it is
Author: [Not Specified]
Posted: 2015-02-16 02:34

I understand. But it is still affecting my monitoring, because a Datasource Command Timeout is generated every time the command is executed, and it clears the event. What can I do to get rid of it?
If a service is affected for 30 minutes, I want just one event with a duration of 30 minutes, not 30 separate events (if I check that service every 60 seconds).

Any suggestions?

Thnx,
Irba



Subject: Clear event is default
Author: Jan Garaj
Posted: 2015-02-16 02:45

A clear event is the default behavior when everything is OK. Please show an example of PROBLEM output from /usr/lib64/nagios/plugins/check_nrpe -H remote_ip -c check_procs_processname. Also please share your datasource config and the zencommand debug output (zencommand run -d -v 10).




Subject: Do you mean If I get a reply
Author: [Not Specified]
Posted: 2015-02-16 03:12

Do you mean that if I get a reply from the remote host, Zenoss assumes everything is OK? Perhaps I should make a script for it, instead of just a command?

My data source config is like this:

Type: Command
Parser: Auto (I tried Nagios also)
Severity: Error
Component: Radius
Event Key: check_procs_radiusd
Command Template: /usr/lib64/nagios/plugins/check_nrpe -H remote_ip -c check_procs_radiusd

This happens when I execute the command:

Preparing Command...
Executing command /usr/lib64/nagios/plugins/check_nrpe [args omitted] against <device>
PROCS CRITICAL: 1 process with args 'radiusd' | procs=1;;2:2;0;
DONE in 0 seconds
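
(For context: the '|' in that output is the standard Nagios separator between the human-readable summary and the performance data, which is why the events further down in the debug output show them split apart. A minimal illustration of that split, not the actual Zenoss parser code:)

    # Illustrative only: splitting a Nagios-style plugin line into summary and perfdata.
    line = "PROCS CRITICAL: 1 process with args 'radiusd' | procs=1;;2:2;0;"
    summary, _, perfdata = line.partition('|')
    print(summary.strip())    # PROCS CRITICAL: 1 process with args 'radiusd'
    print(perfdata.strip())   # procs=1;;2:2;0;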

And this is the output from zencommand debug:

2015-02-16 09:54:35,870 DEBUG zen.zencommand: Starting PBDaemon initialization
2015-02-16 09:54:35,870 INFO zen.zencommand: Connecting to localhost:8789
2015-02-16 09:54:35,871 DEBUG zen.pbclientfactory: Starting connection...
2015-02-16 09:54:35,871 DEBUG zen.zencommand: Logging in as admin
2015-02-16 09:54:35,872 DEBUG zen.pbclientfactory: Connected
2015-02-16 09:54:35,872 DEBUG zen.pbclientfactory: Cancelling connect timeout
2015-02-16 09:54:35,873 DEBUG zen.pbclientfactory: Sending credentials
2015-02-16 09:54:35,876 DEBUG zen.pbclientfactory: Cancelling connect timeout
2015-02-16 09:54:35,876 INFO zen.zencommand: Connected to ZenHub
2015-02-16 09:54:35,876 DEBUG zen.zencommand: Setting up initial services: EventService, Products.ZenHub.services.CommandPerformanceConfig
2015-02-16 09:54:35,877 DEBUG zen.zencommand: Chaining getInitialServices with d2
2015-02-16 09:54:35,877 DEBUG zen.pbclientfactory: pinging perspective
2015-02-16 09:54:35,880 DEBUG zen.zencommand: Loaded service EventService from zenhub
2015-02-16 09:54:35,880 DEBUG zen.zencommand: Loaded service Products.ZenHub.services.CommandPerformanceConfig from zenhub
2015-02-16 09:54:35,880 DEBUG zen.zencommand: Queued event (total of 1) {'rcvtime': 1424076875.880773, 'severity': 0, 'component': 'zencommand', 'agent': 'zencommand', 'summary': 'started', 'manager': 'srv-zenoss5', 'device': 'localhost', 'eventClass': '/App/Start', 'monitor': 'localhost'}
2015-02-16 09:54:35,881 DEBUG zen.zencommand: Sending 1 events, 0 perf events, 0 heartbeats
2015-02-16 09:54:35,881 DEBUG zen.zencommand: Calling connected.
2015-02-16 09:54:35,882 DEBUG zen.collector.config: Heartbeat timeout set to 900s
2015-02-16 09:54:35,882 DEBUG zen.collector.scheduler: add task configLoader, <Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627710> using 1200 second interval
2015-02-16 09:54:35,882 DEBUG zen.zencommand: Performing periodic maintenance
2015-02-16 09:54:35,882 DEBUG zen.pbclientfactory: perspective ponged
2015-02-16 09:54:35,882 DEBUG zen.pbclientfactory: Cancelling ping timeout
2015-02-16 09:54:35,882 DEBUG zen.collector.scheduler: Task configLoader starting on 1200 second intervals
2015-02-16 09:54:35,883 DEBUG zen.collector.scheduler: Task configLoader changing state from IDLE to QUEUED
2015-02-16 09:54:35,883 DEBUG zen.collector.scheduler: Task configLoader changing state from QUEUED to RUNNING
2015-02-16 09:54:35,883 DEBUG zen.collector.config: configLoader gathering configuration
2015-02-16 09:54:35,883 DEBUG zen.collector.config: Fetching daemon configuration properties
2015-02-16 09:54:35,908 DEBUG zen.collector.scheduler: Task configLoader changing state from RUNNING to FETCHING_MISC_CONFIG
2015-02-16 09:54:35,908 DEBUG zen.zencommand: Updated configCycleInterval preference to 360
2015-02-16 09:54:35,908 DEBUG zen.zencommand: Changing config task interval from 20 to 360 minutes
2015-02-16 09:54:35,908 DEBUG zen.collector.scheduler: Stopping task configLoader, <Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627710>
2015-02-16 09:54:35,908 DEBUG zen.collector.scheduler: call finished LoopingCall<1200>(CallableTask: configLoader, *(), **{}) : LoopingCall<1200>(CallableTask: configLoader, *(), **{})
2015-02-16 09:54:35,909 INFO zen.collector.scheduler: Detailed Task Statistics:
configLoader Current State: FETCHING_MISC_CONFIG Successful_Runs: 1 Failed_Runs: 0 Missed_Runs: 0

Detailed Task States:
configLoader State: RUNNING Total: 1 Total Elapsed: 0.0247 Min: 0.0247 Max: 0.0247 Mean: 0.0247 StdDev: 0.0000
configLoader State: QUEUED Total: 1 Total Elapsed: 0.0004 Min: 0.0004 Max: 0.0004 Mean: 0.0004 StdDev: 0.0000

2015-02-16 09:54:35,909 DEBUG zen.collector.config: Heartbeat timeout set to 900s
2015-02-16 09:54:35,909 DEBUG zen.collector.scheduler: add task configLoader, <Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627610> using 21600 second interval
2015-02-16 09:54:35,910 DEBUG zen.zencommand: Updated defaultRRDCreateCommand preference to ('RRA:AVERAGE:0.5:1:600', 'RRA:AVERAGE:0.5:6:600', 'RRA:AVERAGE:0.5:24:600', 'RRA:AVERAGE:0.5:288:600', 'RRA:MAX:0.5:6:600', 'RRA:MAX:0.5:24:600', 'RRA:MAX:0.5:288:600')
2015-02-16 09:54:35,910 DEBUG zen.collector.config: Fetching threshold classes
2015-02-16 09:54:35,921 DEBUG zen.zencommand: Loading classes ['Products.ZenModel.MinMaxThreshold', 'Products.ZenModel.ValueChangeThreshold']
2015-02-16 09:54:35,922 DEBUG zen.collector.config: Fetching collector thresholds
2015-02-16 09:54:35,969 DEBUG zen.thresholds: Updating threshold ('high event queue', ('localhost collector', ''))
2015-02-16 09:54:35,969 DEBUG zen.thresholds: Updating threshold ('zenmodeler cycle time', ('localhost collector', ''))
2015-02-16 09:54:35,969 DEBUG zen.collector.config: Fetching configurations
2015-02-16 09:54:35,994 DEBUG zen.zencommand: updateDeviceConfigs: updatedConfigs=['<device>']
2015-02-16 09:54:35,995 DEBUG zen.zencommand: Processing configuration for <device>
2015-02-16 09:54:35,995 DEBUG zen.daemon: DummyListener: configuration <device> added
2015-02-16 09:54:35,995 DEBUG zen.collector.tasks: Splitting config <device>
2015-02-16 09:54:35,995 DEBUG zen.zencommand: Tasks for config <device>: {'<device> 60 Local': <__main__.SshPerformanceCollectionTask object at 0x49d47d0>}
2015-02-16 09:54:35,995 DEBUG zen.collector.scheduler: add task <device> 60 Local, COMMAND schedule Name: <device> 60 Local configId: <device> Datasources: 1 using 60 second interval
2015-02-16 09:54:35,995 DEBUG zen.collector.scheduler: Task <device> 60 Local starting on 60 second intervals
2015-02-16 09:54:35,996 DEBUG zen.collector.scheduler: Task <device> 60 Local changing state from IDLE to QUEUED
2015-02-16 09:54:35,996 DEBUG zen.zencommand: purgeOmittedDevices: deletedConfigs=
2015-02-16 09:54:35,996 DEBUG zen.collector.scheduler: Task configLoader finished, result: 'Configuration loaded'
2015-02-16 09:54:35,996 DEBUG zen.collector.scheduler: Task <device> 60 Local changing state from QUEUED to RUNNING
2015-02-16 09:54:35,996 DEBUG zen.zencommand: Running command(s) locally
2015-02-16 09:54:35,996 DEBUG zen.collector.scheduler: Task <device> 60 Local changing state from RUNNING to FETCH_DATA
2015-02-16 09:54:35,997 DEBUG zen.zencommand: Running /usr/lib64/nagios/plugins/check_nrpe
2015-02-16 09:54:36,115 DEBUG zen.zencommand: Datasource: Radius Received exit code: 2 Output:
"PROCS CRITICAL: 1 process with args 'radiusd' | procs=1;;2:2;0;\n"
2015-02-16 09:54:36,116 DEBUG zen.zencommand: Process Radius/Radius stopped (2), 0.12 seconds elapsed
2015-02-16 09:54:36,116 DEBUG zen.collector.scheduler: Task <device> 60 Local changing state from FETCH_DATA to PARSING_DATA
2015-02-16 09:54:36,121 DEBUG zen.collector.scheduler: Task <device> 60 Local changing state from PARSING_DATA to STORE_PERF_DATA
2015-02-16 09:54:36,121 DEBUG zen.zencommand: Queued event (total of 1) {'rcvtime': 1424076876.121818, 'manager': 'srv-zenoss5', 'eventKey': 'check_procs_radiusd', 'device': '<device>', 'eventClass': '/Radius', 'summary': 'Datasource Radius/Radius command timed out', 'component': 'Radius', 'monitor': 'localhost', 'agent': 'zencommand', 'severity': 0}
2015-02-16 09:54:36,122 DEBUG zen.zencommand: Queued event (total of 2) {'rcvtime': 1424076876.122189, 'severity': 5, 'component': 'Radius', 'eventClass': '/Radius', 'summary': "PROCS CRITICAL: 1 process with args 'radiusd' ", 'manager': 'srv-zenoss5', 'eventKey': 'check_procs_radiusd', 'agent': 'zencommand', 'device': '<device>', 'message': "PROCS CRITICAL: 1 process with args 'radiusd' ", 'performanceData': [' procs=1;;2:2;0;'], 'monitor': 'localhost'}
2015-02-16 09:54:36,122 DEBUG zen.collector.scheduler: Task <device> 60 Local finished, result: [(<Products.ZenRRD.zencommand.Cmd instance at 0x4626878>, ParsedResults
  events: [{'component': 'Radius',
  'device': '<device>',
  'eventClass': '/Radius',
  'eventKey': 'check_procs_radiusd',
  'severity': 0,
  'summary': 'Datasource Radius/Radius command timed out'},
 {'component': 'Radius',
  'device': '<device>',
  'eventClass': '/Radius',
  'eventKey': 'check_procs_radiusd',
  'message': "PROCS CRITICAL: 1 process with args 'radiusd' ",
  'performanceData': [' procs=1;;2:2;0;'],
  'severity': 5,
  'summary': "PROCS CRITICAL: 1 process with args 'radiusd' "}]
  values: []})]
2015-02-16 09:54:36,125 DEBUG zen.collector.scheduler: Task <device> 60 Local changing state from STORE_PERF_DATA to IDLE
2015-02-16 09:54:36,125 INFO zen.zencommand: 1 devices processed (0 datapoints)
2015-02-16 09:54:36,125 INFO zen.collector.scheduler: Tasks: 2 Successful_Runs: 1 Failed_Runs: 0 Missed_Runs: 0 Queued_Tasks: 0 Running_Tasks: 1
2015-02-16 09:54:36,125 DEBUG zen.collector.scheduler: In shutdown stage before
2015-02-16 09:54:36,126 DEBUG zen.collector.scheduler: Stopping running task <device> 60 Local
2015-02-16 09:54:36,126 DEBUG zen.collector.scheduler: call finished LoopingCall<60>(CallableTask: <device> 60 Local, *(), **{}) : LoopingCall<60>(CallableTask: <device> 60 Local, *(), **{})
2015-02-16 09:54:36,126 DEBUG zen.collector.scheduler: Removing task <device> 60 Local
2015-02-16 09:54:36,126 DEBUG zen.collector.scheduler: Removing task configLoader
2015-02-16 09:54:36,127 DEBUG zen.collector.scheduler: tasks to clean set([<Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627710>, <__main__.SshPerformanceCollectionTask object at 0x49d47d0>, <Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627610>])
2015-02-16 09:54:36,127 DEBUG zen.collector.scheduler: Cleanup on task configLoader <Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627710>
2015-02-16 09:54:36,127 DEBUG zen.collector.scheduler: Scheduler._cleanupTaskComplete: result=None task.name=configLoader
2015-02-16 09:54:36,127 DEBUG zen.collector.scheduler: Cleanup on task <device> 60 Local COMMAND schedule Name: <device> 60 Local configId: <device> Datasources: 1
2015-02-16 09:54:36,128 DEBUG zen.collector.scheduler: Scheduler._cleanupTaskComplete: result=None task.name=<device> 60 Local
2015-02-16 09:54:36,128 DEBUG zen.collector.scheduler: Cleanup on task configLoader <Products.ZenCollector.config.ConfigurationLoaderTask object at 0x4627610>
2015-02-16 09:54:36,128 DEBUG zen.collector.scheduler: Scheduler._cleanupTaskComplete: result=None task.name=configLoader
2015-02-16 09:54:36,129 DEBUG zen.zencommand: Tried to stop reactor that was stopped
2015-02-16 09:54:36,129 INFO zen.zencommand: Daemon CollectorDaemon shutting down
2015-02-16 09:54:36,130 DEBUG zen.zencommand: Sending 2 events, 0 perf events, 0 heartbeats
2015-02-16 09:54:36,157 DEBUG zen.zencommand: Removing service EventService
2015-02-16 09:54:36,157 DEBUG zen.zencommand: Removing service Products.ZenHub.services.CommandPerformanceConfig
2015-02-16 09:54:36,157 DEBUG zen.pbclientfactory: Lost connection to ::1:8789 - [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
2015-02-16 09:54:36,158 DEBUG zen.collector.scheduler: In shutdown stage during
2015-02-16 09:54:36,158 DEBUG zen.collector.scheduler: In shutdown stage after



Subject: Hi,
Author: [Not Specified]
Posted: 2015-02-17 08:17

Hi,

Can anyone suggest what to do? I'm running out of ideas. I also tried to transform the incoming events (changing the eventKey so that the Datasource Command Timeout would not clear them anymore), but it didn't work.
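
(For anyone attempting the same workaround: a transform roughly like the sketch below, attached to the relevant event class, is what is described above. evt is the standard object available in Zenoss event transforms; the matching condition is an assumption.)

    # Hedged sketch of an event transform; the condition is an assumption.
    if 'command timed out' in evt.summary:
        # give the timeout event its own key so it no longer matches the NRPE events
        evt.eventKey = (evt.eventKey or '') + '_timeout'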



Subject: How did you define threshold
Author: Jan Garaj
Posted: 2015-02-17 08:52

How did you define the threshold in Zenoss?




Subject: I haven't added any Datapoint
Author: [Not Specified]
Posted: 2015-02-17 09:05

I haven't added any datapoint to the Data Source, and consequently haven't set any threshold.
I just want to check some processes on a remote machine. When the command is executed I can receive 3 types of reply:
Procs Critical, Procs Warning and Procs OK. I want the Procs OK to clear the Critical and Warning alarms.
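
(Those three replies map onto the conventional Nagios plugin exit codes, which is what the parser keys off; the severity behavior noted below is how it is expected to work, not a quote of the Zenoss source.)

    # Conventional Nagios plugin exit codes relayed by check_nrpe.
    NAGIOS_EXIT = {0: 'OK', 1: 'WARNING', 2: 'CRITICAL', 3: 'UNKNOWN'}
    # An OK (exit 0) result should produce a clear (severity 0) event that closes
    # earlier WARNING/CRITICAL events sharing the same device/component/eventKey.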



Subject: Today IRC dev hours:
Author: Jan Garaj
Posted: 2015-02-17 11:06

Today's IRC dev hour:
[Tuesday 15:50] iancmcc the problem appears to be the eventKey on both events
[Tuesday 15:51] iancmcc the clear event is correct, and the problem is correct, but they have the same eventKey




Subject: Thnx Jan.garaj for your
Author: [Not Specified]
Posted: 2015-02-19 06:20

Thnx Jan.garaj for your replies, but I don't think it is related to the eventKey. I have tried it with the eventKey blank, but it still hasn't worked.
The same situation happens with dns_check (or dig_check). If the DNS server is unable to resolve some domain, it generates an error which is cleared by a Datasource device-name/datasource_name command timeout event. When the next check cycle begins, a new event is generated and the same situation repeats itself until the DNS server resolves the domain name.

Could it be a Zenoss config issue? What am I doing wrong? I have the Nagios plugins under /usr/lib64/nagios/plugins/. These files are owned by the user nagios (with password nagios). I also want to state that I haven't configured any datapoints, just one datasource. For the NRPE installation I followed the instructions in this paper: http://www.skills-1st.co.uk/papers/jane/nrpe_paper.pdf.



Subject: I finally solved this issue.
Author: [Not Specified]
Posted: 2015-02-20 03:37

I finally solved this issue. Perhaps it is not the cleanest solution, but at least it is working.
I had to change the code of /opt/zenoss/Products/ZenRRD/zencommand.py, setting the Datasource Command Timeout severity to Info. Now the Datasource Command Timeout event is not clearing anything, and all my Nagios-related events are correctly cleared based on the return codes.
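
(To make the fix concrete without quoting zencommand.py verbatim: the snippet below is only an illustration built from the debug log earlier in the thread, not the actual source. The timeout event was being queued with severity 0 (Clear), so it matched and closed the NRPE event carrying the same eventKey; raising it to Info stops it from acting as a clear event.)

    # Illustration only - not the literal zencommand.py code.
    # Zenoss severities: 0=Clear, 1=Debug, 2=Info, 3=Warning, 4=Error, 5=Critical.
    SEVERITY_CLEAR, SEVERITY_INFO = 0, 2

    # The timeout event as seen in the debug log, reduced to the relevant fields.
    timeout_event = {
        'eventKey': 'check_procs_radiusd',
        'summary': 'Datasource Radius/Radius command timed out',
        'severity': SEVERITY_CLEAR,   # severity 0 clears events with the same key
    }

    # The local patch described above amounts to raising this severity:
    timeout_event['severity'] = SEVERITY_INFO   # no longer acts as a clear event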


