![]() |
![]() |
Subject: | Zenoss 4.2.5 - zenoss.queues.zep.zenevents fills into the 100's of thousands, event processing stalls/crawls. |
Author: | [Not Specified] |
Posted: | 2015-06-30 21:02 |
I'm having some trouble with this one and I'm not sure what to try next. On a recently installed Zenoss 4.2.5 system, minutes or hours after starting the services (mysql, rabbitmq, and zenoss), events start trickling in or completely stop arriving. When this happens, the zenoss.queues.zep.zenevents queue starts filling up rapidly.
After the initial install, I had 300 devices on the system for about two weeks with no problems. I then meticulously recreated zenpacks and monitoring templates, followed by an import of the remaining 2700 devices (using zenbatchload). About 500 of the 3000 devices are ping only. This is all part of a migration from 3.2.1 to 4.2.5. Our 3.2.1 instance actually has 4100+ devices. It works quite well but is very slow.
First some info about the system:
Baremetal, Centos 6.6 install using autodeploy script
8 core - 3.7 ghz
128GB RAM
Versions:
mysql Ver 14.14 Distrib 5.5.37
rabbitmq-server-2.8.7-1
Java :
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
We are using a hardware RAID 1 with two standard 500GB drives for / (including /var/lib/mysql) and a hardware RAID 1 with two high performance Hitachi Ultrastar C15K300's for /export/perf.
-- cut --
[zenoss@zen4 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/zen4-root
50G 5.1G 42G 11% /
tmpfs 64G 56K 64G 1% /dev/shm
/dev/sda1 364M 70M 276M 21% /boot
/dev/mapper/zen4-zenoss
58G 31G 25G 56% /opt/zenoss
/dev/sdb1 274G 19G 242G 8% /export/perf
/dev/mapper/zen4-snaps
197G 78G 109G 42% /snapshots
/dev/mapper/zen4-var_lib_mysql
84G 39G 42G 49% /var/lib/mysql
/dev/mapper/zen4-var_log
15G 7.0G 7.0G 51% /var/log
-- cut --
It appears that the system may eventually start processing events again. I came in today after a long weekend to find that the events were still processing and that the rabbitmq queues were empty. Three hours later, I found that the events had again stopped.
"rabbitmqctl -p /zenoss list_queues; rabbitmqctl list_connections" looks like this :
-- cut --
Listing queues ...
celery 0
zenoss.queues.zep.signal 0
zenoss.queues.zep.modelchange 1070
zenoss.queues.zep.migrated.summary 0
zenoss.queues.zep.rawevents 0
zenoss.queues.zep.heartbeats 0
zenoss.queues.zep.zenevents 799137
zen4.webpass.net.celeryd.pidbox 0
zenoss.queues.zep.migrated.archive 0
...done.
Listing connections ...
zenoss ::1 59080 running
zenoss ::1 59023 running
zenoss ::1 59064 running
zenoss ::1 59047 running
zenoss 127.0.0.1 52157 running
zenoss 127.0.0.1 52142 running
zenoss ::1 54911 running
zenoss ::1 33435 running
zenoss ::1 59046 running
zenoss ::1 59066 running
zenoss ::1 59062 running
zenoss ::1 59071 running
zenoss ::1 32841 running
zenoss ::1 59067 running
zenoss ::1 59065 running
zenoss ::1 59063 running
zenoss ::1 59072 running
...done.
-- cut --
If I run the following script, queues will clear out and events will start processing again, but it this doesn't help when the problem recurs every few hours.
-- cut --
#!/bin/bash
#Reset Zenoss
/etc/init.d/zenoss stop
rm -rf $ZENHOME/var/zeneventserver/index
# Truncate tables in mysql - if I only clear the rabbitmq queues, the problem will recur within about 2 hours. Truncating zenoss_zep gives me about 24 hours.
mysql -uroot -p zenoss_zep -e "SET foreign_key_checks = 0;truncate event_summary;truncate event_summary_index_queue;truncate event_archive;set foreign_key_checks=1;commit;"
/etc/init.d/mysql restart
# Reset RabbitMQ
export VHOST="/zenoss"
export USER="zenoss"
export PASS="{pass}"
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
rabbitmqctl add_vhost $VHOST
rabbitmqctl add_user "$USER" "$PASS"
rabbitmqctl set_permissions -p "$VHOST" "$USER" '.*' '.*' '.*'
rabbitmqctl -p $VHOST list_queues
/etc/init.d/rabbitmq-server restart
/etc/init.d/zenoss restart
-- cut --
Below are some of my configs. I've already made several modifications related to this issue.
-- cut --
** FILE: messaging.conf **
exchange.default.delivery_mode = 2
exchange.default.compression = deflate
queue.default.x-message-ttl = 86400000
queue.default.x-expires = 86400000
** FILE: zencommand.conf **
hubhost localhost
hubport 8789
hubusername admin
monitor localhost
initialHubTimeout 30
eventflushseconds 5.0
eventflushchunksize 50
maxqueuelen 5000
logseverity 10
maxlogsize 10240
maxbackuplogs 3
uid zenoss
maxparallel 50
** FILE: zeneventd.conf **
workers 8
** FILE: zeneventlog.conf **
hubhost localhost
hubport 8789
hubusername admin
hubpassword zenoss
monitor localhost
initialHubTimeout 30
eventflushseconds 5.0
eventflushchunksize 50
logseverity 20
maxlogsize 10240
maxbackuplogs 3
uid zenoss
maxparallel 500
queryTimeout 300
** FILE: zeneventserver.conf **
zep.jdbc.pool.test_on_borrow=true
zep.database.optimize_minutes=0
zep.jdbc.pool.min_idle=0
zep.jdbc.pool.time_between_eviction_runs_millis=5000
zep.jdbc.pool.min_evictable_idle_time_millis=5000
** FILE: zenhub.conf **
maxlogsize 10240
maxbackuplogs 3
uid zenoss
xmlrpcport 8081
pbport 8789
monitor localhost
workers 4
** FILE: zenmodeler.conf **
parallel 4
** FILE: zenperfsnmp.conf **
maxparallel 1000
** FILE: zentrap.conf **
hubhost localhost
hubport 8789
hubusername admin
hubpassword zenoss
monitor localhost
initialHubTimeout 30
eventflushseconds 5.0
eventflushchunksize 50
maxqueuelen 5000
logseverity 30
maxlogsize 10240
maxbackuplogs 3
uid zenoss
maxparallel 500
trapport 162
captureAll False
** FILE: zope.conf **
%define INSTANCE /opt/zenoss
default-zpublisher-encoding utf-8
instancehome $INSTANCE
effective-user zenoss
enable-product-installation off
zserver-threads 5
python-check-interval 1800
level info
path $INSTANCE/log/event.log
level info
level WARN
path $INSTANCE/log/Z2.log
format %(message)s
conflict-error-log-level debug
address 8080
mount-point /temp_folder
%import relstorage
keep-history false
%include zodb_db_session.conf
mount-point /
cache-size 90252
pool-size 10
%import relstorage
cache-servers 127.0.0.1:11211
cache-module-name memcache
cache-local-mb 8192
keep-history false
%include zodb_db_main.conf
** FILE: /etc/my.cnf **
[mysqld]
bind-address=127.0.0.1
skip_external_locking
#innodb_buffer_pool_size = 4096M
innodb_log_file_size = 512M
innodb_additional_mem_pool_size = 32M
innodb_log_buffer_size = 8M
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2
# in mysql 5.5+, innodb_thread_concurrency should be 0
innodb_thread_concurrency = 0
innodb_file_format = Barracuda
innodb_purge_threads = 1
# a non-zero innodb_max_purge_lag can cause significant delays in query processing, set to 0 unless you are a mysql savant
innodb_max_purge_lag = 0
query_cache_size = 0
query_cache_type = OFF
max_allowed_packet = 64M
wait_timeout = 86400
interactive_timeout = 86400
max_connections = 500
max_user_connections = 1500
# Test tweaks - dg
innodb_lock_wait_timeout = 50
innodb_buffer_pool_size = 96G
[client]
user = zenoss
[mysql]
max_allowed_packet = 64M
[mysqldump]
max_allowed_packet = 64M
-- cut --
I recently reset everything and wiped out the logs. I'll post some log data in about 24 hours.
Some of the changes that I've already made were suggested in the following posts & links:
http://community.zenoss.org/message/73748
http://community.zenoss.org/message/68912
http://community.zenoss.org/docs/DOC-2445#HowdoIstopHeartbeaterrors
http://www.zenoss.org/forum/3036
http://comments.gmane.org/gmane.comp.monitoring.zenoss.user/47915
https://jira.zenoss.com/browse/ZEN-2937
I've also gone over:
http://wiki.zenoss.org/Working_with_Queues
http://wiki.zenoss.org/Queue_Troubleshooting
and others...
Any help would be very greatly appreciated.
Subject: | What RPS? I know there is an |
Author: | Jay Stanley |
Posted: | 2015-07-06 14:37 |
What RPS I know there is an issue with zeneventserver and deadlocks with increased raw events. Supposed to be fixed in the most recent RPS(s)
Subject: | Hi Vook, you can fix that |
Author: | [Not Specified] |
Posted: | 2015-07-08 16:57 |
Hi Vook, you can fix that error about ZEP not being available by doing...
zeneventserver stop
cd $ZENHOME/var/zeneventserver/index
rm -rf summary archive
Then re-running the zenup install.
Cheers.
Subject: | After about five days of |
Author: | [Not Specified] |
Posted: | 2015-07-14 14:03 |
After about five days of seemingly adequate performance, this problem returned July 13. I applied RPS 457 on July 8. It's worth noting that prior to patching with RPS 457, the system would run for 1-2 days before this problem arrived. Also I noticed on July 13 that the zep.modelchange queue started filling up very slowly before zep.zenevents did. When I arrived on July 14, I observed that events in Zenoss had either stopped or slowed to a trickle.
Here is my current queue state:
-- cut --
[root@zen4 ~]# rabbitmqctl -p /zenoss list_queues; rabbitmqctl list_connections
Listing queues ...
celery 0
zenoss.queues.zep.signal 0
zenoss.queues.zep.modelchange 5122
zenoss.queues.zep.migrated.summary 0
zenoss.queues.zep.rawevents 1
zenoss.queues.zep.heartbeats 4
zenoss.queues.zep.zenevents 1668610
zen4.webpass.net.celeryd.pidbox 0
zenoss.queues.zep.migrated.archive 0
...done.
Listing connections ...
zenoss ::1 41096 running
zenoss ::1 41031 running
zenoss ::1 41022 running
zenoss ::1 41047 running
zenoss ::1 41066 running
zenoss ::1 41062 running
zenoss ::1 41048 running
zenoss 127.0.0.1 42608 running
zenoss ::1 55502 running
zenoss ::1 41023 running
zenoss ::1 41039 running
zenoss ::1 41095 running
zenoss ::1 41065 running
zenoss ::1 41053 running
zenoss 127.0.0.1 42588 running
...done.
-- cut --
I see this log entry in zeneventd.log and eventdeventletworker.log :
-- cut --
2015-07-13 06:38:37,864 INFO zen.Events: Transform error Unable to acquire comm
it lock; aborting transaction
2015-07-13 06:38:37,865 INFO zen.Events: Deleting old pickle files ...
2015-07-13 06:38:37,865 INFO zen.Events: Deleted 1 old pickle files.
2015-07-13 06:38:37,866 WARNING zen.Events: Error processing transform/mapping
on Event Class /Status/Perf
Problem on line 8: StorageError: Unable to acquire commit lock
updateDb()
Transform:
0 # SET OPERSTATUS ON IPINTERFACE COMPONENT
1 if component is not None and evt.eventKey == 'ifOperStatus_ifOperStatus|ifO
perStatusChange':
2 evt._action='drop'
3 operStatus = int(float(getattr(evt,'current', '0')))
4 if component.operStatus != operStatus:
5 @transact
6 def updateDb():
7 component.operStatus=operStatus
8 updateDb()
-- cut --
This entry is from three hours earlier in zenhub.log :
-- cut --
2015-07-13 03:31:33,622 INFO zen.ZenHub: Worker (17774) reports Unhandled Error
Traceback (most recent call last):
File "/opt/zenoss/lib/python/twisted/spread/pb.py", line 841, in _recvMessage
netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
File "/opt/zenoss/lib/python/twisted/spread/flavors.py", line 114, in remoteM
essageReceived
state = method(*args, **kw)
File "/opt/zenoss/Products/ZenHub/PBDaemon.py", line 97, in inner
return callable(*args, **kw)
File "/opt/zenoss/lib/python/twisted/internet/defer.py", line 1141, in unwind
Generator
return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
---
File "/opt/zenoss/lib/python/twisted/internet/defer.py", line 1020, in _inlin
eCallbacks
result = g.send(result)
File "/opt/zenoss/Products/ZenHub/zenhubworker.py", line 204, in remote_execu
te
result = runOnce()
File "/opt/zenoss/Products/ZenHub/zenhubworker.py", line 190, in runOnce
res = m(*args, **kw)
File "/opt/zenoss/Products/ZenHub/PBDaemon.py", line 105, in inner
traceback.format_exc())
Products.ZenHub.PBDaemon.RemoteException: Remote exception:
eback:
File "/opt/zenoss/Products/ZenHub/zenhubworker.py", line 263, in
reactor.run()
File "/opt/zenoss/lib/python/twisted/internet/base.py", line 1162, in run
self.mainLoop()
File "/opt/zenoss/lib/python/twisted/internet/base.py", line 1174, in mainLoop
self.doIteration(t)
File "/opt/zenoss/lib/python/twisted/internet/selectreactor.py", line 140, in doSelect
_logrun(selectable, _drdw, selectable, method, dict)
File "/opt/zenoss/lib/python/twisted/python/log.py", line 84, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "/opt/zenoss/lib/python/twisted/python/log.py", line 69, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/opt/zenoss/lib/python/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/opt/zenoss/lib/python/twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
File "/opt/zenoss/lib/python/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
why = getattr(selectable, method)()
File "/opt/zenoss/lib/python/twisted/internet/tcp.py", line 460, in doRead
rval = self.protocol.dataReceived(data)
File "/opt/zenoss/lib/python/twisted/spread/banana.py", line 223, in dataReceived
gotItem(item)
File "/opt/zenoss/lib/python/twisted/spread/banana.py", line 153, in gotItem
self.callExpressionReceived(item)
File "/opt/zenoss/lib/python/twisted/spread/banana.py", line 116, in callExpressionReceived
self.expressionReceived(obj)
File "/opt/zenoss/lib/python/twisted/spread/pb.py", line 515, in expressionReceived
method(*sexp[1:])
File "/opt/zenoss/lib/python/twisted/spread/pb.py", line 827, in proto_message
self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
File "/opt/zenoss/lib/python/twisted/spread/pb.py", line 841, in _recvMessage
netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
File "/opt/zenoss/lib/python/twisted/spread/flavors.py", line 114, in remoteMessageReceived
state = method(*args, **kw)
File "/opt/zenoss/Products/ZenHub/PBDaemon.py", line 97, in inner
return callable(*args, **kw)
File "/opt/zenoss/lib/python/twisted/internet/defer.py", line 1141, in unwindGenerator
return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
File "/opt/zenoss/lib/python/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/opt/zenoss/Products/ZenHub/zenhubworker.py", line 204, in remote_execute
result = runOnce()
File "/opt/zenoss/Products/ZenHub/zenhubworker.py", line 190, in runOnce
res = m(*args, **kw)
File "/opt/zenoss/Products/ZenHub/PBDaemon.py", line 97, in inner
return callable(*args, **kw)
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 157, in remote_applyDataMaps
result = inner(map)
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 153, in inner
return self._do_with_retries(action)
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 179, in _do_with_retries
return action()
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 144, in action
completed= bool(adm._applyDataMap(device, map))
File "/opt/zenoss/lib/python/ZODB/transact.py", line 51, in g
_commit(note)
File "/opt/zenoss/lib/python/ZODB/transact.py", line 23, in _commit
t.commit()
File "/opt/zenoss/lib/python/transaction/_transaction.py", line 340, in commit
t, v, tb = self._saveAndGetCommitishError()
File "/opt/zenoss/lib/python/transaction/_transaction.py", line 333, in commit
self._commitResources()
File "/opt/zenoss/lib/python/transaction/_transaction.py", line 473, in _commitResources
reraise(t, v, tb)
File "/opt/zenoss/lib/python/transaction/_transaction.py", line 450, in _commitResources
rm.tpc_vote(self)
File "/opt/zenoss/lib/python/ZODB/Connection.py", line 781, in tpc_vote
s = vote(transaction)
File "/opt/zenoss/lib/python/relstorage/storage.py", line 794, in tpc_vote
return self._vote()
File "/opt/zenoss/lib/python/relstorage/storage.py", line 828, in _vote
self._prepare_tid()
File "/opt/zenoss/lib/python/relstorage/storage.py", line 697, in _prepare_tid
adapter.locker.hold_commit_lock(cursor, ensure_current=True)
File "/opt/zenoss/lib/python/relstorage/adapters/locker.py", line 114, in hold_commit_lock
raise StorageError("Unable to acquire commit lock")
StorageError: Unable to acquire commit lock
: Traceback (most recent call last):
File "/opt/zenoss/Products/ZenHub/PBDaemon.py", line 97, in inner
return callable(*args, **kw)
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 157, in remote_applyDataMaps
result = inner(map)
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 153, in inner
return self._do_with_retries(action)
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 179, in _do_with_retries
return action()
File "/opt/zenoss/Products/ZenHub/services/ModelerService.py", line 144, in action
completed= bool(adm._applyDataMap(device, map))
File "/opt/zenoss/lib/python/ZODB/transact.py", line 44, in g
r = f(*args, **kwargs)
File "/opt/zenoss/Products/DataCollector/ApplyDataMap.py", line 217, in _applyDataMap
changed = self._updateRelationship(tobj, datamap)
File "/opt/zenoss/ZenPacks/ZenPacks.zenoss.PythonCollector-1.5.2-py2.7.egg/ZenPacks/zenoss/PythonCollector/patches/platform.py", line 38, in _updateRelationship
return original(self, device, relmap)
File "/opt/zenoss/Products/DataCollector/ApplyDataMap.py", line 280, in _updateRelationship
objchange, obj = self._createRelObject(device, objmap, rname)
File "/opt/zenoss/Products/DataCollector/ApplyDataMap.py", line 472, in _createRelObject
rel._setObject(remoteObj.id, remoteObj)
File "/opt/zenoss/Products/ZenRelations/ToManyContRelationship.py", line 128, in _setObject
self.addRelation(object)
File "/opt/zenoss/Products/ZenRelations/ToManyContRelationship.py", line 118, in addRelation
ToManyRelationshipBase.addRelation(self, obj)
File "/opt/zenoss/Products/ZenRelations/RelationshipBase.py", line 93, in addRelation
self._add(obj)
File "/opt/zenoss/Products/ZenRelations/ToManyContRelationship.py", line 155, in _add
self._objects[id] = aq_base(obj)
File "/opt/zenoss/lib/python/ZODB/Connection.py", line 979, in register
self._register(obj)
File "/opt/zenoss/lib/python/ZODB/Connection.py", line 989, in _register
self.transaction_manager.get().join(self)
File "/opt/zenoss/lib/python/transaction/_transaction.py", line 216, in join
self._prior_operation_failed() # doesn't return
File "/opt/zenoss/lib/python/transaction/_transaction.py", line 212, in _prior_operation_failed
self._failure_traceback
-- cut --
Zenmodeler.log shows no entries between after 02:30 on July 14 (until I restarted Zenoss the next morning).
< |
Previous Zenoss 3.2.1 Zenpack zenHttpComponent error |
Next can't snmpwalk from zenoss webui |
> |