TECHZEN Zenoss User Community ARCHIVE  

Zenoss 6.2.1, Zope stops answering on its own, unprovoked

Subject: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Jad Baz
Posted: 2018-12-18 09:34

Hello,

I'm having a weird problem and I can't get to the bottom of it.

Zope stops answering after some time.



The failing healthcheck is the following (9080 is the port Zope exposes):
curl -A 'Zope answering healthcheck' --retry 3 --max-time 2 -s http://localhost:9080/zport/ruok | grep -q imok
This simply times out.
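
For anyone who wants to run the same check by hand or from a script, here is a minimal sketch that wraps the command above in a function. The function name zope_is_ok and the ZOPE_URL variable are made up for illustration; the curl flags and URL are the ones from the post.

```shell
#!/bin/sh
# URL taken from the post; 9080 is the port Zope exposes in that setup.
# Override ZOPE_URL if your port differs.
ZOPE_URL="${ZOPE_URL:-http://localhost:9080/zport/ruok}"

# zope_is_ok (hypothetical helper name): succeeds only if the response
# body contains "imok".
zope_is_ok() {
    # -A sets the User-Agent, --retry 3 retries transient failures,
    # --max-time 2 limits each transfer to 2 seconds, -s hides progress output.
    curl -A 'Zope answering healthcheck' --retry 3 --max-time 2 -s "$ZOPE_URL" \
        | grep -q imok
}
```

Usage: zope_is_ok && echo "zope ok" || echo "zope not answering"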

So we run netstat:
[root@testcontroller ~]# serviced service attach zope netstat -tlnp | sort -n
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:11211           0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:11212           0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:15672           0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:44001           0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:5042            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:5043            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:5443            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:5601            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:5672            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:8084            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:8444            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:8789            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:8983            0.0.0.0:*               LISTEN      1/serviced-controll
tcp        0      0 0.0.0.0:9080            0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22350                :::*                    LISTEN      1/serviced-controll
tcp6       0      0 :::443                  :::*                    LISTEN      1/serviced-controll


9080 is simply down.
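
Note that in the listing above, the 9080 row is still in LISTEN state, but it is the only row whose PID/Program name column is "-". To pull out just that column for one port instead of reading the whole list, something like the sketch below could be used. The function name port_owner is hypothetical, and it assumes the column layout shown above.

```shell
#!/bin/sh
# port_owner (hypothetical helper name): reads `netstat -tlnp` output on
# stdin and prints the PID/Program name column for the given listening port.
port_owner() {
    awk -v suffix=":$1" '
        # $4 = Local Address, $6 = State, $7 = PID/Program name
        $6 == "LISTEN" &&
        substr($4, length($4) - length(suffix) + 1) == suffix { print $7 }
    '
}
```

Usage: serviced service attach zope netstat -tlnp | port_owner 9080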

The Zope logs show nothing unusual.
The last events in Kibana (Z2.log):

December 18th 2018, 03:28:03.000
127.0.0.1 - Anonymous 18/Dec/2018:01:28:03 +0000 "GET /zport/ruok HTTP/1.1" 200 178 "" "Zope answering healthcheck"
December 18th 2018, 03:28:00.000
172.17.0.1 - Anonymous 18/Dec/2018:01:28:00 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:55.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:55 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:50.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:50 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:45.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:45 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:45.000
127.0.0.1 - Anonymous 18/Dec/2018:01:27:45 +0000 "GET /zport/ruok HTTP/1.1" 200 178 "" "Zope answering healthcheck"
December 18th 2018, 03:27:40.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:40 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:39.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:39 +0000 "GET /zport/dmd/zenossStatsView/ HTTP/1.1" 200 806 "" "python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-957.1.3.el7.x86_64"
December 18th 2018, 03:27:39.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:39 +0000 "GET /zport/dmd/zenossStatsView/ HTTP/1.1" 200 806 "" "python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-957.1.3.el7.x86_64"
December 18th 2018, 03:27:35.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:35 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:30.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:30 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:30.000
127.0.0.1 - Anonymous 18/Dec/2018:01:27:30 +0000 "GET /zport/ruok HTTP/1.1" 200 178 "" "Zope answering healthcheck"
December 18th 2018, 03:27:28.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:28 +0000 "GET /robots.txt HTTP/1.1" 200 221 "" "Zenoss ready healthcheck"
December 18th 2018, 03:27:24.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:24 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
December 18th 2018, 03:27:19.000
172.17.0.1 - Anonymous 18/Dec/2018:01:27:19 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
So from the looks of it, there was some regular traffic to Zope and then it suddenly stopped.
The only other log in /opt/zenoss/log is zeneventserver.log, and its last entry is from one day before the crash.

From the above, I can only guess that this is a performance issue since there was not a single error anywhere.
The thing is, the system memory, serviced memory and Zope memory were all good. This is a 3-hour snapshot of the memory for Zope leading up to the crash. Again, nothing unusual.



Moreover, MetricShipper stops answering as a result of this.
The error that comes up is:
Unable to connect to consumer ws://localhost:8080/ws/metrics/store


The final thing I'll say is that this has happened before and a simple restart of Zope does the job. However, we can't keep waiting for it to fail in production and hit restart. But what this does indicate is that it is not a configuration issue or a state issue or some persistent error. If it were so, it would persist across restarts.

So I'm debugging some performance issue and don't know what I can do anymore.
The thing is, I've done load testing for a few hours several times this week and on all occasions, Zope was still up. I've looked at all logs and there was nothing special happening around that time. It's just so random. A bit like radioactive decay!

I'm at my wit's end here, honestly.
Any ideas?

------------------------------
Jad
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Jad Baz
Posted: 2018-12-19 11:52

I think I've figured it out. It was a spike in CPU usage.

------------------------------
Jad
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Ryan Matte
Posted: 2018-12-26 12:53

There is also currently a known issue with Zope's caching layer which can cause deadlocks.  I believe there's a fix coming for that in the next release but for the time being you can crontab a restart of your zope instances (zope, zauth, zenapi, zenreports) every night at midnight to help prevent it from getting to the point where a deadlock can happen.

See: https://jira.zenoss.com/browse/ZEN-30762

------------------------------
Ryan Matte
------------------------------

Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Jad Baz
Posted: 2019-03-27 14:58

Coming back to this, it is not a CPU or memory usage issue.
So far, I've experienced it as absolutely random.

------------------------------
Jad
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Jad Baz
Posted: 2019-03-29 07:44

See also the thread "Zenoss 6.1.1 graphs show no data, zenhub and MetricShipper failing some health checks".


------------------------------
Jad
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Adan Mendoza
Posted: 2019-05-21 09:40

I am seeing the same issue with our Zenoss setup. When Zope stops working, MetricShipper stops working. This used to happen every 7 days and now it seems to just be random.

------------------------------
Adan Mendoza
VTX1
Raymondville TX
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Ryan Matte
Posted: 2019-05-21 11:34

This is a known issue where over time zope threads will become unresponsive.  We've made several changes to address this in recent versions of the code base.  For the time being you can just schedule a zope restart once a night during off-hours which should prevent the issue from occurring during normal operation.  You'll want to restart all of the zope services (zope, zenapi, zenreports, zauth).
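
A minimal sketch of what such a restart script might look like is below. The four service names are the ones listed above; the function name restart_zopes and the DRY_RUN switch are made up for illustration, and the serviced path is the one used elsewhere in this thread. Check the names against what "serviced service list" shows on your system before relying on it.

```shell
#!/bin/sh
# Service names as given in the thread; adjust to match your deployment.
ZOPE_SERVICES="zope zenapi zenreports zauth"

# restart_zopes (hypothetical helper name): restarts each Zope-based service.
# Set DRY_RUN to any non-empty value to print what would be done instead.
restart_zopes() {
    for svc in $ZOPE_SERVICES; do
        if [ -n "$DRY_RUN" ]; then
            echo "would restart $svc"
        else
            # Keep going if one restart fails, but report it on stderr.
            /usr/bin/serviced service restart "$svc" \
                || echo "restart of $svc failed" >&2
        fi
    done
}
```

Usage: call restart_zopes at the end of the script and schedule the script from cron during off-hours. Running it once with DRY_RUN=1 shows which services it would touch.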

------------------------------
Ryan Matte
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Jay Stanley
Posted: 2019-05-23 10:17

I have used cron to restart the UI every day at 1am using the command:

/usr/bin/serviced service restart "User Interface"
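
As a crontab entry, the 1am schedule could look like the line below. This is only a sketch: it assumes the entry is installed in the crontab of a user who is allowed to run serviced (for example root, via crontab -e), which the post does not specify.

```
# minute hour day-of-month month day-of-week  command
0 1 * * * /usr/bin/serviced service restart "User Interface"
```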

------------------------------
jstanley
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Andrew Kirch
Posted: 2019-12-18 18:31

This has been my solution as well.  Restart Zope nightly.  It's Zope, this is how it is.

------------------------------
Andrew Kirch
AWS Principle Architect
ECS Tuning
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Jad Baz
Posted: 2019-11-19 09:05

Hello,

Any idea if this problem should be fixed in 6.3?

------------------------------
Jad
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Arthur
Posted: 2019-12-10 15:57

I still have it. Zenoss Core 6.3.2 and CC 1.6.3

------------------------------
Arthur
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Shane Quinsey
Posted: 2020-01-08 21:42

Ditto here :(

------------------------------
Shane, Australia
ZenN00b
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Brad
Posted: 2021-07-23 16:00

Fresh install of 6.3.2 seems to have this issue. Have not added any additional zenpacks. Only did an initial scan of our network for devices. Located 1100 devices. Checked on it a few days later, and zope was dead.

------------------------------
Brad
Principal Engineer Systems Architect
Big Tech
WA
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Sam Urai
Posted: 2021-05-25 13:32

I have the same problem with a recently set up CE Core v6.3.2 and CC 1.6.3.

Any fix coming in for this issue?

------------------------------
Sam
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Sam Urai
Posted: 2021-06-17 17:10

@Michael - Can you please help here? For me, the issue occurs every few days.

Thanks much!

------------------------------
Sam
------------------------------


Subject: RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
Author: Paul Fielding
Posted: 2021-07-24 10:11

I've found it still to be a general issue.  As was previously suggested in this thread, the best answer is still to just schedule a zope restart nightly.  After doing that, I haven't experienced a problem with the zopes in over a year.  It's a band aid, but a very effective one.

------------------------------
Paul Fielding
AB
------------------------------

