Description
I have a three-node cluster to test pause_minority. All nodes run RHEL 7.9, Erlang 23.3.4.5, and RabbitMQ 3.9.0. I use RPMs from https://github.com/rabbitmq (erlang-rpm and rabbitmq-server). The nodes are joined to the cluster manually, and I use the same rabbitmq.conf on all three nodes (rabbit1, rabbit2, rabbit3).
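For context, the partition-handling behavior under test is controlled by a single rabbitmq.conf setting; a minimal sketch of the relevant line (the actual file is attached at the bottom of this report):

```ini
# Minimal sketch: pause any node that finds itself in the minority
# side of a network partition. See the attached rabbitmq.conf for
# the full configuration actually used.
cluster_partition_handling = pause_minority
```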
When I pull the network cable on rabbit2, it detects minority status about a minute later and stops its applications. Ninety seconds after that, systemd decides something is wrong with rabbitmq-server and restarts it:
systemd: rabbitmq-server.service: main process exited, code=killed, status=9/KILL
systemd: Unit rabbitmq-server.service entered failed state.
systemd: rabbitmq-server.service failed.
systemd: rabbitmq-server.service holdoff time over, scheduling restart.
systemd: Stopped RabbitMQ broker.
systemd: Starting RabbitMQ broker...
After that, nothing else happens even after I reconnect the cable. I would expect rabbit2 to re-join the cluster, but that seems to be prevented by systemd restarting RabbitMQ.
The node re-joins the cluster when I reconnect the cable before 90 seconds, but systemd mercilessly kills and restarts RabbitMQ anyway after 90 seconds.
Here is the timeline of what I did:
17:57:08 disconnect eth0
17:58:10 Node rabbit2 detects loss of connectivity
17:59:40 systemd reports: stop-sigterm timed out. Killing
18:01:41 reconnect eth0
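For what it's worth, the 90-second window between the minority detection and the SIGKILL matches systemd's default stop timeout on RHEL 7 (DefaultTimeoutStopSec=90s). I have not tried it yet, but a drop-in override raising that timeout for the unit might avoid the kill; a sketch, assuming the stock rabbitmq-server.service unit from the RPM:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/rabbitmq-server.service.d/timeout.conf
# (file name and value are illustrative, not something I have tested).
[Service]
# Give the paused broker more time before systemd escalates to SIGKILL.
TimeoutStopSec=600
```

This would need `systemctl daemon-reload` to take effect, and it only addresses the kill itself, not whether a paused node should look "stopped" to systemd in the first place.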
Log files and rabbitmq.conf attached.
[email protected]
/var/log/messages
rabbitmq.conf