Systemd prevents pause_minority re-join by restarting RabbitMQ #3261

@robertdahlem

I have a three-node cluster to test pause_minority. All nodes run RHEL 7.9, Erlang 23.3.4.5 and RabbitMQ 3.9.0, installed from the RPMs at https://github.com/rabbitmq (erlang-rpm and rabbitmq-server). The nodes (rabbit1, rabbit2, rabbit3) were joined to the cluster manually, and all three use the same rabbitmq.conf.

When I pull the network cable from rabbit2, it detects minority status about a minute later and stops the applications. 90 seconds after that, systemd reports that stop-sigterm timed out, kills rabbitmq-server and restarts it:

systemd: rabbitmq-server.service: main process exited, code=killed, status=9/KILL
systemd: Unit rabbitmq-server.service entered failed state.
systemd: rabbitmq-server.service failed.
systemd: rabbitmq-server.service holdoff time over, scheduling restart.
systemd: Stopped RabbitMQ broker.
systemd: Starting RabbitMQ broker...

After that, nothing else happens, even once I reconnect the cable. I would expect rabbit2 to re-join the cluster, but the systemd restart of RabbitMQ seems to prevent that.

If I reconnect the cable within those 90 seconds, the node does re-join the cluster, but systemd still kills and restarts RabbitMQ once the 90 seconds are up.

Here is the timeline of what I did:

17:57:08 disconnect eth0
17:58:10 Node rabbit2 detects loss of connectivity
17:59:40 systemd reports: stop-sigterm timed out. Killing
18:01:41 reconnect eth0
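
To make the intervals in this timeline explicit, here is a minimal Python sketch that computes the gap between consecutive events. It assumes all four timestamps fall on the same day; the `EVENTS` list and the `gaps_in_seconds` helper are names made up for this illustration and are not part of RabbitMQ or systemd.

```python
from datetime import datetime

# Timestamps copied from the timeline above (same day assumed).
EVENTS = [
    ("17:57:08", "disconnect eth0"),
    ("17:58:10", "rabbit2 detects loss of connectivity"),
    ("17:59:40", "systemd: stop-sigterm timed out, killing"),
    ("18:01:41", "reconnect eth0"),
]


def gaps_in_seconds(events):
    """Return the number of seconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Print each event with the delay since the previous one.
    for (_, label), gap in zip(EVENTS[1:], gaps_in_seconds(EVENTS)):
        print(f"+{gap:>3}s  {label}")
```

The gaps come out as 62 s, 90 s and 121 s. The 90 s gap between the node pausing and the kill equals systemd's stock default stop timeout (DefaultTimeoutStopSec=90s). Whether that default is the timeout actually in effect for rabbitmq-server.service on these hosts is an assumption I have not verified.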

Log files and rabbitmq.conf attached.
[email protected]
/var/log/messages
rabbitmq.conf
