[bitnami/rabbitmq] Cluster does not recover after reboot, even with forceBoot: true #25698
Comments
Hi! In this kind of scenario it may be necessary to perform some manual intervention. Could you try running the chart with diagnosticMode.enabled=true and perform the initialization steps manually?
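For context, here is a minimal sketch of what enabling diagnostic mode looks like. It assumes the usual bitnami values layout (`diagnosticMode.enabled`, `command`, `args`), so double-check against the chart's own values.yaml:

```yaml
# values.yaml (sketch): start the pods without running the RabbitMQ entrypoint,
# so you can exec into each replica and run the recovery steps by hand.
diagnosticMode:
  enabled: true
  # Typical bitnami defaults once diagnostic mode is on:
  command:
    - sleep
  args:
    - infinity
```

With the pods idling like this you can `kubectl exec` into each replica, inspect the Mnesia data directory, and run `rabbitmqctl` commands manually.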
Well, I know how to fix it manually, but what I'm suggesting is to find a better way to handle this in the chart. As it stands, I would not trust this deployment in production: one cannot expect that the nodes will always shut down in a specific order.
I think the main issue to solve is to make sure all 3 pods are started all the time, no matter if the probes fail; otherwise the cluster will never recover with just 1 node.
Hi! We plan to change the podManagementPolicy to Parallel to avoid this kind of issue. In the meantime, you can set it in your values.yaml; we plan to make it the default. This was recommended by the upstream RabbitMQ team, you can see it here: #16081 (comment)
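For anyone landing here, a sketch of that override (the top-level `podManagementPolicy` value is how current bitnami/rabbitmq chart versions expose it; verify against your chart version):

```yaml
# values.yaml (sketch): let all StatefulSet pods start at once instead of
# waiting for the previous replica to become Ready. With OrderedReady, pod 0
# can block forever waiting for peers that are never scheduled.
podManagementPolicy: Parallel
replicaCount: 3
```

With Parallel, all replicas come up together, so each node can find its peers within RabbitMQ's startup window instead of deadlocking on a failing readiness probe.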
@razvanphp RabbitMQ nodes do not expect any specific startup or shutdown sequence starting with 3.7.0. They do expect all peers to come online within 5 minutes by default. The (in)famous…
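The 5-minute window mentioned above comes from RabbitMQ's table-loading retries (roughly 10 retries × 30 s by default). If your nodes legitimately need longer to come back, the chart's extraConfiguration value can widen it; treat the snippet below as a sketch, assuming extraConfiguration is appended to rabbitmq.conf as in current chart versions:

```yaml
# values.yaml (sketch): give peers more time to reappear before a node gives up
# waiting for its cluster mates (here ~10 minutes: 20 retries x 30000 ms).
extraConfiguration: |
  mnesia_table_loading_retry_limit = 20
  mnesia_table_loading_retry_timeout = 30000
```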
@javsalgar given that this question keeps coming up, one way or another, together with the (completely wrong, as stated many times earlier) recommendation: RabbitMQ can log an extra message when it runs out of attempts contacting cluster peers, but we can tell from experience that virtually no one reads logs until told to do so explicitly. And since the core team does not have much influence over "DIY" (Operator-less) installations on Kubernetes, this long-understood and solved problem keeps popping up.
I just want to mention that the error logs are not displayed by default; one must also set image.debug=true to see what actually happens (the Mnesia tables error). I would suggest we go back to basics and make things easy again, like removing forceBoot completely, even from suggestions, and align with what @michaelklishin is saying.
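For reference, the flag being discussed is the chart's image.debug value (a sketch; in bitnami images it maps to the BITNAMI_DEBUG environment variable, so confirm against your chart and image version):

```yaml
# values.yaml (sketch): surface the container's initialization/debug output,
# which is otherwise suppressed, so errors such as failed Mnesia table loads
# actually show up in `kubectl logs`.
image:
  debug: true
```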
@javsalgar here's a PR to get the ball rolling: #25873. Hopefully it will stop the bleeding (this kind of question) and direct folks towards understanding what's going on with their deployments and what their two options are :)
@javsalgar is the logging behavior mentioned above intentional? The RabbitMQ community Docker image does not suppress node logs by default, so I'm curious why this is the case here. I'd personally always want more users to have easy access to RabbitMQ logs, since that's the very first thing we ask for, both on GitHub and in response to commercial tickets.
Regarding logging: if I don't set image.debug=true, the logs stop here and never output anything else:
I thought it used syslog, but…
We show the application and error logs on stdout by default. The one that we suppress has to do with the bash initialization logic, to avoid adding unnecessary noise to the initialization logs unless it fails. However, it makes sense to revisit the logging of that specific part of the initialization to make it easier to spot any error.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Name and Version
bitnami/rabbitmq 12.15.0
What architecture are you using?
amd64
What steps will reproduce the bug?
We run this chart on a TrueNAS server, deployed with FluxCD, with 3 pods. Restart the k3s node and the cluster will not recover.
Are you using any custom parameters or values?
What is the expected behavior?
The cluster should be able to recover; it seems that forceBoot: true does not help.
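For completeness, this is roughly what the setting in question looks like in the chart values (a sketch assuming the clustering.forceBoot key used by current bitnami/rabbitmq versions):

```yaml
# values.yaml (sketch): force an unconditional boot of each node so it does not
# wait for the last-seen cluster peers. Use with care: a forced boot can discard
# data that was only present on the unavailable peers.
clustering:
  forceBoot: true
```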
What do you see instead?
The cluster (3 nodes) is not able to recover after a server shutdown.
Additional information
So the readiness and liveness probes fail with:
Checking the logs I see the following; only one pod is up instead of 3: