Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal #1317
Conversation
/cc @ashcrow I'm not sure how to fully test this. Per [1], I'm not entirely sure that this fixes the root cause. However, this won't hurt.
@darkmuggle: This pull request references Bugzilla bug 1769847, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if any of the above changes. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
this will have to go into 4.4 (current master) and then be backported to 4.3 via the cherry-pick bot
cc: @cgwalters @abhinavdahiya is this what you had in mind per the BZ? (this is outside my wheelhouse, so deferring to the above)
I don't see the value this change adds to BZ 1769847 or the existing code?
The reason for it was:
This ensures that on TERM, the route mapping is cleared. I don't see the value, but I was asked to look at it. This at least ensures that on TERM the node will start to fail the LB health checks a little faster...
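For context, the shape of the change under discussion is roughly the following sketch in Go: trap the termination signal and stop the host's routes service before exiting. The helper name `stopRoutesService` and the exact unit handling are illustrative, not the actual gcp-routes-controller code.

```go
// Minimal sketch: trap SIGTERM/SIGINT in the controller and tear down
// the host routes before exiting, so the node starts failing the LB
// health checks sooner. Names here are illustrative only.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// stopRoutesService asks systemd to stop the unit that maintains the
// GCP forwarded-IP routes (hypothetical helper for this sketch).
func stopRoutesService() error {
	return exec.Command("systemctl", "stop", "gcp-routes.service").Run()
}

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	// ... the controller's normal run loop would go here ...

	sig := <-sigCh
	log.Printf("received %v, clearing GCP routes", sig)
	if err := stopRoutesService(); err != nil {
		log.Printf("failed to stop gcp-routes.service: %v", err)
	}
}
```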
sounds like you are thinking gcp-routes.service should be changed directly?
The crux of the problem is that the node's routes to the LB should not exist until the node is ready/capable of servicing the route.
Basic question (because I truly don't know), but wouldn't a service running on the host be better positioned to know if the host is ready/not ready?
I don't quite understand what this means?
Also linking over to #1031, which has a good description and discussion of this portion of the codebase (cmd/gcp-routes-controller).
The node can have many routes (or IP addresses). Each route corresponds to a specific port (or service). The metadata service does not provide a way to know which route corresponds to which port. We assume, in the case of RHCOS, that the route exists for the LB in front of the API service, but it could be for any service on the host. Each LB has a health check that is set up. Without knowing what the port is, or what health check to test for, we can only make assumptions.
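To illustrate the limitation being described: GCP's metadata service exposes the forwarded IPs (the LB VIPs) for an instance's interface, but nothing about ports or health checks. A minimal Go sketch of that query, assuming the standard `forwarded-ips` metadata entry (error handling abbreviated):

```go
// Query the GCP metadata service for the load-balancer VIPs forwarded
// to this instance. The response is only a list of addresses; nothing
// maps a VIP to a port or a health check.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	const url = "http://metadata.google.internal/computeMetadata/v1/" +
		"instance/network-interfaces/0/forwarded-ips/?recursive=true"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Metadata-Flavor", "Google") // required on all GCP metadata queries

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Just a list of VIPs: no port or health-check information, so a
	// controller can only assume which service each route fronts.
	fmt.Println(string(body))
}
```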
I don't have the expertise here to say this change is correct, but it looks reasonable to me. One thing I can answer though:
It should be sufficient to deploy a 4.3 (or 4.2, for that matter) cluster in GCP, then cause the MCO to reboot the control plane by deploying a dummy MC that writes a file. So probably first reproduce the failure by doing API calls in a loop.
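For reference, a dummy MC of the kind mentioned above might look like the sketch below (4.3-era Ignition 2.2 spec; the name and file path are placeholders). Applying it makes the MCD write the file and reboot each master in turn.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-dummy-reboot
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - path: /etc/dummy-reboot-trigger   # placeholder path
          filesystem: root
          mode: 420                          # 0644 in decimal
          contents:
            source: data:,dummy
```

While the masters roll, hitting the API in a loop (or running openshift-tests run-monitor, as noted below) should show whether connections drop non-gracefully.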
Closes: BZ 1769847
You can run “openshift-tests run-monitor” and it will tell you if a master drops out
/test e2e-gcp-op
/approve
Aside from trying to figure out how to get this to pass the CI (not my wheelhouse), this is ready for review. I'll work on that part.
Just to capture the actual issue for bz1769847: the MCD issues a reboot of the machine. The apiserver container is configured for graceful termination, such that no new connections are allowed and all current work is completed while its health check is marked as failing. Now, when the reboot is issued, systemd starts shutting down services, and it shuts down gcp-routes.service. Since gcp-routes.service is designed to clean up when it receives a stop, it removes the IP route, immediately dropping/closing connections to the apiserver. Hence all the work done above to gracefully close connections on the apiserver side is not being used here. This abrupt teardown is what causes the non-graceful rollout of the apiserver in bz1769847.
Why is gcp-routes.service set up this way? That detail should be in #1031. Now, what should the solution be to restore graceful connection draining for the apiservers? Two options (a hypothetical sketch of option A follows this list):
A) Switch the interlink between gcp-routes.service and gcp-routes-controller from STOP to something else, such that stopping gcp-routes doesn't affect connectivity, but gcp-routes-controller still has the capability to request cleanup when required.
B) Move the gcp-routes-controller health checking into gcp-routes.service itself, and allow the MCO to configure and turn on the behavior.
Hopefully that provides more context around https://bugzilla.redhat.com/show_bug.cgi?id=1769847#c7
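To make option (A) concrete, here is a minimal, hypothetical Go sketch: instead of running systemctl stop (which tears the routes out immediately), the controller writes a flag file that gcp-routes.service would be taught to watch. The path /run/gcp-routes/down and the overall protocol are invented for illustration, not the actual interlink.

```go
// Hypothetical sketch of option (A): rather than stopping
// gcp-routes.service, the controller drops a "down" flag file; the
// routes script would poll for it and withdraw routes on its own
// schedule. An ordinary `systemctl stop` during reboot then no longer
// rips connections out from under the apiserver.
package main

import (
	"log"
	"os"
)

const downFile = "/run/gcp-routes/down" // invented path for this sketch

// requestRouteTeardown signals the routes script without touching
// systemd at all.
func requestRouteTeardown() error {
	if err := os.MkdirAll("/run/gcp-routes", 0755); err != nil {
		return err
	}
	f, err := os.Create(downFile)
	if err != nil {
		return err
	}
	return f.Close()
}

func main() {
	if err := requestRouteTeardown(); err != nil {
		log.Fatalf("failed to request route teardown: %v", err)
	}
	log.Printf("wrote %s; gcp-routes.service would withdraw routes on its next pass", downFile)
}
```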
Aside: we can fix that
Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD. |
If the MCD knows what the route is, then the problem domain can be a whole lot simpler.
IMHO, that would be the better fix, and it would address my concerns about the correct route being set up for the service served by the LB.
/retest
/bugzilla refresh
The requirements for Bugzilla bugs have changed, recalculating validity.
@openshift-bot: This pull request references Bugzilla bug 1769847, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if any of the above changes. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh |
@miabbott: This pull request references Bugzilla bug 1769847, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
You would want RHCOS to be usable without the machine-config-daemon running on it, e.g. on the bootstrap host or a new control-plane node.
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: cgwalters, darkmuggle, runcom
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@darkmuggle: All pull requests linked via external trackers have merged. Bugzilla bug 1769847 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closes: BZ 1769847
- What I did
In BZ 1769847, it was observed that shutting down the GCP routes controller results in node failures. This change stops the host's routes service on signal.
- How to verify it
- Description for the changelog
Shut down the GCP host routes on OS signal.