
Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal #1317

Merged: 1 commit, Dec 12, 2019

Conversation

darkmuggle
Contributor

Closes: BZ 1769847

- What I did

In BZ 1769847, it was observed that shutting down the GCP controller results in node failures. This change stops the host's routes service on signal (a rough sketch of the approach follows the changelog note below).

- How to verify it

- Description for the changelog
Shutdown the GCP host route on OS Signal.
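
For illustration only, here is a minimal sketch of the kind of signal handling involved, assuming the controller shells out to systemctl to stop the host's gcp-routes.service; the helper name and exact mechanism below are assumptions, not the literal diff in this PR.

```go
// Illustrative sketch only: stop the host's gcp-routes service when the
// controller receives a termination signal. stopGCPRoutes is a hypothetical
// helper; the actual change in this PR may be structured differently.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// stopGCPRoutes asks systemd to stop gcp-routes.service, which tears down
// the node's LB route mapping.
func stopGCPRoutes() error {
	return exec.Command("systemctl", "stop", "gcp-routes.service").Run()
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	// ... the controller's normal watch loop would run here ...

	sig := <-sigs
	log.Printf("received %v, stopping gcp-routes.service", sig)
	if err := stopGCPRoutes(); err != nil {
		log.Printf("failed to stop gcp-routes.service: %v", err)
	}
}
```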

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 9, 2019
@darkmuggle
Contributor Author

/cc @ashcrow I'm not sure how to fully test this. Per [1], I'm not entirely sure that this fixes the root cause. However, this won't hurt.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1769847

@kikisdeliveryservice kikisdeliveryservice changed the title cmd/gcp-routes-controller: shutdown gcp routes on signal [WIP] Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal Dec 9, 2019
@openshift-ci-robot
Contributor

@darkmuggle: This pull request references Bugzilla bug 1769847, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "4.3.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[WIP] Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 9, 2019
@kikisdeliveryservice
Contributor

This will have to go into 4.4 (current master) first and then be backported to 4.3 via the cherry-pick bot.

@kikisdeliveryservice
Contributor

cc: @cgwalters @abhinavdahiya is this what you had in mind per the BZ?

(this is outside of my wheelhouse so deferring to the above)

@darkmuggle darkmuggle requested review from abhinavdahiya and removed request for ericavonb and yuqi-zhang December 10, 2019 00:00
@abhinavdahiya
Contributor

I don't see the value add of this change to BZ 1769847 or the existing code?

@darkmuggle
Contributor Author

I don't see the value add of this change to BZ 1769847 or the existing code?

The reason for it was:

Clayton Coleman 1:33 PM
so on GCP thing 🙂
the GCP-routes service - does it clear itself on TERM (remove the mapping)?

This ensures that on TERM, the route mapping is cleared. I don't see the value, but I was asked to look at it. This at least ensures that on TERM the node will start to fail the LB health checks a little faster...

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Dec 10, 2019

@abhinavdahiya

I don't see the value add of this change to BZ 1769847 or the existing code?

sounds like you are thinking gcp-routes.service should be changed directly?

@darkmuggle
Contributor Author

The crux of the problem is that the node's routes to the LB should not exist until the node is ready/capable of servicing the route. The gcp-routes.service sets up the kernel's routing so that requests from the LB can be handled by listening services. The MCO GCP controller, AFAIK, starts or stops the host's service when it's needed. Making gcp-routes.service "smarter" doesn't solve the problem of knowing when/if the node is ready, since it is not aware of "why" the LB exists, "what" service/port it fronts, or, for that matter, which health checks it should be checking.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Dec 10, 2019

The crux of the problem is that the node's routes to the LB should not exist until the node is ready/capable of servicing the route. The gcp-routes.service sets up the kernel's routing so that requests from the LB can be handled by listening services. The MCO GCP controller, AFAIK, starts or stops the host's service when it's needed. Making gcp-routes.service "smarter" doesn't solve the problem of knowing when/if the node is ready, since it is not aware of "why" the LB exists, "what" service/port it fronts, or, for that matter, which health checks it should be checking.

Basic question (because I truly don't know), but wouldn't a service running on the host be better positioned to know if the host is ready/not ready?

since it is not aware of "why" the LB exists

I don't quite understand what this means.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Dec 10, 2019

Also linking over to #1031, which has a good description and discussion of this portion of the codebase (cmd/gcp-routes-controller).

@darkmuggle
Contributor Author

Basic question (because I truly don't know), but wouldn't a service running on the host be better positioned to know if the host is ready/not ready?

since it is not aware of "why" the LB exists

I don't quite understand what this means.

The node can have many routes (or IP addresses). Each route corresponds to a specific port (or service). The metadata service does not provide a way to know which route corresponds to which port. We assume, in the case of RHCOS, that the route exists for the LB in front of the API service, but it could be for any service on the host.

Each LB has a health check that is set up. Without knowing what the port is, or which health check to test, we can only make assumptions.

That's why gcp-routes.service is so dumb: it just sets up the routes for the LBs. But if memory serves, the reason the MCO has this code is that the route used to be set up before the API server was ready, and that led to a black-hole situation.
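
For context, here is a rough sketch (an illustration, not the project's actual script) of what the GCP metadata server exposes, assuming the standard metadata layout and a GCP instance to run on: it lists the forwarded (LB) IPs for the node but says nothing about which service, port, or health check each one belongs to.

```go
// Illustration: list the load balancer IPs forwarded to this instance via
// the GCP metadata server. Note there is no service/port/health-check info,
// which is why the routes can only be set up "dumbly".
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// metadataGet fetches a path from the GCP instance metadata server.
func metadataGet(path string) (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/"+path, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Directory listings return one child entry per line.
	listing, err := metadataGet("instance/network-interfaces/0/forwarded-ips/")
	if err != nil {
		panic(err)
	}
	for _, idx := range strings.Fields(listing) {
		ip, err := metadataGet("instance/network-interfaces/0/forwarded-ips/" + idx)
		if err != nil {
			panic(err)
		}
		fmt.Println("forwarded IP (no service/port metadata):", ip)
	}
}
```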

@cgwalters
Member

I don't have the expertise here to say this change is correct, but it looks reasonable to me. One thing I can answer though:

I'm not sure how to fully test this.

It should be sufficient to deploy a 4.3 (or 4.2 for that matter) cluster in GCP, then cause the MCO to reboot the control plane by deploying a dummy MC that writes /etc/darkmuggle-was-here or so. During the drain/reboot process, the API server should remain fully available.

So probably first reproduce the failure by running API calls in a loop (oc get pods in a shell, or whatever), then deploy this patch and verify you don't see API calls fail.

@smarterclayton
Contributor

You can run “openshift-tests run-monitor” and it will tell you if a master drops out

@darkmuggle
Contributor Author

/test e2e-gcp-op

@cgwalters
Member

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2019
@darkmuggle darkmuggle marked this pull request as ready for review December 10, 2019 22:10
@darkmuggle darkmuggle changed the title [WIP] Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal Dec 10, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2019
@darkmuggle
Contributor Author

Aside from figuring out how to get this to pass CI (not my wheelhouse), this is ready for review. I'll work on that part.

@abhinavdahiya
Contributor

Just to capture the actual issue for bz1769847:

The MCD issues a reboot of the machine. The apiserver container is configured for graceful termination, such that no new connections are accepted, all in-flight work is completed, and its health check is marked as failing.
This gives the LB time to react and allows a graceful rolling of apiservers.

Now, when the reboot is issued, systemd starts shutting down services, and it shuts down gcp-routes.service. Since gcp-routes.service is designed to clean up when it receives a stop, it removes the IP route immediately, dropping/closing connections to the apiserver... and hence all the work done above to gracefully close apiserver connections is wasted here.

This abrupt cutoff is what causes the non-graceful rollout of the apiserver in bz1769847.

Since gcp-routes.service is designed to clean up when it receives a stop

Why is gcp-routes.service set up this way? That detail should be in #1031.

Now, what should the solution be to get graceful connection draining back for the apiservers?

A) Switch the interlink between gcp-routes.service and gcp-routes-controller from STOP to something else, such that stopping gcp-routes doesn't affect connectivity but gcp-routes-controller still has the capability to request cleanup when required (a hypothetical sketch of this follows at the end of this comment).

  • we will have to keep in mind that upgrades change machine-os-content on reboot, but the files on disk are changed immediately

B) Move the gcp-routes-controller health checking into gcp-routes.service itself, and allow the MCO to configure and turn on the behavior.

Hopefully that provides more context around https://bugzilla.redhat.com/show_bug.cgi?id=1769847#c7
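
Purely as a hypothetical sketch of option A: the controller could request cleanup explicitly instead of relying on the unit's stop action, so that a plain systemd stop during reboot leaves the routes alone. The script path and the "down" argument below are illustrative assumptions, not the project's actual interface.

```go
// Hypothetical sketch of option A. The script path and "down" argument are
// made up for illustration; the point is that route teardown is triggered
// only when the controller explicitly asks for it, not by `systemctl stop`.
package main

import (
	"log"
	"os/exec"
)

// requestRouteCleanup explicitly tears down the LB routes on the host.
func requestRouteCleanup() error {
	return exec.Command("/path/to/gcp-routes.sh", "down").Run()
}

func main() {
	if err := requestRouteCleanup(); err != nil {
		log.Printf("route cleanup failed: %v", err)
	}
}
```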

@cgwalters
Member

we will have to keep in mind that upgrades change machine-os-content on reboot but the files on disk are changed immediately

Aside: we can fix that

@cgwalters
Member

Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

@darkmuggle
Contributor Author

Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot simpler:

  • burn the gcp-routes.sh and gcp-routes.service in RHCOS
  • the MCD sets up only the route it needs

IMHO, that would be the better fix, and it solves my concerns about the correct route being set up for the service served by the LB.

@darkmuggle
Contributor Author

/retest

@openshift-bot
Contributor

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Bugzilla bug 1769847, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "4.3.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ashcrow
Member

ashcrow commented Dec 11, 2019

ci/prow/e2e-aws-scaleup-rhel7 can be ignored

@darkmuggle darkmuggle removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 11, 2019
@miabbott
Member

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Dec 11, 2019
@openshift-ci-robot
Contributor

@miabbott: This pull request references Bugzilla bug 1769847, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Contributor

Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot simpler:

* burn the `gcp-routes.sh` and `gcp-routes.service` in RHCOS

* the MCD sets up only the route it needs

IMHO, that would be the better fix, and it solves my concerns about the correct route being set up for the service served by the LB.

You would want RHCOS to be usable without the machine-config-daemon running on it, e.g. on the bootstrap host or a new control-plane node.

@runcom
Member

runcom commented Dec 12, 2019

/retest

@runcom
Member

runcom commented Dec 12, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 12, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, darkmuggle, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 94a791f into openshift:master Dec 12, 2019
@openshift-ci-robot
Contributor

@darkmuggle: All pull requests linked via external trackers have merged. Bugzilla bug 1769847 has been moved to the MODIFIED state.

In response to this:

Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
