
Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal #1317

Merged: 1 commit, Dec 12, 2019

Conversation

darkmuggle
Contributor

Closes: BZ 1769847

- What I did

In BZ 1769847, it was observed that shutting down the GCP controller results in node failures. This change stops the host's routes service on signal (a rough sketch of the approach follows the changelog note below).

- How to verify it

- Description for the changelog
Shutdown the GCP host route on OS Signal.
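
For illustration only, here is a minimal sketch of the kind of signal handling involved, assuming the controller shells out to systemctl to stop the host's gcp-routes.service; the helper name and exact mechanism below are assumptions, not the literal diff in this PR.

```go
// Illustrative sketch only: stop the host's gcp-routes service when the
// controller receives a termination signal. stopGCPRoutes is a hypothetical
// helper; the actual change in this PR may be structured differently.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// stopGCPRoutes asks systemd to stop gcp-routes.service, which tears down
// the node's LB route mapping.
func stopGCPRoutes() error {
	return exec.Command("systemctl", "stop", "gcp-routes.service").Run()
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	// ... the controller's normal watch loop would run here ...

	sig := <-sigs
	log.Printf("received %v, stopping gcp-routes.service", sig)
	if err := stopGCPRoutes(); err != nil {
		log.Printf("failed to stop gcp-routes.service: %v", err)
	}
}
```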

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 9, 2019
@darkmuggle
Contributor Author

/cc @ashcrow I'm not sure how to fully test this. Per [1], I'm not entirely sure that this fixes the root cause. However, this won't hurt.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1769847

@kikisdeliveryservice kikisdeliveryservice changed the title cmd/gcp-routes-controller: shutdown gcp routes on signal [WIP] Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal Dec 9, 2019
@openshift-ci-robot
Contributor

@darkmuggle: This pull request references Bugzilla bug 1769847, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "4.3.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[WIP] Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 9, 2019
@kikisdeliveryservice
Contributor

This will have to go into 4.4 (current master) first and then be backported to 4.3 via the cherry-pick bot.

@kikisdeliveryservice
Contributor

cc: @cgwalters @abhinavdahiya is this what you had in mind per the BZ?

(this is outside of my wheelhouse so deferring to the above)

@darkmuggle darkmuggle requested review from abhinavdahiya and removed request for ericavonb and yuqi-zhang December 10, 2019 00:00
@abhinavdahiya
Contributor

I don't see the value add of this change to BZ 1769847 or the existing code?

@darkmuggle
Contributor Author

I don't see the value add of this change to BZ 1769847 or the existing code?

The reason for it was:

Clayton Coleman 1:33 PM
so on GCP thing 🙂
the GCP-routes service - does it clear itself on TERM (remove the mapping)?

This ensures that on TERM, the route mapping is cleared. I don't see the value, but I was asked to look at it. This at least ensures that on TERM the node will start to fail the LB health checks a little faster...

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Dec 10, 2019

@abhinavdahiya

I don't see the value add of this change to BZ 1769847 or the existing code?

sounds like you are thinking gcp-routes.service should be changed directly?

@darkmuggle
Contributor Author

The crux of the problem is that the node's routes to the LB should not exist until the node is ready/capable of servicing the route. The gcp-routes.service sets up the kernel's routing so that requests from the LB can be handled by listening services. The MCO GCP controller, AFAIK, starts or stops the host's service when it's needed. Making gcp-routes.service "smarter" doesn't solve the problem of knowing when/if the node is ready, since it is not aware of "why" the LB exists, "what" service/port it fronts, or, for that matter, which health checks it should be checking.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Dec 10, 2019

The crux of the problem is that the node's routes to the LB should not exist until the node is ready/capable of servicing the route. The gcp-routes.service sets up the kernel's routing so that requests from the LB can be handled by listening services. The MCO GCP controller, AFAIK, starts or stops the host's service when it's needed. Making gcp-routes.service "smarter" doesn't solve the problem of knowing when/if the node is ready, since it is not aware of "why" the LB exists, "what" service/port it fronts, or, for that matter, which health checks it should be checking.

Basic question (because I truly don't know), but wouldn't a service running on the host be better positioned to know if the host is ready/not ready?

since it is not aware of "why" the LB exists

I don't quite understand what this means.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Dec 10, 2019

Also linking over to #1031, which has a good description and discussion of this portion of the codebase (cmd/gcp-routes-controller).

@darkmuggle
Contributor Author

Basic question (because I truly don't know), but wouldn't a service running on the host be better positioned to know if the host is ready/not ready?

since it is not aware of "why" the LB exists

I don't quite understand what this means.

The node can have many routes (or IP addresses). Each route corresponds to a specific port (or service). The metadata service does not provide a way to know which route corresponds to which port. We assume, in the case of RHCOS, that the route exists for the LB in front of the API service, but it could be for any service on the host.

Each LB has a health check that is set up. Without knowing what the port is, or which health check to test, we can only make assumptions.

That's why gcp-routes.service is so dumb: it just sets up the routes for the LBs. But if memory serves, the reason the MCO has this code is that the route used to be set up before the API server was ready, and that led to a black-hole situation.
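
For context, here is a rough sketch (an illustration, not the project's actual script) of what the GCP metadata server exposes, assuming the standard metadata layout and a GCP instance to run on: it lists the forwarded (LB) IPs for the node but says nothing about which service, port, or health check each one belongs to.

```go
// Illustration: list the load balancer IPs forwarded to this instance via
// the GCP metadata server. Note there is no service/port/health-check info,
// which is why the routes can only be set up "dumbly".
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// metadataGet fetches a path from the GCP instance metadata server.
func metadataGet(path string) (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/"+path, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Directory listings return one child entry per line.
	listing, err := metadataGet("instance/network-interfaces/0/forwarded-ips/")
	if err != nil {
		panic(err)
	}
	for _, idx := range strings.Fields(listing) {
		ip, err := metadataGet("instance/network-interfaces/0/forwarded-ips/" + idx)
		if err != nil {
			panic(err)
		}
		fmt.Println("forwarded IP (no service/port metadata):", ip)
	}
}
```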

@cgwalters
Member

I don't have the expertise here to say this change is correct, but it looks reasonable to me. One thing I can answer though:

I'm not sure how to fully test this.

It should be sufficient to deploy a 4.3 (or 4.2 for that matter) cluster in GCP, then cause the MCO to reboot the control plane by deploying a dummy MC that writes /etc/darkmuggle-was-here or so. During the drain/reboot process, the API server should remain fully available.

So probably first reproduce the failure by running API calls in a loop (oc get pods in a shell, or whatever), then deploy this patch and verify you don't see API calls fail.

@smarterclayton
Contributor

You can run “openshift-tests run-monitor” and it will tell you if a master drops out

@darkmuggle
Contributor Author

/test e2e-gcp-op

@cgwalters
Member

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2019
@darkmuggle darkmuggle marked this pull request as ready for review December 10, 2019 22:10
@darkmuggle darkmuggle changed the title [WIP] Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal Dec 10, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2019
@darkmuggle
Contributor Author

Aside from figuring out how to get this to pass CI (not my wheelhouse), this is ready for review. I'll work on that part.

@abhinavdahiya
Contributor

Just to capture the actual issue for bz1769847:

The MCD issues a reboot of the machine. The apiserver container is configured for graceful termination, such that no new connections are accepted, all in-flight work is completed, and its health check is marked as failing.
This gives the LB time to react and allows a graceful rolling of apiservers.

Now, when the reboot is issued, systemd starts shutting down services, and it shuts down gcp-routes.service. Since gcp-routes.service is designed to clean up when it receives a stop, it removes the IP route immediately, dropping/closing connections to the apiserver... and hence all the work done above to gracefully close apiserver connections is wasted here.

This abrupt cutoff is what causes the non-graceful rollout of the apiserver in bz1769847.

Since gcp-routes.service is designed to clean up when it receives a stop

Why is gcp-routes.service set up this way? That detail should be in #1031.

Now, what should the solution be to get graceful connection draining back for the apiservers?

A) Switch the interlink between gcp-routes.service and gcp-routes-controller from STOP to something else, such that stopping gcp-routes doesn't affect connectivity but gcp-routes-controller still has the capability to request cleanup when required (a hypothetical sketch of this follows at the end of this comment).

  • we will have to keep in mind that upgrades change machine-os-content on reboot, but the files on disk are changed immediately

B) Move the gcp-routes-controller health checking into gcp-routes.service itself, and allow the MCO to configure and turn on the behavior.

Hopefully that provides more context around https://bugzilla.redhat.com/show_bug.cgi?id=1769847#c7
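
Purely as a hypothetical sketch of option A: the controller could request cleanup explicitly instead of relying on the unit's stop action, so that a plain systemd stop during reboot leaves the routes alone. The script path and the "down" argument below are illustrative assumptions, not the project's actual interface.

```go
// Hypothetical sketch of option A. The script path and "down" argument are
// made up for illustration; the point is that route teardown is triggered
// only when the controller explicitly asks for it, not by `systemctl stop`.
package main

import (
	"log"
	"os/exec"
)

// requestRouteCleanup explicitly tears down the LB routes on the host.
func requestRouteCleanup() error {
	return exec.Command("/path/to/gcp-routes.sh", "down").Run()
}

func main() {
	if err := requestRouteCleanup(); err != nil {
		log.Printf("route cleanup failed: %v", err)
	}
}
```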

@cgwalters
Member

we will have to keep in mind that upgrades change machine-os-content on reboot but the files on disk are changed immediately

Aside: we can fix that

@cgwalters
Member

Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

@darkmuggle
Contributor Author

Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot simpler:

  • burn the gcp-routes.sh and gcp-routes.service in RHCOS
  • the MCD sets up only the route it needs

IMHO, that would be the better fix, and it solves my concerns about the correct route being set up for the service served by the LB.

@darkmuggle
Contributor Author

/retest

@openshift-bot
Contributor

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Bugzilla bug 1769847, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "4.3.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ashcrow
Member

ashcrow commented Dec 11, 2019

ci/prow/e2e-aws-scaleup-rhel7 can be ignored

@darkmuggle darkmuggle removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 11, 2019
@miabbott
Member

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Dec 11, 2019
@openshift-ci-robot
Contributor

@miabbott: This pull request references Bugzilla bug 1769847, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Contributor

Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot simpler:

* burn the `gcp-routes.sh` and `gcp-routes.service` in RHCOS

* the MCD sets up only the route it needs

IMHO, that would be the better fix, and it solves my concerns about the correct route being set up for the service served by the LB.

You would want RHCOS to be usable without the machine-config-daemon running on it, e.g. on the bootstrap host or a new control-plane node.

@runcom
Member

runcom commented Dec 12, 2019

/retest

@runcom
Member

runcom commented Dec 12, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 12, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, darkmuggle, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 94a791f into openshift:master Dec 12, 2019
@openshift-ci-robot
Contributor

@darkmuggle: All pull requests linked via external trackers have merged. Bugzilla bug 1769847 has been moved to the MODIFIED state.

In response to this:

Bug 1769847: cmd/gcp-routes-controller: shutdown gcp routes on signal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
