
Bug 1927041: daemon: safer signal handling for shutdown #2395

Merged
merged 2 commits into from
Feb 19, 2021

Conversation

darkmuggle
Contributor

@darkmuggle darkmuggle commented Feb 8, 2021

This armors the signal handling for the daemon, blocking any shutdown until after an update is complete.

The old functions catchIgnoreSIGTERM and cancelSIGTERM really didn't do much (they used the mutex and then set a bool), but there were no checks in the signal handling.

This also increases the time allowed for the MCD to shut down from 5 minutes to 1 hour. Since the MCD will shut down immediately when it is safe to do so, this shouldn't have negative effects except in the case of an already unhappy node.

@darkmuggle darkmuggle added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. WIP labels Feb 8, 2021
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2021
@kikisdeliveryservice kikisdeliveryservice changed the title daemon: safer signal handling for shutdown [WIP] daemon: safer signal handling for shutdown Feb 9, 2021
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2021
@darkmuggle darkmuggle changed the title [WIP] daemon: safer signal handling for shutdown daemon: safer signal handling for shutdown Feb 10, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2021
@darkmuggle darkmuggle added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 10, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 16, 2021
The MCD uses the standard 600 grace period (5 min) to end its work.
However, we have seen cases where this is insufficient and the node is
rebooted under the MCD. The MCD has SIGTERM handling, but if the grace
period times out, Kubernetes sends a SIGKILL.
@darkmuggle
Contributor Author

darkmuggle commented Feb 16, 2021

I dropped the Systemd Inhibit functionality to stop reboots. During the team discussion today, the prevailing view is that the MCO cannot block a reboot.

@@ -634,16 +637,16 @@ func (dn *Daemon) Run(stopCh <-chan struct{}, exitCh <-chan error) error {

go wait.Until(dn.worker, time.Second, stopCh)

for {
Contributor Author

@darkmuggle darkmuggle Feb 17, 2021


The for loop is superfluous: once we get the SIGTERM we should finish the work and then get out of the way. A SIGKILL could (and likely will) follow, and there's literally nothing we can do about that.

@sinnykumari
Contributor

This looks really good and handles interruption during the update process well.
/approve

Colin, can you please also take a look to make sure we are not missing anything here?
cc @cgwalters

@darkmuggle
Contributor Author

/retest

Member

@cgwalters cgwalters left a comment


Would you say the core thing this is fixing here is that currently we were ignoring SIGTERM if caught during an update, but we didn't then exit after the update had finished? And that was causing systemd to time out and go on the SIGKILL spree?

Overall I think this looks improved; one optional comment.

I find it hard to review all of this for correctness though - I'd reiterate that I think we really want to make this whole thing transactional w/ostree support. I hope at some point in the next few months to land some infrastructure for that.

return nil
case sig := <-signaled:
glog.Warningf("shutdown of machine-config-daemon triggered via signal %d", sig)
return fmt.Errorf("shutdown of the machine-config-daemon trigger via syscall signal %d", sig)
Member


I think getting SIGTERM isn't an error, it's a normal condition. Which relates to one of the original goals I had here in that in the "idle" aka "not applying updates" case, we don't install a SIGTERM handler at all - we just let the kernel unilaterally kill the process.

I more recently posted about this here: https://internals.rust-lang.org/t/should-rust-programs-unwind-on-sigint/13800/11

See also e.g. nhorman/rng-tools#72

And related to all this, ideally of course we do #1190 - then we shouldn't need to handle SIGTERM at all in the MCO. With that if we're killed (or the machine power cycled/kernel froze) in the middle of an update, we either have the old or new system.

Contributor Author

@darkmuggle darkmuggle Feb 18, 2021


I did go back and forth on whether or not to report SIGTERM as an error, for the reasons your blog highlights, and ultimately went with returning the error to signal the reason for the death of the process (and thus make it obvious in the logs). In retrospect, I think just logging and moving on is the better path.

FWIW I was arguing for making the MCD block reboots, however, @crawford argued that fully transactional updates would negate the need; he would rather wait for transactional support from rpm-ostree before doing any reboot armoring. My view is that until we get the transactional update mechanism, a short-lived inhibitor is better than nothing and could prevent some support cases and bug reports.

And related to all this, ideally of course we do #1190 - then we shouldn't need to handle SIGTERM at all in the MCO.

I would disagree -- we should, at the very least, throw away the pending transaction (or have rpm-ostree do it).

pkg/daemon/daemon.go (outdated review thread, resolved)
@cgwalters
Member

cgwalters commented Feb 18, 2021

Tangentially related to this...one thing that could help in the MCO design is if we had a "dual nature" as both a pod and a systemd unit. For example, we could represent the "applying updates" phase via e.g. systemctl start machine-config-operator-update.

This would also more naturally handle cases where e.g. we're updating a node and, while that happens, a cluster upgrade is happening and a new daemonset is rolling out. Now with correct SIGTERM handling the old pod won't die until done, but you could imagine that instead e.g. we allow the pod to be killed, but what's actually doing the updates is the systemd unit, and the new pod discovers that state instead. IOW think of the pod as a "control plane proxy". A systemd unit would more obviously interact with native systemd features (including e.g. things like ordering Before=/After= and possibly using the systemd inhibit functionality, etc.)

(This "proxying a systemd unit" is actually what's happening with rpm-ostreed today)
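As a purely hypothetical illustration of the "dual nature" idea (the unit name, subcommand, and paths below are invented for this sketch, not existing MCO artifacts), the update phase could be owned by a oneshot unit that orders itself against shutdown:

```ini
# Hypothetical sketch: the pod would act as a control-plane proxy and
# start this via `systemctl start machine-config-operator-update`,
# while systemd owns ordering against shutdown (and could hold inhibitors).
[Unit]
Description=Apply pending machine config update (illustrative sketch)
Before=shutdown.target

[Service]
Type=oneshot
# Invented invocation; stands in for whatever would actually apply the update.
ExecStart=/usr/libexec/machine-config-daemon apply-update
```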

@darkmuggle
Contributor Author

Tangentially related to this...one thing that could help in the MCO design is if we had a "dual nature" as both a pod and a systemd unit. For example, we could represent the "applying updates" phase via e.g. systemctl start machine-config-operator-update.

I had a similar thought, although your idea is a bit more refined. Mine was to ensure that if a user did a shutdown/reboot mid-update, RHCOS would wait until the update was done. However, the idea was NAK'd during our recent team discussion. Regardless, I do think that RHCOS and the MCO could coordinate a little better on when an update is safe to apply, or block an update from starting if the machine is shutting down (such as when a user has SSH'd in).

@darkmuggle
Contributor Author

Would you say the core thing this is fixing here is that currently we were ignoring SIGTERM if caught during an update, but we didn't then exit after the update had finished? And that was causing systemd to time out and go on the SIGKILL spree?

@cgwalters: this fixes the issues that were observed. The core problems it seeks to solve are:

  • wait to die until work finishes
  • don't start work after sigterm or signaled to stop
  • respond as soon as possible to either stopCh or sigterm

@darkmuggle darkmuggle removed the WIP label Feb 18, 2021
@sinnykumari
Contributor

Thanks Colin for reviewing.
There are some good discussion points in this PR which will be good to revisit to make update and reboot handling better.

@ben Latest push broke some tests which needs to get fixed first.

This armors the signal handling for the daemon, blocking any shutdown
until _after_ an update is complete.

The old functions `catchIgnoreSIGTERM` and `cancelSIGTERM`
really didn't do much (they used the mutex and then set a bool), but
there were no checks in the signal handling.

Signed-off-by: Ben Howard <[email protected]>
@darkmuggle
Contributor Author

Argh, I forgot to do the `make verify; make test-unit` dance after the PR review update. That's been fixed. @Ksinny, should be good to go now.

@sinnykumari
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, darkmuggle, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [darkmuggle,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2021

@darkmuggle: The following tests failed, say /retest to rerun all failed tests:

Failed tests (commit dd71541):
  • ci/prow/e2e-aws-workers-rhel7: rerun with /test e2e-aws-workers-rhel7
  • ci/prow/okd-e2e-aws: rerun with /test okd-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit e0a636e into openshift:master Feb 19, 2021
@darkmuggle darkmuggle changed the title daemon: safer signal handling for shutdown Bug 1927041: daemon: safer signal handling for shutdown Feb 24, 2021
@openshift-ci-robot
Contributor

@darkmuggle: All pull requests linked via external trackers have merged:

Bugzilla bug 1927041 has been moved to the MODIFIED state.

In response to this:

Bug 1927041: daemon: safer signal handling for shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. team-mco