Bug 1927041: daemon: safer signal handling for shutdown #2395

darkmuggle · 2021-02-08T22:57:27Z

This armors the signal handling for the daemon blocking any shutdown until after an update is complete.

The old functions catchIgnoreSIGTERM and cancelSIGTERM really didn't do much (they used the mutex and then set a bool) but there were no checks in the signal handling.

This also increases the time for the MCD to shut down from 5min to 1hr. Since the MCD will shut down immediately when it is safe to do so, this shouldn't have negative effects except in the case of an already unhappy node.

The MCD is using the standard 600 grace period to end its work (5min). However, we have seen cases where this is insufficient and the node is rebooted under the MCD. The MCD has sigterm handling, but if the grace period times out, then Kubernetes sends a SIGKILL.

darkmuggle · 2021-02-16T21:00:39Z

I dropped the Systemd Inhibit functionality to stop reboots. During the team discussion today, the prevailing view is that the MCO cannot block a reboot.

darkmuggle · 2021-02-17T15:37:58Z

pkg/daemon/daemon.go

@@ -634,16 +637,16 @@ func (dn *Daemon) Run(stopCh <-chan struct{}, exitCh <-chan error) error {

 	go wait.Until(dn.worker, time.Second, stopCh)

-	for {


The for loop is superfluous since once we get the sigterm we should finish the work and then get out of the way. A sigkill could/likely follow and there's literally nothing we can do about that.

sinnykumari · 2021-02-17T17:33:04Z

This really looks good and handles interruption during the update process well.
/approve

Colin, can you please also take a look to make sure we are not missing anything here?
cc @cgwalters

darkmuggle · 2021-02-17T23:04:10Z

/retest

cgwalters

Would you say the core thing this is fixing here is that currently we were ignoring SIGTERM if caught during an update, but we didn't then exit after the update had finished? And that was causing systemd to time out and go on the SIGKILL spree?

Overall I think this looks improved; one optional comment.

I find it hard to review all of this for correctness though - I'd reiterate that I think we really want to make this whole thing transactional w/ostree support. I hope at some point in the next few months to land some infrastructure for that.

cgwalters · 2021-02-18T20:01:30Z

pkg/daemon/daemon.go

+		return nil
+	case sig := <-signaled:
+		glog.Warningf("shutdown of machine-config-daemon triggered via signal %d", sig)
+		return fmt.Errorf("shutdown of the machine-config-daemon trigger via syscall signal %d", sig)


I think getting SIGTERM isn't an error, it's a normal condition. Which relates to one of the original goals I had here in that in the "idle" aka "not applying updates" case, we don't install a SIGTERM handler at all - we just let the kernel unilaterally kill the process.
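The "only handle SIGTERM while applying updates" idea can be sketched with the standard `os/signal` package. This is a minimal sketch under assumptions, not the MCD implementation: `runUpdate` is a hypothetical helper, and the demo in `main` assumes a Unix-like system where a process can signal itself.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runUpdate installs a SIGTERM handler only for the duration of update and
// removes it afterwards, so an idle process keeps no handler of its own.
// It reports whether SIGTERM arrived while the update was running, so the
// caller can exit once the update is done. Hypothetical helper.
func runUpdate(update func()) (termSeen bool) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	defer signal.Stop(sigCh)

	update()

	select {
	case <-sigCh:
		return true
	default:
		return false
	}
}

func main() {
	seen := runUpdate(func() {
		// Simulate an external SIGTERM arriving mid-update (Unix-like systems).
		if p, err := os.FindProcess(os.Getpid()); err == nil {
			_ = p.Signal(syscall.SIGTERM)
		}
		time.Sleep(100 * time.Millisecond) // give the runtime time to relay it
	})
	fmt.Println("SIGTERM seen during update:", seen)
}
```

Once `signal.Stop` has run and no other channel is registered for SIGTERM, the Go runtime is expected to fall back to its default handling, which terminates the process. That matches the "idle" case described above.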

I more recently posted about this here: https://internals.rust-lang.org/t/should-rust-programs-unwind-on-sigint/13800/11

See also e.g. nhorman/rng-tools#72

And related to all this, ideally of course we do #1190 - then we shouldn't need to handle SIGTERM at all in the MCO. With that if we're killed (or the machine power cycled/kernel froze) in the middle of an update, we either have the old or new system.

I did go back and forth on whether or not to report SIGTERM as an error, for the reasons your blog highlights, and ultimately went with returning the error to signal the reason for the death of the process (and thus make it obvious in the logs). In retrospect, I think just logging and moving on is the better path.

FWIW I was arguing for making the MCD block reboots, however, @crawford argued that fully transactional updates would negate the need; he would rather wait for transactional support from rpm-ostree before doing any reboot armoring. My view is that until we get the transactional update mechanism, a short-lived inhibitor is better than nothing and could prevent some support cases and bug reports.

And related to all this, ideally of course we do #1190 - then we shouldn't need to handle SIGTERM at all in the MCO.

I would disagree -- we should, at the very least, throw away the pending transaction (or have rpm-ostree do it).

cgwalters · 2021-02-18T20:11:55Z

Tangentially related to this...one thing that could help in the MCO design is if we had a "dual nature" as both a pod and a systemd unit. For example, we could represent the "applying updates" phase via e.g. systemctl start machine-config-operator-update.

This would also more naturally handle cases where e.g. we're updating a node and, while that happens, a cluster upgrade is happening and a new daemonset is rolling out. Now with correct SIGTERM handling the old pod won't die until done, but you could imagine that instead e.g. we allow the pod to be killed, but what's actually doing the updates is the systemd unit, and the new pod discovers that state instead. IOW think of the pod as like a "control plane proxy". A systemd unit would more obviously natively interact with native systemd features (including e.g. things like ordering Before/After= and possibly using the systemd inhibit functionality, etc.)

(This "proxying a systemd unit" is actually what's happening with rpm-ostreed today)

darkmuggle · 2021-02-18T22:38:51Z

Tangentially related to this...one thing that could help in the MCO design is if we had a "dual nature" as both a pod and a systemd unit. For example, we could represent the "applying updates" phase via e.g. systemctl start machine-config-operator-update.

I had a similar thought, although your idea is a bit more refined. My idea was to have something that would ensure that if a user did a shutdown/reboot in the middle that RHCOS would wait until the update was done. However, the idea was NAK'd during our recent team discussion. Regardless, I do think that RHCOS and MCO could work a little better on coordinating when an update is safe to apply or block an update from starting if the machine is shutting down (such as in the case when a user has SSH'd).

darkmuggle · 2021-02-18T23:31:03Z

Would you say the core thing this is fixing here is that currently we were ignoring SIGTERM if caught during an update, but we didn't then exit after the update had finished? And that was causing systemd to time out and go on the SIGKILL spree?

@cgwalters this fixes four issues that were observed:

when https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L1965 was called, it would in fact wait for 24 hours to be killed.
This is because all sigterms were simply swallowed. https://github.com/openshift/machine-config-operator/pull/2395/files#diff-349c0748c3f52201852d3027c29daf618b45300b949482728df6666a3f9ba245L1917-L1933 shows that the prior logic literally did nothing. The updateActive bool was used to log that an update was active but didn't change the daemon's execution.
This fixes a race condition between the MCD receiving a sigterm and the MCD starting work. We have a report of the NTO applying an MCC while the SRIOV Operator triggers a reboot. Effectively since the sigterm was being trapped, the daemon might log if an update is active, but otherwise would continue normal operations until its work queue was closed or it was sigkill'd.

The stopCh channel did not stop work.

machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go, line 140 in 61a1377:

    func BackoffUntil(f func(), backoff BackoffManager, sliding bool, stopCh <-chan struct{}) {

is the code that runs the worker func. The interesting problem is that wait.Until will run the function until stop is signaled, but it won't stop the func. Since

machine-config-operator/pkg/daemon/daemon.go, lines 348 to 351 in 61a1377:

    func (dn *Daemon) worker() {
    	for dn.processNextWorkItem() {
    	}
    }

runs as an infinite loop, work can start after it is supposed to stop.

So the core problem this seeks to solve is:

wait to die until work finishes
don't start work after sigterm or signaled to stop
respond as soon as possible to either stopCh or sigterm
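One way to picture these three goals together is a small gate that work items must pass through. This is a sketch under assumptions, not the code in this PR: the `gate` type and its `begin`, `end`, and `shutdown` methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// gate tracks whether shutdown was requested and whether work is in flight.
// Hypothetical type for illustration; not MCD code.
type gate struct {
	mu       sync.Mutex
	stopping bool
	inFlight sync.WaitGroup
}

// begin reports whether a new work item may start. It refuses once shutdown
// has been requested, so no work starts after SIGTERM or a stop signal.
func (g *gate) begin() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.stopping {
		return false
	}
	g.inFlight.Add(1)
	return true
}

// end marks one work item as finished.
func (g *gate) end() { g.inFlight.Done() }

// shutdown marks the gate as stopping and blocks until in-flight work ends.
// With nothing in flight it returns immediately.
func (g *gate) shutdown() {
	g.mu.Lock()
	g.stopping = true
	g.mu.Unlock()
	g.inFlight.Wait()
}

func main() {
	g := &gate{}
	finished := false
	if g.begin() {
		go func() {
			defer g.end()
			finished = true // stands in for applying an update
		}()
	}
	g.shutdown()                     // returns only after the update finished
	fmt.Println(finished, g.begin()) // prints: true false
}
```

A signal handler or a `stopCh` receiver would call `shutdown` and then exit, while the worker would wrap each item in `begin`/`end`.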

sinnykumari · 2021-02-19T09:24:34Z

Thanks Colin for reviewing.
There are some good discussion points in the PR which will be good to revisit to make updates and reboots better.

@ben The latest push broke some tests, which need to be fixed first.

This armors the signal handling for the daemon, blocking any shutdown until _after_ an update is complete. The old functions `catchIgnoreSIGTERM` and `cancelSIGTERM` really didn't do much (they used the mutex and then set a bool) but there were no checks in the signal handling.

Signed-off-by: Ben Howard <[email protected]>

darkmuggle · 2021-02-19T14:05:25Z

Arg, I forgot to do the make verify; make test-unit dance after PR review update. That's been fixed. @Ksinny should be good to go now.

sinnykumari · 2021-02-19T15:37:11Z

/lgtm

openshift-ci-robot · 2021-02-19T15:37:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, darkmuggle, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [darkmuggle,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-02-19T16:06:55Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-02-19T16:20:00Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-02-19T16:45:57Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-02-19T16:58:56Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-02-19T17:50:30Z

@darkmuggle: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-aws-workers-rhel7	`dd71541`	link	`/test e2e-aws-workers-rhel7`
ci/prow/okd-e2e-aws	`dd71541`	link	`/test okd-e2e-aws`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot · 2021-02-19T17:50:54Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2021-02-24T16:05:23Z

@darkmuggle: All pull requests linked via external trackers have merged:

Bugzilla bug 1927041 has been moved to the MODIFIED state.

In response to this:

Bug 1927041: daemon: safer signal handling for shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

darkmuggle added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. WIP labels Feb 8, 2021

openshift-ci-robot requested review from sinnykumari and yuqi-zhang February 8, 2021 22:58

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2021

kikisdeliveryservice added the team-mco label Feb 9, 2021

kikisdeliveryservice changed the title ~~daemon: safer signal handling for shutdown~~ [WIP] daemon: safer signal handling for shutdown Feb 9, 2021

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2021

darkmuggle changed the title ~~[WIP] daemon: safer signal handling for shutdown~~ daemon: safer signal handling for shutdown Feb 10, 2021

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2021

darkmuggle added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 10, 2021

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 16, 2021

darkmuggle commented Feb 17, 2021

sinnykumari reviewed Feb 17, 2021

sinnykumari requested a review from cgwalters February 17, 2021 17:33

cgwalters approved these changes Feb 18, 2021

darkmuggle mentioned this pull request Feb 18, 2021

WIP: Debugging MCD "Got SIGTERM, but actively updating" when it shouldn't be updating #2407

Closed

darkmuggle removed the WIP label Feb 18, 2021

openshift-ci-robot assigned sinnykumari Feb 19, 2021

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2021

openshift-merge-robot merged commit e0a636e into openshift:master Feb 19, 2021

darkmuggle changed the title ~~daemon: safer signal handling for shutdown~~ Bug 1927041: daemon: safer signal handling for shutdown Feb 24, 2021

jkyros mentioned this pull request Jun 21, 2021

Bug 1965992: Gracefully shutdown taking around 6-7 mins (libvirt provider) #2631

Merged
