node-controller: Support an annotation to hold/prioritize updates #2162

cgwalters · 2020-10-15T00:45:33Z

Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely (for as long
as they want) as well as guide upgrade ordering.

This replaces the defunct etcd-specific code with support for a generic
annotation machineconfiguration.openshift.io/update-order that allows
an external controller (and/or human) to do both of these.

Setting it to 0 will entirely skip that node for updates. Otherwise,
higher values are preferred.

Closes: #2059
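
The semantics described above can be sketched in Go roughly as follows. This is a minimal illustration, not the code in this PR: the `node` struct, the `orderCandidates` helper, and the default order of 1 for unannotated nodes are assumptions made for the example. Only the annotation key and the rule that 0 skips a node while higher values are preferred come from the description.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// Annotation key named in the PR description.
const updateOrderAnnotation = "machineconfiguration.openshift.io/update-order"

// defaultOrder is an assumption: the order used for nodes without the annotation.
const defaultOrder = 1

// node is a hypothetical stand-in for a Kubernetes Node object,
// keeping only the fields this sketch needs.
type node struct {
	Name        string
	Annotations map[string]string
}

// orderCandidates drops nodes whose order is 0 and returns the rest,
// highest order first. Nodes with an unparseable value are skipped
// after a warning.
func orderCandidates(candidates []node) []node {
	type ranked struct {
		n     node
		order int
	}
	var kept []ranked
	for _, n := range candidates {
		order := defaultOrder
		if v, ok := n.Annotations[updateOrderAnnotation]; ok {
			parsed, err := strconv.Atoi(v)
			if err != nil {
				fmt.Printf("Failed to parse %s %s: %v\n", n.Name, updateOrderAnnotation, err)
				continue
			}
			order = parsed
		}
		// order 0 means "skip this node"
		if order == 0 {
			continue
		}
		kept = append(kept, ranked{n, order})
	}
	// A stable sort keeps the incoming order for nodes with equal values.
	sort.SliceStable(kept, func(i, j int) bool { return kept[i].order > kept[j].order })
	out := make([]node, 0, len(kept))
	for _, r := range kept {
		out = append(out, r.n)
	}
	return out
}

func main() {
	nodes := []node{
		{Name: "worker-a"},
		{Name: "worker-b", Annotations: map[string]string{updateOrderAnnotation: "0"}},
		{Name: "worker-c", Annotations: map[string]string{updateOrderAnnotation: "10"}},
	}
	for _, n := range orderCandidates(nodes) {
		fmt.Println(n.Name)
	}
}
```

With these inputs, `worker-b` is skipped entirely and `worker-c` is listed ahead of the unannotated `worker-a`.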

openshift-ci-robot · 2020-10-15T00:45:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cgwalters · 2020-10-15T11:36:03Z

pkg/controller/node/node_controller.go

+				glog.Warningf("Failed to parse %s %s: %v", node.Name, daemonconsts.MachineUpdateOrderingAnnotationKey, err)
+				continue
+			}
+			// order 0 means "skip this node"


This is the simple PoC, but if we go this route I think we need to make this more observable; something like a count of "held/skipped" nodes in the pool as well, etc.

crawford · 2020-10-15T13:43:51Z

/hold

As mentioned in #2059 (comment), I'm against this direction. Its potential for misuse is too high and it's not clear to me what problem this solves. As for the implementation, assigning special function to 0 is no longer considered good API. I'd prefer an algebraic type or a separate field entirely.

If the intention of this PR is to provide a mechanism for pausing updates for a particular node, then let's specifically tackle that. I'm in favor of defining an annotation whose presence is the signal to MCO that this node should be skipped.
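
For illustration, a presence-based hold along these lines might look like the Go sketch below. The annotation key `example.com/update-hold`, the `node` struct, and both helper functions are hypothetical; the thread does not name a key or an implementation. The point of the sketch is that only the presence of the key matters, so no value needs a special meaning.

```go
package main

import "fmt"

// holdAnnotation is a hypothetical key chosen for this example only.
const holdAnnotation = "example.com/update-hold"

// node is a hypothetical stand-in for a Kubernetes Node object.
type node struct {
	Name        string
	Annotations map[string]string
}

// isHeld reports whether the node carries the hold annotation.
// Only the key's presence is checked; its value is ignored.
func isHeld(n node) bool {
	_, ok := n.Annotations[holdAnnotation]
	return ok
}

// partition splits candidates into nodes that may be updated and nodes
// that are held, so the number of held nodes can also be reported.
func partition(candidates []node) (updatable, held []node) {
	for _, n := range candidates {
		if isHeld(n) {
			held = append(held, n)
		} else {
			updatable = append(updatable, n)
		}
	}
	return updatable, held
}

func main() {
	nodes := []node{
		{Name: "worker-a"},
		{Name: "worker-b", Annotations: map[string]string{holdAnnotation: ""}},
	}
	updatable, held := partition(nodes)
	fmt.Printf("updatable=%d held=%d\n", len(updatable), len(held))
}
```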

cgwalters · 2020-10-15T14:15:22Z

> As mentioned in #2059 (comment), I'm against this direction. Its potential for misuse is too high and it's not clear to me what problem this solves.

I replied here on that concern: #2059 (comment)

I completely agree that OpenShift should by default be more intelligent about how we upgrade nodes, but I can't imagine we hardcode all of that logic into the node controller. An update ordering system seems like it really needs to be a separate controller with a higher-level view (including of machinesets, etc.). And on UPI metal, admins are just going to want full control. So I don't see how we can avoid a low-level API like this, at least eventually.

> If the intention of this PR is to provide a mechanism for pausing updates for a particular node, then let's specifically tackle that.

That's fair, yeah we can make that separate.

eparis · 2020-10-15T14:21:42Z

I asked Colin to look at the problem of designing an API to allow an explicit pause that wouldn't constrain us too much from doing more in the future. I don't have a strong opinion on how much further to go than just a 'paused/unpaused' bit.

cgwalters · 2020-10-15T16:42:13Z

OK, the hold-only change is #2163.

kikisdeliveryservice · 2020-10-15T19:28:57Z

To be clear, this PR is now the prioritize-updates PR and #2163 is the hold-updates PR?

cgwalters · 2020-10-15T21:12:48Z

> To be clear, this PR is now the prioritize-updates PR and #2163 is the hold-updates PR?

Yeah they're related but conceptually orthogonal. Since it seems we want #2163 more we can rebase this on that when it merges, or close this if we decide to take another direction.

(I guess in fact a controller could implement update priority by simply adding a hold to everything it didn't want to update, would be crude but...)
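
As a rough illustration of that idea, the Go sketch below computes which nodes an external controller would put a hold on so that only the highest-priority ones remain eligible. Everything here is assumed for the example: the `holdsFor` function, the priority map, and the `batch` parameter are hypothetical, and how the holds would actually be applied to nodes is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// holdsFor returns the names of the nodes to hold so that only the
// `batch` highest-priority nodes stay eligible for update.
// priority maps node name to a priority; higher means update sooner.
func holdsFor(priority map[string]int, batch int) []string {
	names := make([]string, 0, len(priority))
	for name := range priority {
		names = append(names, name)
	}
	// Highest priority first; ties are broken by name so the result
	// does not depend on map iteration order.
	sort.Slice(names, func(i, j int) bool {
		pi, pj := priority[names[i]], priority[names[j]]
		if pi != pj {
			return pi > pj
		}
		return names[i] < names[j]
	})
	if batch < 0 {
		batch = 0
	}
	if batch > len(names) {
		batch = len(names)
	}
	// Everything after the first `batch` names gets a hold.
	held := names[batch:]
	sort.Strings(held)
	return held
}

func main() {
	priority := map[string]int{"worker-a": 1, "worker-b": 5, "worker-c": 3}
	// Leave only the single highest-priority node eligible.
	fmt.Println(holdsFor(priority, 1))
}
```

Once the eligible nodes finish updating, such a controller would recompute the set and lift the hold on the next batch, which is the crude part: ordering is expressed only through repeated hold and release.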

openshift-merge-robot · 2020-11-05T23:08:46Z

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/okd-e2e-aws | `35ffe71` | link | `/test okd-e2e-aws` |
| ci/prow/e2e-aws | `35ffe71` | link | `/test e2e-aws` |
| ci/prow/e2e-aws-serial | `35ffe71` | link | `/test e2e-aws-serial` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot · 2021-02-16T23:40:45Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2021-03-19T02:39:01Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2021-04-18T06:45:40Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2021-04-18T06:45:46Z

@cgwalters: PR needs rebase.

openshift-ci-robot · 2021-04-18T06:45:51Z

@openshift-bot: Closed this PR.

openshift-ci-robot requested review from ericavonb and sinnykumari October 15, 2020 00:45

cgwalters mentioned this pull request Oct 15, 2020

Allow administrators to guide upgrade ordering #2059

Open

openshift-ci-robot added the approved label Oct 15, 2020

cgwalters force-pushed the controller-order branch from a0a6192 to 35ffe71 Compare October 15, 2020 00:54

openshift-ci-robot added the do-not-merge/hold label Oct 15, 2020

kikisdeliveryservice requested review from runcom and removed request for ericavonb October 15, 2020 19:29

kikisdeliveryservice added the team-mco label Nov 18, 2020

openshift-ci-robot added the lifecycle/stale label Feb 16, 2021

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 19, 2021

openshift-ci bot added the needs-rebase label Apr 18, 2021

openshift-ci-robot closed this Apr 18, 2021
