Pin and pre-load images #1481

---
title: pin-and-pre-load-images
authors:
- "@jhernand"
reviewers:
- "@avishayt"   # To ensure that this will be usable with the appliance.
- "@danielerez" # To ensure that this will be usable with the appliance.
- "@mrunalp"    # To ensure that this can be implemented with CRI-O and MCO.
- "@nmagnezi"   # To ensure that this will be usable with the appliance.
- "@oourfali"   # To ensure that this will be usable with the appliance.
approvers:
- "@sinnykumari"
- "@mrunalp"
api-approvers:
- "@sdodson"
- "@zaneb"
- "@deads2k"
- "@JoelSpeed"
creation-date: 2023-09-21
last-updated: 2023-09-21
tracking-link:
- https://issues.redhat.com/browse/RFE-4482
see-also:
- https://github.com/openshift/enhancements/pull/1432
- https://github.com/openshift/machine-config-operator/pull/3839
replaces: []
superseded-by: []
---

# Pin and pre-load images

## Summary

Provide a mechanism to pin and pre-load container images.

> *Review discussion:*
>
> What is the motivation for pinning? I understand pre-loading, but disqualifying the images from garbage collection in the kubelet means unused images will clutter the disk and potentially cause the disk to be overused. Is it to prevent CRI-O from removing them on upgrade as it does today?
>
> The motivation is for the cluster to be able to operate properly in the absence of an external registry. The storage should suffice in these use cases. The idea is to pin only platform and partner images, not all customer images running workloads on this cluster.
>
> If the only objective is to get the clusters to operate without a registry, then simply pre-loading should suffice IMO. Pinning doesn't seem necessary to me.
>
> How would you ensure the pre-loaded images don't get removed by GC before they are needed?
>
> If "the storage should suffice in these use cases" then GC should never kick in, as it's currently only triggered by disk usage hitting a certain threshold.
>
> Note also that users will be adding/modifying/removing their own workloads, which can bring new images, consume disk space and trigger the GC. We don't want the pinned images to be the ones evicted by that. Do we still agree that it is better to explicitly say that some images should never be removed?
>
> I do, though we do need to think through what the user experience/recovery/alerting is if they do run out of storage because of pinned images and the system can't evict anything. I.e. how does the user get notified of the problem? And once they hit the problem, is the cluster already broken or do they have time to take action before then? If they find out after the disk is full, what are the recovery steps, and what are the implications for the cluster and workloads until the recovery steps are performed?
>
> I'd say that problems due to exhausted disk space aren't new: they are already possible, and I assume that your concerns are already somehow addressed. However, there are some differences. I'll try to include these points in the document.
>
> Added this to the risks and mitigations section.

## Motivation

Slow and/or unreliable connections to the image registry servers interfere with
operations that require pulling images. For example, an upgrade may require
pulling more than one hundred images. Failures to pull those images cause
retries that interfere with the upgrade process and may eventually make it
fail. One way to improve that is to pull the images in advance, before they are
actually needed, and ensure that they aren't removed. Doing that provides a
more consistent upgrade time in those environments. That is important when
scheduling upgrades into maintenance windows, even if the upgrade might not
otherwise fail.

> *Review discussion:*
>
> Do we have data indicating that inability to pull an image (whether due to an issue in the registry, networking connectivity or whatever) is a common cause of upgrade failures? I think the more compelling motivation for this EP is to speed up upgrades by pre-pulling images, so that when the user actually clicks the "upgrade now" button the cluster spends less time in an upgrading state (which users tend to perceive as an unstable/risky state to be in).
>
> I don't have numbers. I know that the upgrade failure rate due to failures to pull images is 100% when there is no registry. That is our main use case. I am refraining from saying that this speeds up upgrades because in my experience it doesn't. A regular upgrade with a reasonable internet connection takes approximately 40 minutes. The same upgrade after pre-pulling the images also takes approximately 40 minutes, at least in my experience. I assume that is different when there is a severe bandwidth limitation, but I didn't test that scenario.
>
> It's safe to say that downloading the images in advance provides a more consistent upgrade time in environments with limited connectivity. This is important when scheduling upgrades into maintenance windows, even if the upgrade might not otherwise fail.
>
> Thanks @avishayt, added your comment.
>
> @bparees I know that it's not uncommon for upgrade CI jobs to have some image pulls get throttled, entering ImagePullBackOff, and that's in small five-node clusters; if the clusters were significantly larger the likelihood of throttling increases because we'd have more individual nodes pulling images.

### User Stories

#### Pre-load and pin upgrade images

As the administrator of a cluster that has a low bandwidth and/or unreliable
connection to an image registry server I want to pin and pre-load all the
images required for the upgrade in advance, so that when I decide to actually
perform the upgrade there will be no need to contact that slow and/or
unreliable registry server and the upgrade will successfully complete in a
predictable time.

> *Review discussion:*
>
> How are we picking the images to pre-load? Pulling the full payload may be more than what's actually needed. Are we going to look at the currently running images and only pull newer references of those from the payload?
>
> Determining the set of image references for an upgrade is out of the scope for this enhancement. It will be part of a separate enhancement (based on the now obsolete #1432) describing the orchestration of the upgrade without a registry. Anyhow, the initial idea was to have a single …
>
> ```yaml
> # For control plane nodes:
> apiVersion: machineconfiguration.openshift.io/v1alpha1
> kind: PinnedImageSet
> metadata:
>   name: my-control-plane-pinned-images
> spec:
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/control-plane: ""
>   pinnedImages:
>   ...
> ---
> # For worker nodes:
> apiVersion: machineconfiguration.openshift.io/v1alpha1
> kind: PinnedImageSet
> metadata:
>   name: my-worker-pinned-images
> spec:
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/worker: ""
>   pinnedImages:
>   ...
> ```
>
> I expect that the calculation of those sets of images for the control plane nodes and the worker nodes will be done in advance by CVO or by the tool that creates the upgrade bundle. We don't intend to look at the currently running images for that. Note that this same …
>
> I'd indeed keep that aside for now; the way those are calculated is in parallel to providing the relevant APIs to allow that to work.
>
> I'm fine with not addressing this now and just viewing it as roughly equivalent to mirroring the release image plus any additional images. However, long term I'd like to see us move toward all upgrades pre-fetching and pinning images, and before we get there I'd like to ensure that we try to limit the images as much as reasonable. We should however try to make it clear what disk space will be consumed by this effort.

#### Pre-load and pin application images

As the administrator of a cluster that has a low bandwidth and/or unreliable
connection to an image registry server I want to pin and pre-load the images
required by my application in advance, so that when I decide to actually deploy
it there will be no need to contact that slow and/or unreliable registry server
and my application will successfully deploy in a predictable time.

### Goals

Provide a mechanism that cluster administrators can use to pin and pre-load
container images.

### Non-Goals

We wish to use the mechanism described in this enhancement to orchestrate
upgrades without a registry server. But that orchestration is not a goal
of this enhancement; it will be part of a separate enhancement based on
parts of https://github.com/openshift/enhancements/pull/1432.

## Proposal

### Workflow Description

1. The administrator of a cluster uses the `ContainerRuntimeConfig` object to
   request that a set of container images are pinned and pre-loaded:

   ```yaml
   apiVersion: machineconfiguration.openshift.io/v1
   kind: ContainerRuntimeConfig
   metadata:
     name: ...
   spec:
     containerRuntimeConfig:
       pinnedImages:
       - quay.io/openshift-release-dev/ocp-release@sha256:...
       - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
       - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
       ...
   ```

> *Review discussion:*
>
> First, I think we should make it very ergonomic to automatically pin the images referenced by the release image. Not sure where that API object would live... maybe something in the CVO? OK, first actually, I think we should just default to having e.g. the MCO do a rolling "pre-upgrade" in addition to its actual upgrade. We could match it to the rollout of the MCD (the daemonset). Today the MCO does a node-by-node drain+upgrade (according to maxUnavailable), but we can change it to pre-pull images before it starts an upgrade on a node (or, for the very first node, it'd be concurrent). Then our API surface might have a tunable for "maximum number of nodes to preload/pin" like …
>
> @cgwalters I agree, but the higher level orchestration should happen at the CVO level and should be covered by another enhancement; the idea was to split between those in order to deliver this as building blocks.
>
> The user experience of asking the administrator to provide a list of all of the images in one of our payloads is going to be a bit brittle. I understand that we need to be able to pin things that are not in the payload, too, but could we offer some shorthand for release payloads? For example, could we have a separate field for pinning a release image that uses the same pull spec that we use in the ClusterVersion API?
>
> Alternatively, I'd expect this to be wrapped by tooling that is part of the upgrade flow, which generates/manages this resource for the user. That said, I haven't read #1483 yet, which is the "bootstrapping + upgrading" portion of this overall effort. If the intent of this EP is just "make it possible to pull some images to nodes and pin them on those nodes" then I think this is fine as is, and we'd cover "how does this API get configured with the images to be pulled and pinned for an upgrade" as part of the upgrade process/EP. I'm strongly supportive of that separation of concerns; we just need to ensure the API being defined here is one that is well aligned to how the upgrade flow will need to use it (e.g. it probably implies multiple instances of this resource, so that the upgrade tooling can create/manage its own "pin these images" list without conflicting with any list of images the user is providing for their workloads).
>
> In #1432 (now closed) we suggested to have the logic to generate the list of images for an upgrade inside CVO. I am trying to keep that out of this EP to keep this simple and independent of upgrades. That part of #1432 will eventually be in a yet to be created EP. Note that it will not be part of #1483, as that is only about not requiring a registry during the upgrade. I think a reasonable way to avoid conflicts could be to have something like this:
>
> ```yaml
> pinnedImageSets:
> - name: upgrade
>   images:
>   - quay.io/openshift-release-dev/ocp-release@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   ...
> - name: my-workload
>   images:
>   - quay.io/my-org/my-image-1@sha256:...
>   - quay.io/my-org/my-image-2@sha256:...
>   ...
> ```
>
> That would also help to include additional information regarding each set of images. For example, to have different sets of images for different types of nodes we could add a node selector:
>
> ```yaml
> pinnedImageSets:
> - name: upgrade-control-plane
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/control-plane: ""
>   images:
>   - quay.io/openshift-release-dev/ocp-release@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   ...
> - name: upgrade-workers
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/worker: ""
>   images:
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   ...
> ```
>
> Would that address your concerns? Note that all this would be part of the …
>
> Thanks @zaneb, added a non-goal for that.

1. The machine config operator ensures that all the images are pinned and
   pulled in all the nodes of the cluster.

> *Review discussion:*
>
> This is one final thing I am not clear on: who is pulling the images?
>
> I feel like we should determine that before we move forward, though I could be convinced otherwise.
>
> The suggestion is to have a new … This sub-controller can't pull the images directly because that requires talking to the gRPC API of CRI-O, and that is only available via … Ultimately the images will be pulled by the CRI-O running in each node; think of it as going to each node and running …
>
> I see this as similar to how MCO applies kubeletconfig and containerruntime config on a node: the kubeletconfig and containerruntime controllers process the respective CRD applied to the cluster, which results in rendering a new MachineConfig. The node sub-controller applies the desired config annotation on the node to signal the machine-config-daemon (MCD, running on each node) to apply the change. In a similar way, the PinnedImageSet controller would process the applied PinnedImageSet CRD and a new rendered MachineConfig would get generated. MCD learns to apply the change by pulling those images using `crictl pull` and whatever needs to be done locally on the node. Alternatively, we can introduce a new daemon if needed due to any unknown limitation in MCD. Note that introducing a new daemon would still require some sort of coordination so that they don't go sideways (like MCD starting a drain/reboot due to a change at the same time the new daemon is pulling the images).

### API Extensions

There are no new object kinds introduced by this enhancement, but new fields
will be added to existing `ContainerRuntimeConfig` objects.

> *Review discussion:*
>
> As per my understanding, we will also introduce a PinnedImageSet CRD in MCO?
>
> Yes, that is what I mention below: … Am I missing something?
>
> By a CRD for PinnedImageSet I meant something similar to the CRDs we have today, like machineconfig or kubeletconfig, so that admins have a friendly output to look at with `oc` or an equivalent utility.
>
> Yes, that is what I mean when I say "object". I replaced "object" with "custom resource" to try to make it clear. So, we will have a new …
>
> For clarity, a new type in the k8s API is a CRD (Custom Resource Definition) and an instance of that type is a CR (or object). </pedantry>
>
> I assume it's fine to modify a …
>
> Note that we are considering introducing a new …
>
> As long as it's just an addition of a field and it has a sane default for upgrading clusters that won't have the value specified, it's OK. @deads2k I know we've run into cases of clients that were updating resources instead of patching them and as a result they were stomping/dropping fields they didn't know about; is that something we worry about when modifying (extending) APIs, or does that just fall into "bad client behavior"?

The new fields for the `ContainerRuntimeConfig` object are defined in detail in
https://github.com/openshift/machine-config-operator/pull/3839.

> *Review discussion:*
>
> The MCO API and CRDs have moved to the openshift/api repository (https://github.com/openshift/api/tree/master/machineconfiguration/v1). Any API change will happen at the new location.
>
> Thanks @sinnykumari, I moved the API pull request to that repository: openshift/api#1609.

### Implementation Details/Notes/Constraints

Starting with version 4.14 of OpenShift, CRI-O will have the capability to pin
certain images (see [this](https://github.com/cri-o/cri-o/pull/6862) pull
request for details). That capability will be used to pin all the images
required for the upgrade, so that they aren't garbage collected by kubelet and
CRI-O.

> *Review discussion:*
>
> This is not quite true today: these images aren't currently disqualified from the image removal that CRI-O currently does on upgrade.
>
> FWIW: we're working on dropping this behavior in CRI-O, but it will take some time (4.17 time frame maybe?).
>
> So images will be removed when CRI-O itself is upgraded, regardless of being pinned? If so, is there any way around it? Would it help to disable the …
>
> Yeah, there's a setting to disable it. I also think it's reasonable to disqualify pinned images in the restart flow.
>
> Should we include that change to CRI-O (disqualify pinned images in the restart flow) in this enhancement?
>
> Yeah, I think so.
>
> OK, added a paragraph explaining that.

The changes to pin the images will be done in a `/etc/crio/crio.conf.d/pin.conf`
file, something like this:

```toml
pinned_images=[
  "quay.io/openshift-release-dev/ocp-release@sha256:...",
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...",
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...",
  ...
]
```

> *Review discussion:*
>
> Admins could override that file, which would cause a race. Do we need something like "multiple config-dirs" in CRI-O or a reserved namespace in the MCO?
>
> Maybe this file could be named after the … Less likely to conflict with files created by admins, but not totally impossible.
>
> Yeah, I think that's better.
>
> OK, added that in the latest patch.

The images need to be pre-loaded and the CRI-O service needs to be reloaded
when this configuration changes. To support that a new field will be added to
the `ContainerRuntimeConfig` object:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: ...
spec:
  containerRuntimeConfig:
    pinnedImages:
    - quay.io/openshift-release-dev/ocp-release@sha256:...
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
    ...
```

When the new `pinnedImages` field is added or changed the machine config
operator will need to pull those images (with the equivalent of `crictl pull`),
create or update the corresponding `/etc/crio/crio.conf.d/pin.conf` file and
ask CRI-O to reload its configuration (with the equivalent of `systemctl reload
crio.service`).

> *Review discussion:*
>
> We cannot support private registries here, right? At least no pull secrets.
>
> I think this supports private registries just fine: it uses the same mechanisms that CRI-O uses to pull any other images.
>
> Is the expectation that these images will be pulled and pinned on every node in the cluster? And that if a new node is added to the cluster, it will also pull and pin these images when it comes up? One thing I am wondering about is storage concerns: not all nodes run all images. Do we need a nodeSelector field to be included with a given list of images, in order to control which nodes actually pull and pin those images?
>
> Note that #1483 doesn't cover the overall upgrade approach. That was part of the now closed #1432 and will eventually be part of another EP. Our suggestion there is that CVO will be a user of this API: it extracts the list of images from the release image and updates the … Yes, the intent of this EP is just to pull and pin images on nodes, regardless of what those images will be used for. Yes, the images will be pulled and pinned in all the nodes of the cluster. Or else we can introduce the "pinned image set" concept to associate sets of images with node selectors, as described in another comment.
>
> When a new node comes up it should honor the …
>
> When a new node comes up you'll also have to unset the "all images pulled+pinned" status condition (wherever/however that is being reported) so that anything waiting on that (such as an upgrade coordinator) knows it now needs to go back to waiting. I'm guessing you don't expect new nodes to be getting introduced during bootstrap+upgrade, but it feels like a potential for race conditions.
>
> You are right. Our suggestion in the previous EP was to suspend autoscaling during the upgrade to avoid that potential race.

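For illustration only, the per-node effect of that reconciliation would be roughly equivalent to the following manual steps. This is a sketch, not the implementation: the digest references are placeholders taken from the examples above, and in practice the machine config operator performs these steps rather than an administrator.

```bash
# Pre-load the pinned images (placeholders for the real digest references).
crictl pull quay.io/openshift-release-dev/ocp-release@sha256:...
crictl pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...

# Create or update the CRI-O pinning configuration.
cat > /etc/crio/crio.conf.d/pin.conf <<'EOF'
pinned_images=[
  "quay.io/openshift-release-dev/ocp-release@sha256:...",
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...",
]
EOF

# Ask CRI-O to reload its configuration.
systemctl reload crio.service
```
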
The machine config operator will then use the gRPC API of CRI-O to run the
equivalent of `crictl pull` for each of the images. When that is completed the
machine config operator will update the new `status.pinnedImages` field of the
rendered machine config:

```yaml
status:
  pinnedImages:
  - quay.io/openshift-release-dev/ocp-release@sha256:...
  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
  ...
```

> *Review discussion:*
>
> Mean we mount … ?
>
> Yes, that requires access to …
>
> We don't mount `crio.sock` today in MCD.
>
> In a multi-node cluster, there may be different status values for each node. How can we track those? How are errors reported, so that if a node runs out of disk space or one of the images is inaccessible someone can discover that before trying the upgrade?
>
> +1, great point about per node status.
>
> For the user of this mechanism (be it the upgrade orchestrator or something else) what is important is whether the images it needs have been pulled and pinned or not. That user can check that by comparing the set of images it requested with the set of images contained in this … In a previous incarnation of this EP I proposed to have these details in a new … If we wanted to make things easy for users of this API (the upgrade orchestrator, for example), what would be convenient is to have some place in the API where they can check if all the images have been pulled and pinned in all the relevant nodes, without having to explicitly compare the sets of images. Something like this inside …
>
> ```yaml
> status:
>   pinnedImageSets:
>   - name: upgrade-control-plane
>     conditions:
>     - type: Ready
>       status: "True"
> ```
>
> With that, the user of the API would only have to check if that … Errors could then be reported with other conditions, for example:
>
> ```yaml
> status:
>   pinnedImageSets:
>   - name: upgrade-control-plane
>     conditions:
>     - type: Failed
>       message: |
>         Image 'quay.io/openshift-release-dev/...' can't be pulled because 'node0' doesn't
>         have enough disk space
> ```
>
> Should I move the EP in that direction? Note again that I am not doing that because I have been asked to avoid adding a …
>
> It does seem like the pinned image set deserves its own status. I wasn't part of the conversation that led to the decision to use ContainerRuntimeConfig or why we can't add more status to it, but it is starting to feel like having a specific CRD to drive and report on image pulling and pinning would be appropriate, as the API seems like it is going to grow in complexity.
>
> Changed the document to use a new …

### Risks and Mitigations

None.

### Drawbacks

This approach requires non-trivial changes to the machine config operator.

> *Review discussion:*
>
> At least as currently defined, another drawback is the implication that every pinned image is pulled/pinned on every node, even though most of the nodes will never run pods using a given image (e.g. master nodes will never run a user workload image, and worker nodes will never run control plane images, but there are more subtle cases too).
>
> Yes, there are subtleties, like compact three node clusters, where control plane nodes do run workloads. I think we can address this concern with the "pinned image set" concept that includes a node selector.
>
> Currently ctrcfgs are run per MCP, so a different set could be configured per control plane and worker pools, right?
>
> I think we can support different sets of images for control plane and workers adding a …
>
> ```yaml
> pinnedImageSets:
> - name: upgrade-control-plane
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/control-plane: ""
>   images:
>   - quay.io/openshift-release-dev/ocp-release@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   ...
> - name: upgrade-workers
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/worker: ""
>   images:
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
>   - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
> ```

## Design Details

### Open Questions

None.

### Test Plan

We add a CI test that verifies that images are correctly pinned and pre-loaded.

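As a hedged sketch of what such a test could check (the manifest name, node name and image digests below are placeholders, not something defined by this enhancement):

```bash
# Apply a configuration that requests pinning and pre-loading of the test images.
oc apply -f pinned-images.yaml

# Verify that the CRI-O pinning configuration was written on the node...
oc debug node/worker-0 -- chroot /host cat /etc/crio/crio.conf.d/pin.conf

# ...and that the images are present in the node's local storage.
oc debug node/worker-0 -- chroot /host crictl images --digests
```
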
### Graduation Criteria

The feature will ideally be introduced as `Dev Preview` in OpenShift 4.X,
moved to `Tech Preview` in 4.X+1 and declared `GA` in 4.X+2.

#### Dev Preview -> Tech Preview

- Availability of the CI test.

- Obtain positive feedback from at least one customer.

#### Tech Preview -> GA

- User facing documentation created in
  [openshift-docs](https://github.com/openshift/openshift-docs).

#### Removing a deprecated feature

Not applicable, no feature will be removed.

### Upgrade / Downgrade Strategy

Not applicable.

### Version Skew Strategy

Not applicable.

### Operational Aspects of API Extensions

Not applicable, there are no API extensions.

#### Failure Modes

#### Support Procedures

## Implementation History

There is an initial prototype exploring some of the implementation details
described here in this [repository](https://github.com/jhernand/upgrade-tool).

## Alternatives

The alternative to this is to manually pull the images in all the nodes of the
cluster, manually create the `/etc/crio/crio.conf.d/pin.conf` file and manually
reload the CRI-O service.

## Infrastructure Needed

Infrastructure will be needed to run the CI test described in the test plan
above.

> *Review discussion:*
>
> BTW containers/bootc#128 is related to this, and would have the advantage that it'd work in an ergonomic way outside of OCP too.