
Upgrade without registry enhancement #1432

Conversation

jhernand
Contributor

@jhernand jhernand commented Jun 29, 2023

OTA-1001: Upgrade without registry enhancement

This patch adds an enhancement that describes an automated mechanism to perform cluster upgrades without requiring an image registry server.

Related: https://issues.redhat.com/browse/RFE-4482
Related: https://issues.redhat.com/browse/OTA-1001
Related: https://issues.redhat.com/browse/OTA-997
Related: openshift/machine-config-operator#3839
Related: openshift/api#1548

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 29, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 29, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Jun 29, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lalatendumohanty for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oourfali

Looks good overall. Added some comments.

@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from 3416e01 to e34b7bb on June 30, 2023 13:46
@oourfali

oourfali commented Jul 2, 2023

@mrunalp Hi! can you please review? Who else should we tag?

@mrunalp
Member

mrunalp commented Jul 12, 2023

@sdodson ptal

@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from e34b7bb to 38c3594 on July 17, 2023 08:00
@jhernand
Contributor Author

Thanks for your review @mrunalp. I tried to incorporate it, please take another look.

@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from 38c3594 to fa8ba50 on July 26, 2023 06:03
@jhernand jhernand marked this pull request as ready for review July 26, 2023 06:04
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jul 26, 2023
@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch 6 times, most recently from 1bee1b0 to e159b94, on July 27, 2023 15:30
@oourfali oourfali left a comment

Overall it looks very good!
Some comments.

enhancements/update/upgrade-without-registry.md (outdated)
will be required to the CRI-O configuration during the upgrade.

1. The cluster version operator needs to orchestrate the upgrade process.


Also need to mention the need not to add/remove nodes while upgrading.

Contributor Author

In the "CVO will need to orchestrate the upgrade" we say this:

To do so it will first disable auto-scaling to ensure that no new nodes
are added to the cluster while this process is in progress

Is that enough?

@oourfali

@LalatenduMohanty We'd appreciate a review (feel free to pull others as well... but not via the always pull policy ;-) ).

@cgwalters
Member

There is a cnv test e.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-azure-deploy-cnv/1699895359578312704 - it's just informing today, but at least provides the baseline capability for joint teams to monitor its success.

does so in order to refresh catalog images that are specified with a tag. But it
also does it when it pulls catalog images that are specified with a digest. That
should be changed to use the `IfNotPresent` pull policy for catalog images that
are specified by digest.
Contributor

Changing the catalogs to use PullIfNotPresent when the image spec is a SHA was something I had started working on a while ago here:
operator-framework/operator-lifecycle-manager#2659

I summarized where it got stuck here:
operator-framework/operator-lifecycle-manager#2659 (comment)

so this would need to be picked back up. But I also wouldn't necessarily consider the catalog pods to be part of upgrading a cluster, in the sense that they do not come from the payload, so it's not clear if they should be covered here?

Is the goal to allow the cluster to run w/o a registry entirely, or to get its upgrade payload content w/o a registry? Because those are two different things.

tl;dr: more discussion/thought needed around the expectations for catalog images.

cc @joelanford
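For illustration, the change being discussed amounts to something like this in the catalog pod spec (a sketch; the image reference and names are placeholders, not taken from OLM):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-catalog        # hypothetical catalog pod
spec:
  containers:
  - name: registry-server
    # A digest-referenced image can never change, so re-pulling it when it is
    # already present only adds a registry dependency without any benefit.
    image: registry.example.com/catalogs/example-catalog@sha256:<digest>
    imagePullPolicy: IfNotPresent
```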

dump. If there is not enough space it will be reported as an error condition in
the `ContainerRuntimeConfig` object.

If there is enough space then the machine config operator will start a
Contributor

enough disk space? enough memory/compute resources?

and what will be done if there is not enough space to run this embedded registry server?

Contributor Author

Disk space. If there is not enough disk space the error will be reported in a condition and the upgrade aborted.


hmm I don't think we're running any registry here tbh. @jhernand, did I miss anything?

Contributor Author

We run a temporary registry server to read from the bundle and copy the images to the container storage of the node. Note that this is only needed when using a bundle; it isn't needed when only pre-loading and pinning the images.
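As a rough sketch of that copy step (the registry address, port, and image names are assumptions, not part of the proposal):

```bash
# The temporary registry serves the bundle contents on localhost only; each
# image is then copied into the node's container storage so that CRI-O can
# start containers from it without any further network access.
skopeo copy --src-tls-verify=false \
  docker://localhost:5000/openshift-release-dev/ocp-release:4.13.7-x86_64 \
  containers-storage:quay.io/openshift-release-dev/ocp-release:4.13.7-x86_64
```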

Contributor

Disk space. If there is not enough disk space the error will be reported in a condition and the upgrade aborted.

can you make sure the EP states these things?

object and the upgrade process will be aborted.

If there is enough space the `bundle-extractor` will extract the contents of
the bundle to the `/var/lib/upgrade/4.13.7-x86_64` directory.
Contributor

and if there is not enough space?

Contributor Author

The error will be reported in a condition and the upgrade will be aborted.
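A minimal sketch of that pre-flight check, assuming hypothetical paths (the EP only requires that the condition be reported and the upgrade aborted, not any particular implementation):

```bash
bundle=/var/lib/upgrade/bundle.tar            # hypothetical bundle location
target=/var/lib/upgrade/4.13.7-x86_64

# A real check would use the extracted size rather than the archive size.
required=$(stat --format=%s "$bundle")
available=$(df --output=avail --block-size=1 "$(dirname "$target")" | tail -n 1)

if [ "$available" -lt "$required" ]; then
  # In the proposal this becomes an error condition on the ContainerRuntimeConfig
  # object and the upgrade is aborted.
  echo "not enough disk space to extract the upgrade bundle" >&2
  exit 1
fi

mkdir -p "$target"
tar -xf "$bundle" -C "$target"
```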

Contributor

This is another case where a multi-node cluster might end up with different results for different nodes.

Contributor

The error will be reported in a condition and the upgrade will be aborted.

please update the EP to state this


1. The upgrade infrastructure of the cluster ensures that all the images
required for the upgrade are pinned and pre-loaded in all the nodes of the
cluster.
Contributor

at what point do these images get unpinned? presumably we don't want to keep them around forever, so is there a plan to remove the N-1 images from the previous upgrade, or something along those lines? (or at least unpin them so GC can remove them as needed)

Contributor Author

The images will be pinned until the next upgrade. At that point they will be unpinned, and the new ones will be pinned.

Contributor

can you update the EP to state this?

presumably "pinned until the next upgrade" means "until the next upgrade is complete" (so we can be sure nothing is referencing the previous SHAs)

An external registry server is the currently supported solution for upgrades in
disconnected environments. It doesn't require any change, but it is not
feasible in most of the target environments due to the resource limitations
described in the [motivation](#motivation) section of this document.
Contributor

in particular I assume this is referring to this:
https://cloud.redhat.com/blog/introducing-mirror-registry-for-red-hat-openshift

you might reference it.

Contributor Author

Yes, I am referring to that, added a link.

described in the [motivation](#motivation) section of this document.

An internal registry server, running in the cluster itself, is a feasible
alternative for clusters with multiple nodes. The registry server supported by
Contributor

maybe. the orchestration of this as the registry moves between nodes during reboots, and how it pulls its own image, seem like questions that would need to be answered for a self-hosted registry solution.

Contributor Author

In a multi-node scenario that internal registry is a highly available service. That guarantees that it will be able to pull its own image from another node. We tested that with a regular Quay instance, and it works correctly.

@bparees
Contributor

bparees commented Sep 8, 2023

It was touched on in the comments, but... this EP appears to address the problem of how the cluster can get new payload images w/o access to a registry.

But clusters don't exist to run payload images, so these clusters that are running w/o access to a registry..... where are they getting workload image content from? Are they running OLM operators? Where are they mirroring the olm content to and pulling it from?

And we still have some intent to move existing core payload content to be external/OLM provided (not to mention layered products and existing OLM operators that clusters tend to run). So to what extent do we really expect a customer to run clusters with literally no registry available? Or is there some benefit in having the registry but not having to mirror the payload into that registry?

I think maybe this EP should be split into two focuses:

  1. speeding up upgrades by pre-staging images on the nodes (does not matter where the images are pulled from, whether usb stick or a registry, the point is that once we start the upgrade, the nodes don't have to pull the images at that point, so the perceived upgrade time should go down)

  2. running clusters w/ no registry - not just where does payload content come from, but how the user/customer can provide their own images that are also needed (which the EP touches on, but more detail is needed imho). Not clear this is even a valid use case or how valuable it would be? I see the RFEs, but I don't see them tied to specific customer asks, so is this something that's being driven by customer requirements? But I don't think providing a solution that only addresses getting payload images w/o a registry helps customers, since they will still need a registry for whatever workloads they want to run on the cluster. So again, while the EP briefly references the idea that customers would be able to (and required to) include their custom images in the bundle that gets distributed to the nodes, I think more detail is needed on how this is expected to be achieved. In the case of a cluster running OLM operators, you need to understand your operator is going to want to upgrade to a new version, and then go get the associated images (catalog, operator bundle, operator, operand) for the new version, for example. We have tooling for mirroring OLM catalogs generally, but not for that sort of targeted copying that I'm aware of.

@wking
Member

wking commented Sep 9, 2023

running clusters w/ no registry - not just where does payload content come from, but how the user/customer can provide their own images that are also needed(which the EP touches on, but more detail needed imho).

More detail makes sense to me. Currently the enhancement has a non-goal entry saying:

It is not a goal to not require a registry server for other operations. For example, installing a new workload will still require a registry server.

I think that there should be some discussion in this enhancement of the RFE's:

I have a multi-node cluster that could have a complete power outage, and I want it to recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.

Because that is the use-case that distinguishes images into "needs special handling so the cluster can restore the in-cluster registry" and "everything else can be served from the in-cluster registry" buckets. User workload would be in the latter bucket. OCP web-console would be in the latter bucket. The only images needed to launch the local in-cluster registry after a cold reboot of the whole cluster are the registry image, its operator image, etcd/Kube scheduler-stuff? I dunno. But very core things. You don't even need a cluster-version operator image for that restore. Maybe if the image registry goes the library-go static-manifest route you don't even need the etcd/Kube stuff and the only image you need is the registry image?

This patch adds an enhancement that describes an automated mechanism to
perform cluster upgrades without requiring an image registry server.

Related: https://issues.redhat.com/browse/RFE-4482
Related: https://issues.redhat.com/browse/OTA-1001
Related: https://issues.redhat.com/browse/OTA-997
Signed-off-by: Juan Hernandez <[email protected]>
@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from fb447d7 to 58ef5d9 on September 11, 2023 07:30
@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2023

@jhernand: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jhernand
Contributor Author

It was touched on in the comments, but... this EP appears to address the problem of how the cluster can get new payload images w/o access to a registry.

But clusters don't exist to run payload images, so these clusters that are running w/o access to a registry..... where are they getting workload image content from? Are they running OLM operators? Where are they mirroring the olm content to and pulling it from?

This enhancement stems from the desire to upgrade clusters installed using the appliance. In that context images for the release, operators and customer workloads are available in an additional partition and served using a temporary registry server that runs as a systemd service and reads from that partition. Access to any image that isn't in that partition requires access to another registry server, and it isn't the intent of this enhancement to solve that problem.

And we still have some intent to move existing core payload content to be external/OLM provided (not to mention layered products and existing OLM operators that clusters tend to run). So to what extent do we really expect a customer to run clusters with literally no registry available? Or is there some benefit in having the registry but not having to mirror the payload into that registry?

If there is a registry server that is reliable and fast enough then none of this enhancement is needed. We are trying to address the situation where deploying that registry isn't feasible.

I think maybe this EP should be split into two focuses:

1. speeding up upgrades by pre-staging images on the nodes (does not matter where the images are pulled from, whether usb stick or a registry, the point is that once we start the upgrade, the nodes don't have to pull the images at that point, so the perceived upgrade time should go down)

2. running clusters w/ no registry - not just where does payload content come from, but how the user/customer can provide their own images that are also needed(which the EP touches on, but more detail needed imho).  Not clear this is even a valid use case or how valuable it would be?  I see the RFEs, i don't see it tied to specific customer asks, so is this something that's being driven by customer requirements?  But i don't think providing a solution that only addresses getting payload images w/o a registry helps customers since they will still need a registry for whatever workloads they want to run on the cluster.  So again while the EP briefly references the idea that customers would be able to (and required to) include their custom images in the bundle that gets distributed to the nodes, I think more detail is needed on how this is expected to be achieved.  In the case of a cluster running olm operators, you need to understand your operator is going to want to upgrade to a new version, and then go get the associated images (catalog, operator bundle, operator, operand) for the new version, for example.  We have tooling for mirroring olm catalogs generally, but not for that sort of targeted copying that i'm aware of.

Actually we already tried to separate those two intents. See the graduation criteria section, where we explain that we would like to implement the first intent first, and then the second. The first is as you describe: pre-load and pin images so that there is no need to contact a registry server during the upgrade. The second is almost as you describe, except that we are limiting ourselves to upgrades. Having both in this document seems to be a source of confusion. I will modify this enhancement so that it only covers the first part: pre-loading and pinning. I will create another one for the second intent.

In addition, for the cases where the cluster is completely disconnected or it
isn't possible to use a registry server:

1. An engineer uses the `oc adm upgrade create bundle ...` tool described in
Member

Which team will own oc adm upgrade create bundle? The installer team?

Contributor Author

I'd say the same team that owns the CVO, as it is specific to upgrades.

collaborating with customer technicians.

Eventually, the cluster will need to be upgraded, and then the technicians will
need tools that make the process as simple as possible, ideally requiring no
Member

Will the engineer be a support engineer? Or one of the development engineers who will build the tool? Will the OpenShift documentation be the main source of knowledge, or rather a detailed knowledge-specific document living in e.g. the installer/oc repo? What resources/logs/etc. will be requested from a customer to properly debug if anything goes wrong?

@sdodson sdodson left a comment

I need to re-read this as I've gotten lost between the earliest bits which clearly exclude a registry and later bits which seem to imply an image registry. I feel like we're relying on too many couplings. I think we can break things down into more distinct concerns, something like the following.

  • Bundle building, how do we turn a release image, metadata, and any supplemental images into a digested or signed archive
  • Bundle loading, how do we validate and extract that archive into container images and cluster resources (ie: configmaps w/ release payload signature, list of pinned images)
  • Image distribution, how do we get from container images in one directory on one host to all hosts
  • Image pinning, how do we ensure that those images are protected against garbage collection and unpin once no longer necessary
  • Cluster Updates

I'd very much like for the last step to largely remain unaltered. Perhaps no more than loading a validating webhook which says that when this scheme is in place the clusterversion is not updated unless that image is distributed and pinned on all hosts.

Comment on lines +37 to +39
Provide an automated mechanism to upgrade a cluster without requiring an image
registry server, and without requiring OpenShift knowledge for the technicians
performing the upgrades.
Member

Suggested change
Provide an automated mechanism to upgrade a cluster without requiring an image
registry server, and without requiring OpenShift knowledge for the technicians
performing the upgrades.
Provide an automated mechanism to upgrade a cluster without requiring an off cluster
image registry server, and without requiring OpenShift knowledge for the technicians
performing the upgrades.

I don't want the first sentence to preclude an Image Registry on cluster being part of the solution.

Contributor Author

Done.

Comment on lines +101 to +102
site, in particular I can't bring up an image registry server, neither outside
of the cluster nor inside. I want to plug the USB stick provided by the
Member

Please clarify why starting an image registry inside the cluster is unacceptable. If it's resource utilization, has the proposed method been confirmed to be substantially better in that regard?

Contributor Author

Starting a registry server inside single-node clusters is unacceptable because it introduces a chicken-egg problem if the registry server is a workload managed by the cluster.

For multi-node clusters an image registry server supported by Red Hat inside the cluster is unacceptable (that means full Quay) because it requires a large amount of resources (especially storage; it requires ODF) and additional subscriptions. We have confirmed that the proposed mechanism is better in these regards: it doesn't need additional disks, it doesn't require a full Quay instance, and it doesn't require storage or registry workloads running after the upgrade has completed.

Member

Contributor Author

Yes, but that is only supported in an external machine, not inside the cluster. In the context that we are trying to cover we can't ask the customer to bring up an additional machine for that mirror registry.

Member

But the cluster is HA and has protections against workloads becoming non HA. Would local storage operator with the registry configured to use file based storage not work? Even ephemeral storage seems like it could be made to work with an init container that mirrored content from its peer. The image to run the registry itself would need to be pinned.

Contributor Author

The local storage operator requires an additional disk, and we can't guarantee that will be available in this context. Single-node clusters in particular often have only one disk.

What you describe is very close to our bundle approach. We use ephemeral storage: a temporary directory created from the bundle contents. We mirror content from peers: copy the bundle from peers. The registry server code is kind of pinned: it is part of CVO (or MCO, that is an implementation detail). In addition it only consumes resources during the upgrade, after that it goes away.

Contributor

I'm also concerned about the proposed approach relying on low-level tools, including having to build new ones to prepare special software bundles, instead of developing an approach using higher-level tools like a regular registry and simply mirroring the required images into it (even if the mirror source is a USB drive).

Did anyone experiment with using the internal registry?

Member

But the cluster is HA...

The local cluster is not actually highly available, because we have to deliver RFE-4482's :

I have a multi-node cluster that could have a complete power outage, and I want it to recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.

For example, powering on the cluster after reaching the user-site. I still think we want to call that out in the use-case section of this enhancement, because it's the only reason I see that makes the in-cluster registry an insufficient solution.

Comment on lines +114 to +115
It is not a goal to not require a registry server for other operations. For
example, installing a new workload will still require a registry server.
Member

Why is this a non goal when it's also a goal to include user workload images in the bundle? What purpose does the workload image in the bundle serve if new workloads can be loaded from a remote registry?

Contributor Author

Using a registry server to deploy new workloads to the cluster is a choice of the customer. We do not want to interfere with that. What we want is to give customers the option of upgrading their existing workloads (which they probably installed with the appliance, so without an off-cluster registry server) without a registry server.

Member

Sure, this should not preclude use of an external registry for new workload, but it seems very arbitrary to say that you can upgrade existing workloads via this method but not install new workloads via this method. How is that different?

Maybe the gap in understanding here is my understanding of scope of the appliance, I'll go look for that but if you have a reference that'd be great.

Contributor Author

There are two separate mechanisms proposed here:

  1. The capability to pin and pre-load images.
  2. The upgrade bundle.

The first one, pin and pre-load images, can be universally useful and supportable, not just for upgrades but also for new workloads. But the second one is very specific to upgrades; I don't see how we can make it useful and supportable for installing new workloads.

@danielerez @nmagnezi can you point @sdodson to the appliance documentation?



### User Stories

#### Prepare an upgrade bundle
Member

The CVO only requires that the signature for the payload be loaded, which presumably this tool would do. The CVO could check the current and proposed release payload to ensure that they're connected in addition to validating those payloads and the metadata within them are authentic. This does not currently happen, as long as the payload signature is validated the CVO will update to any digest based payload.

As an engineer working in a partner's factory I want to be able to include in
the upgrade bundle custom images for workloads specific to the customer.

#### Explicitly allow vetted upgrade bundle
Member

Yeah, it's got to be either a digest supplied via pre-established trust (like the cluster admin setting the expected digest prior to accepting the contents of the bundle) or a signing mechanism with a pre-trusted key. Either way it needs to apply to the entire bundle, including the metadata that references any image loaded. That metadata needs to include release payload signatures as well.

does so in order to refresh catalog images that are specified with a tag. But it
also does it when it pulls catalog images that are specified with a digest. That
should be changed to use the `IfNotPresent` pull policy for catalog images that
are specified by digest.
Member

Regarding running a cluster versus just upgrading, if a drive fails what are the expectations upon replacement? Are these baremetal platform clusters where the host netboots and reinstalls or would physical media with RHCOS and the bundle be provided? Is it a challenge for the technician to know which bundle to provide if so? Maybe the drive is preloaded prior to delivery?

Comment on lines +227 to +229
not be able to start after the upgrade. To mitigate that risk when a bundle is
used the upgrade mechanism will check if there are pods using the `Always` pull
policy, and will emit a warning if there are any.
Member

What if it's opaque? A customer's binary operator creates the pod specs.

Contributor Author

In that case the potential problem will go undetected: it isn't 100% exact, but I think it helps anyhow.

Contributor

It seems like we would want the warning (or even an error) when building the bundle, rather than when it is installed.
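For what it's worth, a rough sketch of such a check (purely illustrative; neither the command nor where it runs is specified by the EP):

```bash
# List pods that use the Always pull policy and would therefore try to contact
# a registry on restart even though the image is already present on the node.
oc get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]?; .imagePullPolicy == "Always"))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```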

Comment on lines +408 to +410
Once the image is pinned and pre-loaded the cluster version operator will
inspect it and find out the references to the payload images. It will then also
pin and pre-load those images, updating the `ContainerRuntimeConfig` object:
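A sketch of what that might look like; the `pinnedImages` field is what this proposal adds, not an existing API, and the digests are placeholders:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: pin-upgrade-images
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    # Proposed field: images that must be pre-loaded and protected from
    # garbage collection until they are explicitly unpinned.
    pinnedImages:
    - quay.io/openshift-release-dev/ocp-release@sha256:<release-digest>
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<payload-image-digest>
```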
Member

Why the hand off to CVO here?

Contributor Author

It seems to me that this inspection of the release image is something that belongs in the CVO more than the MCO. I assume the CVO already does something similar today to verify signatures, for example.

Member

I think the CVO should just act on the image not care about how the image got there. Certainly if the CVO is explicitly pulling the image we should fix that.

Member

CVO does not pull anything, because we know that image access is fiddly. We get access to incoming release-image manifests by launching a version-* pod with a shared volume, so CRI-O can go fetch the image for us, the version-* containers copy the manifests into the shared volume, and the outgoing CVO grabs the manifests from the shared volume to start rolling out the update.
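For readers unfamiliar with that flow, the pattern is roughly the following (a generic sketch of the shared-volume approach, not the actual CVO manifest; the path is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: version-4.13.7            # hypothetical
spec:
  restartPolicy: Never
  volumes:
  - name: manifests
    hostPath:
      path: /var/lib/cvo/updatepayloads/<target-version>   # shared with the CVO
  containers:
  - name: payload
    # CRI-O pulls the release image for this container; the container itself
    # only copies the embedded manifests into the shared volume, where the
    # outgoing CVO picks them up.
    image: quay.io/openshift-release-dev/ocp-release@sha256:<digest>
    command: ["/bin/sh", "-c", "cp -r /release-manifests/. /manifests/"]
    volumeMounts:
    - name: manifests
      mountPath: /manifests
```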

Comment on lines +902 to +799
It is desirable to have another test that scans the OCP components looking for
use of the `Always` pull policy. This should probably run for each pull request
of each OCP component, and prevent merging if it detects that the offending pull
policy is used. We should consider adding admission in CI for this.
Member

I like the admission webhook suggested earlier, if that were deployed during an upgrade test and prior to OLM operator tests, would that catch all uses?


- The cluster is not connected to the Internet, or the bandwidth is very limited.
- It isn't possible to bring up additional machines, even temporarily.
- The resources of the cluster are limited, in particular in terms of CPU and memory.
Contributor

Why would the registry need to be running all the time if it's only needed during the upgrade?


In addition, for the cases where the cluster is completely disconnected or it
isn't possible to use a registry server:

1. An engineer uses the `oc adm upgrade create bundle ...` tool described in
Contributor

Could you make sure all of these details are written down in the alternatives section, please?


a fallback alternative. These need to be changed so that they don't do it or so
that they have a fallback mechanism when the registry server isn't available.

For example, in OpenShift 4.1.13 the machine config operator runs the
Contributor

Is 4.1.13 correct here? If so, is there a more recent version with this problem?

```

We explored several alternatives for the format of the bundle, and found a dump
of a registry server to be the best one.
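For illustration, such a dump is essentially an archive of a registry server's storage directory, roughly like this (layout abbreviated; the metadata file is an assumption, not something the registry storage format defines):

```
upgrade-bundle/
├── metadata.json                          # version, architecture, signatures (hypothetical)
└── docker/registry/v2/
    ├── blobs/sha256/...                   # manifests and layers, addressed by digest
    └── repositories/openshift-release-dev/ocp-release/...
```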
Contributor

Having all of these alternatives inline disrupts the flow of trying to understand the actual proposal. Could you move them down to the "Alternatives" section?


The cluster version operator will be monitoring device events in all the nodes
of the cluster in order to detect when such a USB stick (or any other kind of
media containing an upgrade bundle) has been plugged in. To control that a new
Contributor

This paragraph should say "The bundle-monitor will be monitoring device events..." instead of the CVO.
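Whichever component ends up owning it, the detection itself could be as simple as a udev rule, for example (entirely hypothetical names; the EP does not prescribe udev):

```
# /etc/udev/rules.d/99-upgrade-bundle.rules (hypothetical)
# Notify the bundle monitor when a filesystem labelled "upgrade-bundle" appears.
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_LABEL}=="upgrade-bundle", \
  RUN+="/usr/libexec/bundle-monitor --device=$env{DEVNAME}"
```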


...
]
}
node1:
Contributor

This is going to produce a lot of data in a large cluster, and a lot of potential collisions as nodes try to write their status to the same resource.

```bash
$ oc adm upgrade create bundle \
--arch=x86_64 \
--version=4.13.7 \
```
Contributor

Limiting this to a version and arch as input is going to make it challenging to create bundles with hot fixes. It might be better to use an explicit image reference here, just like with other admin commands that operate on images.
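A sketch of that alternative (the tool and its flags are only proposed in this EP, so the exact names are hypothetical):

```bash
# Build the bundle from an explicit release image reference instead of
# --version/--arch, which also covers hotfixes and nightly builds.
oc adm upgrade create bundle \
  --release-image=quay.io/openshift-release-dev/ocp-release@sha256:<digest> \
  --output=/var/tmp/upgrade-bundle.tar
```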

@sinnykumari sinnykumari left a comment

It seems to be a big change in the MCO workflow. I don't see any timeframe mentioned in the EP; what timeframe are we looking at for these changes, and who is going to implement and maintain them?

1. No OpenShift component should explicitly try to contact the registry server
without a fallback alternative.

1. The machine config operator needs to support image pinning, pre-loading and
Contributor

Do we perform image pinning only for OCP-specific images, or does it also include customer workload images?


It depends. But at a high level, the right approach would be:

  1. Pin OCP images - for cases where the cluster is used to run customer workloads only
  2. Pin OCP images and partner images - for cases where partners have an appliance (basically OCP + more infra workloads they need to run). In this case you'll want both OCP and partner images pinned, while the customer workload images aren't.

An example: Dell ships an appliance that includes OCP + PowerFlex storage. You'd need both the OCP images and the PowerFlex-related images pinned. This setup is used by customers to run their workloads. They can host those workload images in a registry (in cluster) and pull them from there. But the images required for OCP + the Dell software + the in-cluster registry should be pinned so that everything keeps running (across reboots, upgrades, etc.).

Contributor

ah got it, thanks for the explanation.

- The cluster is not connected to the Internet, or the bandwidth is very limited.
- It isn't possible to bring up additional machines, even temporarily.
- The resources of the cluster are limited, in particular in terms of CPU and memory.
- The technicians responsible for performing the upgrade have little or no OpenShift knowledge.
Member

These motivations do not seem right, because without a registry the customer cannot run their workloads. I think we want to separate the lifecycle events and infrastructure for OCP from the infrastructure required for workloads.

to be started manually later or if it should wait until the user explicitly sets
it to `Immediate`. The default value for that will be `Immediate`. This is
intended for situations where the user wishes to prepare everything for the
upgrade in advance, but wants to perform the upgrade later.
Member

I do not see a reason why the update schedule needs to be part of this enhancement. We are planning to introduce too many changes, and some of them are not directly related to the goal. The update schedule is a generic feature which should be a separate enhancement.

install) as well as workload images and buffer.

The changes to pin the images will be done in a
`/etc/crio/crio.conf.d/pin-upgrade.conf` file, something like this:
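A minimal sketch of such a drop-in (the key name follows the pinning mechanism proposed here; the exact format would be settled during implementation, and the digests are placeholders):

```toml
# /etc/crio/crio.conf.d/pin-upgrade.conf
[crio.image]
pinned_images = [
  "quay.io/openshift-release-dev/ocp-release@sha256:<release-digest>",
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<payload-image-digest>",
]
```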
Member

One thing that could actually simplify things here is to make this in /run/crio/crio.conf - i.e. it's an explicitly transient state before the next boot - and the config automatically goes away after the next boot (when we've upgraded).

I think we should flesh out some more "transient MachineConfig" style flows that write to /run, don't reboot, and don't have config drift monitoring etc.

@cgwalters
Member

cgwalters commented Sep 14, 2023

It seems to be a big change in the MCO workflow. I don't see any timeframe mentioned in the EP; what timeframe are we looking at for these changes, and who is going to implement and maintain them?

Well...this would have a larger blast radius, but I think a tension we're seeing here is we don't have a good place to do stuff like this because IMO it fits squarely in between the CVO and MCO.

But...we could create a new component (github.com/openshift-upgrade-orchestrator) or something that would sit below the CVO but own things like rolling out a temporary registry if applicable, etc.

(Or, failing that, it's literally a new controller in github.com/openshift/machine-config/operator/pkg/controller/upgrade-infra-orchestrator or something that could have a distinct OWNERS perhaps...)

jhernand added a commit to jhernand/machine-config-operator that referenced this pull request Sep 21, 2023
This patch adds a new `pinnedImages` field to the `ContainerRuntime`
type.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>
jhernand added a commit to jhernand/machine-config-operator that referenced this pull request Sep 21, 2023
This patch adds a new `pinnedImages` field to the `ContainerRuntime`
type.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>
jhernand added a commit to jhernand/enhancements that referenced this pull request Sep 21, 2023
This patch adds an enhancement that describes a mechanism to pin and
pre-load container images.

Related: https://issues.redhat.com/browse/RFE-4482
Related: https://issues.redhat.com/browse/OTA-1001
Related: https://issues.redhat.com/browse/OTA-997
Related: openshift/machine-config-operator#3839
Related: openshift#1432

Signed-off-by: Juan Hernandez <[email protected]>
@jhernand
Contributor Author

Dear reviewers, as agreed last week in the architecture meeting I am splitting this enhancement into three smaller enhancements:

  1. Pin and pre-load images.
  2. Don't require a registry server during upgrades.
  3. Add support for offline transport of the required images (a.k.a. upgrade bundles).

The two first ones are ready for you to take a look:

The third one is less important and I will create it later.

@jhernand
Contributor Author

I am closing this pull request. The review will continue in #1481 and #1483.

@jhernand jhernand closed this Sep 25, 2023
jhernand added a commit to jhernand/machine-config-operator that referenced this pull request Oct 2, 2023
This patch adds a new `PinnedImageSet` object that describes how to
pre-load and pin container images.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>
jhernand added a commit to jhernand/api that referenced this pull request Oct 6, 2023
This patch adds a new `PinnedImageSet` object that describes how to
pre-load and pin container images.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>