
Upgrade without registry enhancement #1432

Conversation

jhernand
Contributor

@jhernand jhernand commented Jun 29, 2023

OTA-1001: Upgrade without registry enhancement

This patch adds an enhancement that describes an automated mechanism to perform cluster upgrades without requiring an image registry server.

Related: https://issues.redhat.com/browse/RFE-4482
Related: https://issues.redhat.com/browse/OTA-1001
Related: https://issues.redhat.com/browse/OTA-997
Related: openshift/machine-config-operator#3839
Related: openshift/api#1548

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 29, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 29, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Jun 29, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lalatendumohanty for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oourfali

Looks good overall. Added some comments.

@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from 3416e01 to e34b7bb on June 30, 2023 13:46
@oourfali

oourfali commented Jul 2, 2023

@mrunalp Hi! can you please review? Who else should we tag?

@mrunalp
Member

mrunalp commented Jul 12, 2023

@sdodson ptal

@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from e34b7bb to 38c3594 on July 17, 2023 08:00
@jhernand
Contributor Author

Thanks for your review @mrunalp. I tried to incorporate it, please take another look.

@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from 38c3594 to fa8ba50 on July 26, 2023 06:03
@jhernand jhernand marked this pull request as ready for review July 26, 2023 06:04
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jul 26, 2023
@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch 6 times, most recently from 1bee1b0 to e159b94, on July 27, 2023 15:30
@oourfali oourfali left a comment

Overall it looks very good!
Some comments.

enhancements/update/upgrade-without-registry.md (outdated)
will be required to the CRI-O configuration during the upgrade.

1. The cluster version operator needs to orchestrate the upgrade process.


Also need to mention the need not to add/remove nodes while upgrading.

Contributor Author

In the "CVO will need to orchestrate the upgrade" we say this:

To do so it will first disable auto-scaling to ensure that no new nodes
are added to the cluster while this process is in progress

Is that enough?

@oourfali

@LalatenduMohanty We'd appreciate a review (feel free to pull others as well... but not via the always pull policy ;-) ).

@cgwalters
Member

There is a cnv test e.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-azure-deploy-cnv/1699895359578312704 - it's just informing today, but at least provides the baseline capability for joint teams to monitor its success.

does so in order to refresh catalog images that are specified with a tag. But it
also does it when it pulls catalog images that are specified with a digest. That
should be changed to use the `IfNotPresent` pull policy for catalog images that
are specified by digest.
Contributor

Changing the catalogs to use PullIfNotPresent when the image spec is a SHA was something I had started working on a while ago here:
operator-framework/operator-lifecycle-manager#2659

I summarized where it got stuck here:
operator-framework/operator-lifecycle-manager#2659 (comment)

so this would need to be picked back up. But I also wouldn't necessarily consider the catalog pods to be part of upgrading a cluster, in the sense that they do not come from the payload, so it's not clear if they should be covered here?

Is the goal to allow the cluster to run w/o a registry entirely, or to get its upgrade payload content w/o a registry? Because those are two different things.

tl;dr: more discussion/thought needed around the expectations for catalog images.

cc @joelanford
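For illustration, the change being discussed amounts to something like this in the catalog pod spec (a sketch; the image reference and names are placeholders, not taken from OLM):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-catalog        # hypothetical catalog pod
spec:
  containers:
  - name: registry-server
    # A digest-referenced image can never change, so re-pulling it when it is
    # already present only adds a registry dependency without any benefit.
    image: registry.example.com/catalogs/example-catalog@sha256:<digest>
    imagePullPolicy: IfNotPresent
```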

dump. If there is not enough space it will be reported as an error condition in
the `ContainerRuntimeConfig` object.

If there is enough space then the machine config operator will start a
Contributor

enough disk space? enough memory/compute resources?

and what will be done if there is not enough space to run this embedded registry server?

Contributor Author

Disk space. If there is not enough disk space the error will be reported in a condition and the upgrade aborted.


hmm I don't think we're running any registry here tbh. @jhernand, did I miss anything?

Contributor Author

We run a temporary registry server to read from the bundle and copy the images to the container storage of the node. Note that this is only needed when using a bundle; it isn't needed when only pre-loading and pinning the images.
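As a rough sketch of that copy step (the registry address, port, and image names are assumptions, not part of the proposal):

```bash
# The temporary registry serves the bundle contents on localhost only; each
# image is then copied into the node's container storage so that CRI-O can
# start containers from it without any further network access.
skopeo copy --src-tls-verify=false \
  docker://localhost:5000/openshift-release-dev/ocp-release:4.13.7-x86_64 \
  containers-storage:quay.io/openshift-release-dev/ocp-release:4.13.7-x86_64
```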

Contributor

Disk space. If there is not enough disk space the error will be reported in a condition and the upgrade aborted.

can you make sure the EP states these things?

object and the upgrade process will be aborted.

If there is enough space the `bundle-extractor` will extract the contents of
the bundle to the `/var/lib/upgrade/4.13.7-x86_64` directory.
Contributor

and if there is not enough space?

Contributor Author

The error will be reported in a condition and the upgrade will be aborted.
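A minimal sketch of that pre-flight check, assuming hypothetical paths (the EP only requires that the condition be reported and the upgrade aborted, not any particular implementation):

```bash
bundle=/var/lib/upgrade/bundle.tar            # hypothetical bundle location
target=/var/lib/upgrade/4.13.7-x86_64

# A real check would use the extracted size rather than the archive size.
required=$(stat --format=%s "$bundle")
available=$(df --output=avail --block-size=1 "$(dirname "$target")" | tail -n 1)

if [ "$available" -lt "$required" ]; then
  # In the proposal this becomes an error condition on the ContainerRuntimeConfig
  # object and the upgrade is aborted.
  echo "not enough disk space to extract the upgrade bundle" >&2
  exit 1
fi

mkdir -p "$target"
tar -xf "$bundle" -C "$target"
```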

Contributor

This is another case where a multi-node cluster might end up with different results for different nodes.

Contributor

The error will be reported in a condition and the upgrade will be aborted.

please update the EP to state this


1. The upgrade infrastructure of the cluster ensures that all the images
required for the upgrade are pinned and pre-loaded in all the nodes of the
cluster.
Contributor

at what point do these images get unpinned? presumably we don't want to keep them around forever, so is there a plan to remove the N-1 images from the previous upgrade, or something along those lines? (or at least unpin them so GC can remove them as needed)

Contributor Author

The images will be pinned until the next upgrade. At that point they will be unpinned, and the new ones will be pinned.

Contributor

can you update the EP to state this?

presumably "pinned until the next upgrade" means "until the next upgrade is complete" (so we can be sure nothing is referencing the previous SHAs)

An external registry server is the currently supported solution for upgrades in
disconnected environments. It doesn't require any change, but it is not
feasible in most of the target environments due to the resource limitations
described in the [motivation](#motivation) section of this document.
Contributor

in particular I assume this is referring to this:
https://cloud.redhat.com/blog/introducing-mirror-registry-for-red-hat-openshift

you might reference it.

Contributor Author

Yes, I am referring to that, added a link.

described in the [motivation](#motivation) section of this document.

An internal registry server, running in the cluster itself, is a feasible
alternative for clusters with multiple nodes. The registry server supported by
Contributor

maybe. the orchestration of this as the registry moves between nodes during reboots, and how it pulls its own image, seem like questions that would need to be answered for a self-hosted registry solution.

Contributor Author

In a multi-node scenario that internal registry is a highly available service. That guarantees that it will be able to pull its own image from another node. We tested that with a regular Quay instance, and it works correctly.

@bparees
Contributor

bparees commented Sep 8, 2023

It was touched on in the comments, but... this EP appears to address the problem of how the cluster can get new payload images w/o access to a registry.

But clusters don't exist to run payload images, so these clusters that are running w/o access to a registry..... where are they getting workload image content from? Are they running OLM operators? Where are they mirroring the olm content to and pulling it from?

And we still have some intent to move existing core payload content to be external/OLM provided (not to mention layered products and existing OLM operators that clusters tend to run). So to what extent do we really expect a customer to run clusters with literally no registry available? Or is there some benefit in having the registry but not having to mirror the payload into that registry?

I think maybe this EP should be split into two focuses:

  1. speeding up upgrades by pre-staging images on the nodes (does not matter where the images are pulled from, whether usb stick or a registry, the point is that once we start the upgrade, the nodes don't have to pull the images at that point, so the perceived upgrade time should go down)

  2. running clusters w/ no registry - not just where does payload content come from, but how the user/customer can provide their own images that are also needed (which the EP touches on, but more detail is needed imho). Not clear this is even a valid use case or how valuable it would be? I see the RFEs, but I don't see them tied to specific customer asks, so is this something that's being driven by customer requirements? But I don't think providing a solution that only addresses getting payload images w/o a registry helps customers, since they will still need a registry for whatever workloads they want to run on the cluster. So again, while the EP briefly references the idea that customers would be able to (and required to) include their custom images in the bundle that gets distributed to the nodes, I think more detail is needed on how this is expected to be achieved. In the case of a cluster running OLM operators, you need to understand your operator is going to want to upgrade to a new version, and then go get the associated images (catalog, operator bundle, operator, operand) for the new version, for example. We have tooling for mirroring OLM catalogs generally, but not for that sort of targeted copying that I'm aware of.

@wking
Member

wking commented Sep 9, 2023

running clusters w/ no registry - not just where does payload content come from, but how the user/customer can provide their own images that are also needed(which the EP touches on, but more detail needed imho).

More detail makes sense to me. Currently the enhancement has a non-goal entry saying:

It is not a goal to not require a registry server for other operations. For example, installing a new workload will still require a registry server.

I think that there should be some discussion in this enhancement of the RFE's:

I have a multi-node cluster that could have a complete power outage, and I want it to recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.

Because that is the use-case that distinguishes images into "needs special handling so the cluster can restore the in-cluster registry" and "everything else can be served from the in-cluster registry" buckets. User workload would be in the latter bucket. OCP web-console would be in the latter bucket. The only images needed to launch the local in-cluster registry after a cold reboot of the whole cluster are the registry image, its operator image, etcd/Kube scheduler-stuff? I dunno. But very core things. You don't even need a cluster-version operator image for that restore. Maybe if the image registry goes the library-go static-manifest route you don't even need the etcd/Kube stuff and the only image you need is the registry image?

This patch adds an enhancement that describes an automated mechanism to
perform cluster upgrades without requiring an image registry server.

Related: https://issues.redhat.com/browse/RFE-4482
Related: https://issues.redhat.com/browse/OTA-1001
Related: https://issues.redhat.com/browse/OTA-997
Signed-off-by: Juan Hernandez <[email protected]>
@jhernand jhernand force-pushed the upgrade_without_registry_enhancement branch from fb447d7 to 58ef5d9 on September 11, 2023 07:30
@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2023

@jhernand: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jhernand
Contributor Author

It was touched on in the comments, but... this EP appears to address the problem of how the cluster can get new payload images w/o access to a registry.

But clusters don't exist to run payload images, so these clusters that are running w/o access to a registry..... where are they getting workload image content from? Are they running OLM operators? Where are they mirroring the olm content to and pulling it from?

This enhancement stems from the desire to upgrade clusters installed using the appliance. In that context images for the release, operators and customer workloads are available in an additional partition and served using a temporary registry server that runs as a systemd service and reads from that partition. Access to any image that isn't in that partition requires access to another registry server, and it isn't the intent of this enhancement to solve that problem.

And we still have some intent to move existing core payload content to be external/OLM provided (not to mention layered products and existing OLM operators that clusters tend to run). So to what extent do we really expect a customer to run clusters with literally no registry available? Or is there some benefit in having the registry but not having to mirror the payload into that registry?

If there is a registry server that is reliable and fast enough then none of this enhancement is needed. We are trying to address the situation where deploying that registry isn't feasible.

I think maybe this EP should be split into two focuses:

1. speeding up upgrades by pre-staging images on the nodes (does not matter where the images are pulled from, whether usb stick or a registry, the point is that once we start the upgrade, the nodes don't have to pull the images at that point, so the perceived upgrade time should go down)

2. running clusters w/ no registry - not just where does payload content come from, but how the user/customer can provide their own images that are also needed(which the EP touches on, but more detail needed imho).  Not clear this is even a valid use case or how valuable it would be?  I see the RFEs, i don't see it tied to specific customer asks, so is this something that's being driven by customer requirements?  But i don't think providing a solution that only addresses getting payload images w/o a registry helps customers since they will still need a registry for whatever workloads they want to run on the cluster.  So again while the EP briefly references the idea that customers would be able to (and required to) include their custom images in the bundle that gets distributed to the nodes, I think more detail is needed on how this is expected to be achieved.  In the case of a cluster running olm operators, you need to understand your operator is going to want to upgrade to a new version, and then go get the associated images (catalog, operator bundle, operator, operand) for the new version, for example.  We have tooling for mirroring olm catalogs generally, but not for that sort of targeted copying that i'm aware of.

Actually we already tried to separate those two intents. See the graduation criteria section, where we explain that we would like to implement the first intent first, and then the second. The first is as you describe: pre-load and pin images so that there is no need to contact a registry server during the upgrade. The second is almost as you describe, except that we are limiting ourselves to upgrades. Having both in this document seems to be a source of confusion. I will modify this enhancement so that it only covers the first part: pre-loading and pinning. I will create another one for the second intent.

In addition, for the cases where the cluster is completely disconnected or it
isn't possible to use a registry server:

1. An engineer uses the `oc adm upgrade create bundle ...` tool described in
Member

Which team will own oc adm upgrade create bundle? The installer team?

Contributor Author

I'd say the same team that owns the CVO, as it is specific to upgrades.

collaborating with customer technicians.

Eventually, the cluster will need to be upgraded, and then the technicians will
need tools that make the process as simple as possible, ideally requiring no
Member

Will the engineer be a support engineer? Or one of the development engineers who will build the tool? Will the OpenShift documentation be the main source of knowledge, or rather a detailed knowledge-specific document living in e.g. the installer/oc repo? What resources/logs/etc. will be requested from a customer to properly debug if anything goes wrong?

@sdodson sdodson left a comment

I need to re-read this as I've gotten lost between the earliest bits which clearly exclude a registry and later bits which seem to imply an image registry. I feel like we're relying on too many couplings. I think we can break things down into more distinct concerns, something like the following.

  • Bundle building, how do we turn a release image, metadata, and any supplemental images into a digested or signed archive
  • Bundle loading, how do we validate and extract that archive into container images and cluster resources (ie: configmaps w/ release payload signature, list of pinned images)
  • Image distribution, how do we get from container images in one directory on one host to all hosts
  • Image pinning, how do we ensure that those images are protected against garbage collection and unpin once no longer necessary
  • Cluster Updates

I'd very much like for the last step to largely remain unaltered. Perhaps no more than loading a validating webhook which says that when this scheme is in place the clusterversion is not updated unless that image is distributed and pinned on all hosts.

Comment on lines +37 to +39
Provide an automated mechanism to upgrade a cluster without requiring an image
registry server, and without requiring OpenShift knowledge for the technicians
performing the upgrades.
Member

Suggested change
Provide an automated mechanism to upgrade a cluster without requiring an image
registry server, and without requiring OpenShift knowledge for the technicians
performing the upgrades.
Provide an automated mechanism to upgrade a cluster without requiring an off cluster
image registry server, and without requiring OpenShift knowledge for the technicians
performing the upgrades.

I don't want the first sentence to preclude an Image Registry on cluster being part of the solution.

Contributor Author

Done.

Comment on lines +101 to +102
site, in particular I can't bring up an image registry server, neither outside
of the cluster nor inside. I want to plug the USB stick provided by the
Member

Please clarify why starting an image registry inside the cluster is unacceptable. If it's resource utilization, has the proposed method been confirmed to be substantially better in that regard?

Contributor Author

Starting a registry server inside single-node clusters is unacceptable because it introduces a chicken-egg problem if the registry server is a workload managed by the cluster.

For multi-node clusters an image registry server supported by Red Hat inside the cluster is unacceptable (that means full Quay) because it requires a large amount of resources (especially storage; it requires ODF) and additional subscriptions. We have confirmed that the proposed mechanism is better in these regards: it doesn't need additional disks, it doesn't require a full Quay instance, and it doesn't require storage or registry workloads running after the upgrade has completed.

Member

Contributor Author

Yes, but that is only supported in an external machine, not inside the cluster. In the context that we are trying to cover we can't ask the customer to bring up an additional machine for that mirror registry.

Member

But the cluster is HA and has protections against workloads becoming non HA. Would local storage operator with the registry configured to use file based storage not work? Even ephemeral storage seems like it could be made to work with an init container that mirrored content from its peer. The image to run the registry itself would need to be pinned.

Contributor Author

The local storage operator requires an additional disk, and we can't guarantee that will be available in this context. Single-node clusters in particular often have only one disk.

What you describe is very close to our bundle approach. We use ephemeral storage: a temporary directory created from the bundle contents. We mirror content from peers: copy the bundle from peers. The registry server code is kind of pinned: it is part of CVO (or MCO, that is an implementation detail). In addition it only consumes resources during the upgrade, after that it goes away.

Contributor

I'm also concerned about the proposed approach relying on low-level tools, including having to build new ones to prepare special software bundles, instead of developing an approach using higher-level tools like a regular registry and simply mirroring the required images into it (even if the mirror source is a USB drive).

Did anyone experiment with using the internal registry?

Member

But the cluster is HA...

The local cluster is not actually highly available, because we have to deliver RFE-4482's :

I have a multi-node cluster that could have a complete power outage, and I want it to recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.

For example, powering on the cluster after reaching the user-site. I still think we want to call that out in the use-case section of this enhancement, because it's the only reason I see that makes the in-cluster registry an insufficient solution.

Comment on lines +114 to +115
It is not a goal to not require a registry server for other operations. For
example, installing a new workload will still require a registry server.
Member

Why is this a non goal when it's also a goal to include user workload images in the bundle? What purpose does the workload image in the bundle serve if new workloads can be loaded from a remote registry?

Contributor Author

Using a registry server to deploy new workloads to the cluster is a choice of the customer. We do not want to interfere with that. What we want is to give customers the option of upgrading their existing workloads (which they probably installed with the appliance, so without an off-cluster registry server) without a registry server.

Member

Sure, this should not preclude use of an external registry for new workload, but it seems very arbitrary to say that you can upgrade existing workloads via this method but not install new workloads via this method. How is that different?

Maybe the gap in understanding here is my understanding of scope of the appliance, I'll go look for that but if you have a reference that'd be great.

Contributor Author

There are two separate mechanisms proposed here:

  1. The capability to pin and pre-load images.
  2. The upgrade bundle.

The first one, pin and pre-load images, can be universally useful and supportable, not just for upgrades but also for new workloads. But the second one is very specific to upgrades; I don't see how we can make it useful and supportable for installing new workloads.

@danielerez @nmagnezi can you point @sdodson to the appliance documentation?



### User Stories

#### Prepare an upgrade bundle
Member

The CVO only requires that the signature for the payload be loaded, which presumably this tool would do. The CVO could check the current and proposed release payload to ensure that they're connected in addition to validating those payloads and the metadata within them are authentic. This does not currently happen, as long as the payload signature is validated the CVO will update to any digest based payload.

As an engineer working in a partner's factory I want to be able to include in
the upgrade bundle custom images for workloads specific to the customer.

#### Explicitly allow vetted upgrade bundle
Member

Yeah, it's got to be either a digest supplied via pre-established trust (like the cluster admin setting the expected digest prior to accepting the contents of the bundle) or a signing mechanism with a pre-trusted key. Either way it needs to apply to the entire bundle, including the metadata that references any image loaded. That metadata needs to include release payload signatures as well.

does so in order to refresh catalog images that are specified with a tag. But it
also does it when it pulls catalog images that are specified with a digest. That
should be changed to use the `IfNotPresent` pull policy for catalog images that
are specified by digest.
Member

Regarding running a cluster versus just upgrading, if a drive fails what are the expectations upon replacement? Are these baremetal platform clusters where the host netboots and reinstalls or would physical media with RHCOS and the bundle be provided? Is it a challenge for the technician to know which bundle to provide if so? Maybe the drive is preloaded prior to delivery?

Comment on lines +227 to +229
not be able to start after the upgrade. To mitigate that risk when a bundle is
used the upgrade mechanism will check if there are pods using the `Always` pull
policy, and will emit a warning if there are any.
Member

What if it's opaque? A customer's binary operator creates the pod specs.

Contributor Author

In that case the potential problem will go undetected: it isn't 100% exact, but I think it helps anyhow.

Contributor

It seems like we would want the warning (or even an error) when building the bundle, rather than when it is installed.
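For what it's worth, a rough sketch of such a check (purely illustrative; neither the command nor where it runs is specified by the EP):

```bash
# List pods that use the Always pull policy and would therefore try to contact
# a registry on restart even though the image is already present on the node.
oc get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]?; .imagePullPolicy == "Always"))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```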

Comment on lines +408 to +410
Once the image is pinned and pre-loaded the cluster version operator will
inspect it and find out the references to the payload images. It will then also
pin and pre-load those images, updating the `ContainerRuntimeConfig` object:
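A sketch of what that might look like; the `pinnedImages` field is what this proposal adds, not an existing API, and the digests are placeholders:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: pin-upgrade-images
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    # Proposed field: images that must be pre-loaded and protected from
    # garbage collection until they are explicitly unpinned.
    pinnedImages:
    - quay.io/openshift-release-dev/ocp-release@sha256:<release-digest>
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<payload-image-digest>
```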
Member

Why the hand off to CVO here?

Contributor Author

It seems to me that this inspection of the release image is something that belongs in the CVO more than the MCO. I assume the CVO already does something similar today to verify signatures, for example.

Member

I think the CVO should just act on the image not care about how the image got there. Certainly if the CVO is explicitly pulling the image we should fix that.

Member

CVO does not pull anything, because we know that image access is fiddly. We get access to incoming release-image manifests by launching a version-* pod with a shared volume, so CRI-O can go fetch the image for us, the version-* containers copy the manifests into the shared volume, and the outgoing CVO grabs the manifests from the shared volume to start rolling out the update.
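For readers unfamiliar with that flow, the pattern is roughly the following (a generic sketch of the shared-volume approach, not the actual CVO manifest; the path is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: version-4.13.7            # hypothetical
spec:
  restartPolicy: Never
  volumes:
  - name: manifests
    hostPath:
      path: /var/lib/cvo/updatepayloads/<target-version>   # shared with the CVO
  containers:
  - name: payload
    # CRI-O pulls the release image for this container; the container itself
    # only copies the embedded manifests into the shared volume, where the
    # outgoing CVO picks them up.
    image: quay.io/openshift-release-dev/ocp-release@sha256:<digest>
    command: ["/bin/sh", "-c", "cp -r /release-manifests/. /manifests/"]
    volumeMounts:
    - name: manifests
      mountPath: /manifests
```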

Comment on lines +902 to +799
It is desirable to have another test that scans the OCP components looking for
use of the `Always` pull policy. This should probably run for each pull request
of each OCP component, and prevent merging if it detects that the offending pull
policy is used. We should consider adding admission in CI for this.
Member

I like the admission webhook suggested earlier, if that were deployed during an upgrade test and prior to OLM operator tests, would that catch all uses?


- The cluster is not connected to the Internet, or the bandwidth is very limited.
- It isn't possible to bring up additional machines, even temporarily.
- The resources of the cluster are limited, in particular in terms of CPU and memory.
Contributor

Why would the registry need to be running all the time if it's only needed during the upgrade?


In addition, for the cases where the cluster is completely disconnected or it
isn't possible to use a registry server:

1. An engineer uses the `oc adm upgrade create bundle ...` tool described in
Contributor

Could you make sure all of these details are written down in the alternatives section, please?


a fallback alternative. These need to be changed so that they don't do it or so
that they have a fallback mechanism when the registry server isn't available.

For example, in OpenShift 4.1.13 the machine config operator runs the
Contributor

Is 4.1.13 correct here? If so, is there a more recent version with this problem?

```

We explored several alternatives for the format of the bundle, and found a dump
of a registry server to be the best one.
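For illustration, such a dump is essentially an archive of a registry server's storage directory, roughly like this (layout abbreviated; the metadata file is an assumption, not something the registry storage format defines):

```
upgrade-bundle/
├── metadata.json                          # version, architecture, signatures (hypothetical)
└── docker/registry/v2/
    ├── blobs/sha256/...                   # manifests and layers, addressed by digest
    └── repositories/openshift-release-dev/ocp-release/...
```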
Contributor

Having all of these alternatives inline disrupts the flow of trying to understand the actual proposal. Could you move them down to the "Alternatives" section?


The cluster version operator will be monitoring device events in all the nodes
of the cluster in order to detect when such a USB stick (or any other kind of
media containing an upgrade bundle) has been plugged in. To control that a new
Contributor

This paragraph should say "The bundle-monitor will be monitoring device events..." instead of the CVO.
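Whichever component ends up owning it, the detection itself could be as simple as a udev rule, for example (entirely hypothetical names; the EP does not prescribe udev):

```
# /etc/udev/rules.d/99-upgrade-bundle.rules (hypothetical)
# Notify the bundle monitor when a filesystem labelled "upgrade-bundle" appears.
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_LABEL}=="upgrade-bundle", \
  RUN+="/usr/libexec/bundle-monitor --device=$env{DEVNAME}"
```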


...
]
}
node1:
Contributor

This is going to produce a lot of data in a large cluster, and a lot of potential collisions as nodes try to write their status to the same resource.

```bash
$ oc adm upgrade create bundle \
--arch=x86_64 \
--version=4.13.7 \
```
Contributor

Limiting this to a version and arch as input is going to make it challenging to create bundles with hot fixes. It might be better to use an explicit image reference here, just like with other admin commands that operate on images.
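A sketch of that alternative (the tool and its flags are only proposed in this EP, so the exact names are hypothetical):

```bash
# Build the bundle from an explicit release image reference instead of
# --version/--arch, which also covers hotfixes and nightly builds.
oc adm upgrade create bundle \
  --release-image=quay.io/openshift-release-dev/ocp-release@sha256:<digest> \
  --output=/var/tmp/upgrade-bundle.tar
```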

@sinnykumari sinnykumari left a comment

It seems to be a big change in the MCO workflow. I don't see any timeframe mentioned in the EP; what timeframe are we looking at for these changes, and who is going to implement and maintain them?

1. No OpenShift component should explicitly try to contact the registry server
without a fallback alternative.

1. The machine config operator needs to support image pinning, pre-loading and
Contributor

Do we perform image pinning only for OCP-specific images, or does it also include customer workload images?


It depends. But at a high level, the right approach would be:

  1. Pin OCP images - for cases where the cluster is used to run customer workloads only
  2. Pin OCP images and partner images - for cases where partners have an appliance (basically OCP + more infra workloads they need to run). In this case you'll want both OCP and partner images pinned, while the customer workload images aren't.

An example: Dell ships an appliance that includes OCP + PowerFlex storage. You'd need both the OCP images and the PowerFlex-related images pinned. This setup is used by customers to run their workloads. They can host those workload images in a registry (in cluster) and pull them from there. But the images required for OCP + the Dell software + the in-cluster registry should be pinned so that everything keeps running (across reboots, upgrades, etc.).

Contributor

ah got it, thanks for the explanation.

- The cluster is not connected to the Internet, or the bandwidth is very limited.
- It isn't possible to bring up additional machines, even temporarily.
- The resources of the cluster are limited, in particular in terms of CPU and memory.
- The technicians responsible for performing the upgrade have little or no OpenShift knowledge.
Member

These motivations do not seem right, because without a registry the customer cannot run their workloads. I think we want to separate the lifecycle events and infrastructure for OCP from the infrastructure required for workloads.

to be started manually later or if it should wait until the user explicitly sets
it to `Immediate`. The default value for that will be `Immediate`. This is
intended for situations where the user wishes to prepare everything for the
upgrade in advance, but wants to perform the upgrade later.
Member

I do not see a reason why the update schedule needs to be part of this enhancement. We are planning to introduce too many changes, and some of them are not directly related to the goal. The update schedule is a generic feature which should be a separate enhancement.

install) as well as workload images and buffer.

The changes to pin the images will be done in a
`/etc/crio/crio.conf.d/pin-upgrade.conf` file, something like this:
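A minimal sketch of such a drop-in (the key name follows the pinning mechanism proposed here; the exact format would be settled during implementation, and the digests are placeholders):

```toml
# /etc/crio/crio.conf.d/pin-upgrade.conf
[crio.image]
pinned_images = [
  "quay.io/openshift-release-dev/ocp-release@sha256:<release-digest>",
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<payload-image-digest>",
]
```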
Member

One thing that could actually simplify things here is to make this in /run/crio/crio.conf - i.e. it's an explicitly transient state before the next boot - and the config automatically goes away after the next boot (when we've upgraded).

I think we should flesh out some more "transient MachineConfig" style flows that write to /run, don't reboot, and don't have config drift monitoring etc.

@cgwalters
Member

cgwalters commented Sep 14, 2023

It seems to be a big change in the MCO workflow. I don't see any timeframe mentioned in the EP; what timeframe are we looking at for these changes, and who is going to implement and maintain them?

Well...this would have a larger blast radius, but I think a tension we're seeing here is we don't have a good place to do stuff like this because IMO it fits squarely in between the CVO and MCO.

But...we could create a new component (github.com/openshift-upgrade-orchestrator) or something that would sit below the CVO but own things like rolling out a temporary registry if applicable, etc.

(Or, failing that, it's literally a new controller in github.com/openshift/machine-config/operator/pkg/controller/upgrade-infra-orchestrator or something that could have a distinct OWNERS perhaps...)

jhernand added a commit to jhernand/machine-config-operator that referenced this pull request Sep 21, 2023
This patch adds a new `pinnedImages` field to the `ContainerRuntime`
type.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>
jhernand added a commit to jhernand/machine-config-operator that referenced this pull request Sep 21, 2023
This patch adds a new `pinnedImages` field to the `ContainerRuntime`
type.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>
jhernand added a commit to jhernand/enhancements that referenced this pull request Sep 21, 2023
This patch adds an enhancement that describes a mechanism to pin and
pre-load container images.

Related: https://issues.redhat.com/browse/RFE-4482
Related: https://issues.redhat.com/browse/OTA-1001
Related: https://issues.redhat.com/browse/OTA-997
Related: openshift/machine-config-operator#3839
Related: openshift#1432

Signed-off-by: Juan Hernandez <[email protected]>
@jhernand
Contributor Author

Dear reviewers, as agreed last week in the architecture meeting I am splitting this enhancement into three smaller enhancements:

  1. Pin and pre-load images.
  2. Don't require a registry server during upgrades.
  3. Add support for offline transport of the required images (a.k.a. upgrade bundles).

The two first ones are ready for you to take a look:

The third one is less important and I will create it later.

@jhernand
Contributor Author

I am closing this pull request. The review will continue in #1481 and #1483.

@jhernand jhernand closed this Sep 25, 2023
jhernand added a commit to jhernand/machine-config-operator that referenced this pull request Oct 2, 2023
This patch adds a new `PinnedImageSet` object that describes how to
pre-load and pin container images.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>
jhernand added a commit to jhernand/api that referenced this pull request Oct 6, 2023
This patch adds a new `PinnedImageSet` object that describes how to
pre-load and pin container images.

Related: openshift/enhancements#1432
Related: https://issues.redhat.com/browse/RFE-4482
Signed-off-by: Juan Hernandez <[email protected]>