enhancements/machine-config: add updates to PinnedImageSet

Signed-off-by: Sam Batschelet <[email protected]>
openshift · Mar 15, 2024 · 25ce316
1 parent ffe13fc
Showing 1 changed file with 121 additions and 150 deletions.
diff --git a/enhancements/machine-config/pin-and-pre-load-images.md b/enhancements/machine-config/pin-and-pre-load-images.md
@@ -89,27 +89,77 @@ to request that a set of container images are pinned and pre-loaded:
     metadata:
       name: my-pinned-images
     spec:
-      nodeSelector:
-        matchLabels:
-          node-role.kubernetes.io/control-plane: ""
       pinnedImages:
-      - quay.io/openshift-release-dev/ocp-release@sha256:...
-      - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
-      - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
+      - name: "quay.io/openshift-release-dev/ocp-release@sha256:..."
+      - name: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:..."
+      - name: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:..."
       ...
     ```
 
-1. A new `PinnedImageSetController` sub-controller of the
-_machine-config-controller_ ensures that all the images are pinned and
-pulled in all the nodes that match the node selector.
+1. `MachineConfigPoolSpec` will add an optional field taking a reference to a `PinnedImageSet`.
+
+  ```yaml
+  apiVersion: machineconfiguration.openshift.io/v1
+  kind: MachineConfigPool
+  spec:
+    pinnedImageSets:
+    - name: "my-pinned-images"
+  ```
+
+1. A new validating webhook will be added to check the `PinnedImageSetRef`
+entries passed to `MachineConfigPoolSpec` and return appropriate
+errors before the update is applied.
+
+1. The `MachineConfigDaemon` will ensure adequate storage, manage image prefetch
+logic, `CRI-O` configuration updates and status reporting through the
+`MachineConfigPool` using the new `prefetch_image_manager`.
 
 ### API Extensions
 
 A new `PinnedImageSet` custom resource definition will be added to the
 `machineconfiguration.openshift.io` API group.
 
-The new custom resource definition is described in detail in
-https://github.com/openshift/api/pull/1609.
+```golang
+  type PinnedImageSetSpec struct {
+    // [docs]
+    // +kubebuilder:validation:Required
+    // +kubebuilder:validation:MinItems=1
+    // +kubebuilder:validation:MaxItems=2000
+    // +listType=map
+    // +listMapKey=name
+    PinnedImages []PinnedImageRef `json:"pinnedImages"`
+  }
+
+  type PinnedImageRef struct {
+    // [docs]
+    // +kubebuilder:validation:Required
+    // +kubebuilder:validation:MinLength=1
+    // +kubebuilder:validation:MaxLength=447
+    // +kubebuilder:validation:XValidation:rule=`self.matches('^([a-zA-Z0-9-]+\\.)+[a-zA-Z0-9-]+(:[0-9]{2,5})?/([a-zA-Z0-9-_]{0,61}/)?[a-zA-Z0-9-_.]*@sha256:[a-f0-9]{64}$')`,message="The OCI Image reference must be in the format host[:port][/namespace]/name@sha256:<digest> with a valid SHA256 digest"
+    Name string `json:"name"`
+  }
+```
+
+`MachineConfigPoolSpec` will add a `PinnedImageSets` field.
+
+```golang
+    // [docs]
+    // +openshift:enable:FeatureGate=PinnedImages
+    // +optional
+    // +listType=map
+    // +listMapKey=name
+    PinnedImageSets []PinnedImageSetRef `json:"pinnedImageSets,omitempty"`
+
+
+  type PinnedImageSetRef struct {
+    // [docs]
+    // +kubebuilder:validation:MinLength=1
+    // +kubebuilder:validation:MaxLength=253
+    // +kubebuilder:validation:Pattern=`^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$`
+    // +kubebuilder:validation:Required
+    Name string `json:"name"`
+  }
+```
 
 ### Implementation Details/Notes/Constraints
 
@@ -126,32 +176,22 @@ configuration parameter to "", but that would affect all images, not just the
 pinned ones. This behavior needs to be changed in CRI-O so that pinned images
 aren't removed, regardless of the value of `version_file_persist`.
 
-The changes to pin the images will be done in a file inside the
-`/etc/crio/crio.conf.d` directory. To avoid potential conflicts with files
-manually created by the administrator the name of this file will be the name of
-the `PinnedImageSet` custom resource concatenated with the UUID assigned by the
-API server. For example, if the custom resource is this:
+The changes to the `pinned_images` `CRI-O` configuration will be persisted to a
+static config file, `/etc/crio/crio.conf.d/50-pinned-images`.
 
 ```yaml
 apiVersion: machineconfiguration.openshift.io/v1alpha1
 kind: PinnedImageSet
 metadata:
   name: my-pinned-images
-  uuid: 550a1d88-2976-4447-9fc7-b65e457a7f42
 spec:
   pinnedImages:
-  - quay.io/openshift-release-dev/ocp-release@sha256:...
-  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
-  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
+  - name: "quay.io/openshift-release-dev/ocp-release@sha256:..."
+  - name: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:..."
+  - name: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:..."
   ...
 ```
 
-Then the complete path will be this:
-
-```txt
-/etc/crio/crio.conf.d/my-pinned-images-550a1d88-2976-4447-9fc7-b65e457a7f42.conf
-```
-
 The content of the file will be the `pinned_images` parameter containing the list
 of images:
 
@@ -164,71 +204,45 @@ pinned_images=[
 ]
 ```
 
-In addition to the list of images to be pinned, the `PinnedImageSet` custom
-resource will also contain a node selector. This is intended to support
-different sets of images for different kinds of nodes. For example, to pin
-different images for control plane and worker nodes the user could create two
-`PinnedImageSet` custom resources:
-
-```yaml
-# For control plane nodes:
-apiVersion: machineconfiguration.openshift.io/v1alpha1
-kind: PinnedImageSet
-metadata:
-  name: my-control-plane-pinned-images
-spec:
-  nodeSelector:
-    matchLabels:
-      node-role.kubernetes.io/control-plane: ""
-  pinnedImages:
-  ...
-
----
-
-# For worker nodes:
-apiVersion: machineconfiguration.openshift.io/v1alpha1
-kind: PinnedImageSet
-metadata:
-  name: my-worker-pinned-images
-spec:
-  nodeSelector:
-    matchLabels:
-      node-role.kubernetes.io/worker: ""
-  pinnedImages:
-  ...
-```
+A `PinnedImageSet` is tightly coupled to a `MachineConfigPool`: each custom
+resource is referenced at the pool level via the `MachineConfigPoolSpec`. This
+is flexible because it lets a controller or administrator assign sets to
+pools. Although a cluster has only two pools by default, `master` and `worker`,
+[custom pools](https://github.com/openshift/machine-config-operator/blob/master/docs/custom-pools.md)
+can be added for finer-grained control.
 
 This is especially convenient for pinning images for upgrades: there are many
 images that are needed only by the control plane nodes and there is no need to
 have them consuming disk space in worker nodes.
 
-When no node selector is specified the images will be pinned in all the nodes
-of the cluster.
-
-The _machine-config-controller_ will grow a new `PinnedImageSetController` sub
-controller that will watch the `PinnedImageSet` custom resources. When one of
-those is created, updated or deleted the controller will start a daemon set that
-will do the following in each node of the cluster:
+The _machine-config-daemon_ will grow a new `prefetch_image_manager` that uses
+the same general flow as the `certificate_writer`, which does not depend on
+defining the configuration as a `MachineConfig`. The manager will watch the
+`PinnedImageSet` and `MachineConfigPool` resources. On sync the manager will
+perform the following tasks:
+
+1. Using the same mechanism as the `MachineConfigDaemon` `update()` flow,
+`nodeWriter.SetWorking()` will update the node annotation, signaling to the pool
+that work has begun. If an error occurs during the pull, it will be translated
+to a `Degraded` = `true` status. By reusing this existing flow of
+`MachineConfigDaemon` we will require no additional node annotations. In the
+future, when `MachineConfigDaemon` migrates to `MachineConfigNode`,
+`PinnedImageSets` will also migrate.
+
+```sh
+NAME     CONFIG                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
+master   rendered-master-...   False     True       False      3              3                   2                     0
+```
 
 1. Verify that there is enough disk space available in the machine to pull the
    images.
 
-1. Create, modify or delete the CRI-O pinning configuration file.
-
-1. Reload the CRI-O configuration, with the equivalent of `systemctl reload crio`.
-
-    Note that currently CRI-O doesn't recalculate the set of pinned images on
-    reload, support for that will need to be added.
-
 1. Use the CRI-O gRPC API to run the equivalent of `crictl pull` for each of the
 images.
 
-This needs to be a daemon set running in each node of the cluster because it
-needs to create configuration files on the node, and use the CRI-O gRPC API,
-which is only accessible via `localhost`.
+1. Create, modify or delete the CRI-O pinning configuration file.
 
-This logic could be part of a new daemon set, specific for this, or else part of
-the _machine-config-daemon_.
+1. Reload the CRI-O configuration, with the equivalent of `systemctl reload crio`.
 
 In order to verify that there is enough disk space we will first check which of
 the images in the `PinnedImageSet` have not yet been downloaded, using CRI-O
@@ -244,89 +258,37 @@ complete OpenShift releases: blobs take 16 GiB and when they are written to disk
 they take 32 GiB.
 
 If the calculated disk space exceeds the available disk space then the
-`PinnedImageSet` will be marked as failed and the process will not move forward.
-For example:
-
-```yaml
-status:
-  conditions:
-  - type: Ready
-    status: "False"
-  - type: Failed
-    status: "True"
-    message: |
-      Pulling the images in `node12` requires at least 16 GiB of disk
-      space, but there are only 10 GiB available
-```
+`prefetch_image_manager` will report the error as a `Degraded` = `true` status
+for the `MachineConfigPool`, with the details in the condition `message`.
 
 Note that even with this check it will still be possible (but less likely) to
 have failures to pull images due to disk space: the heuristic could be wrong,
 and there may be other components pulling images or consuming disk space in
 some other way. Those failures will be detected and reported in the status of
-the `PinnedImageSet` when CRI-O fails to pull the image.
-
-The steps above will happen in all the nodes of the cluster. The daemon set
-will be provided with enough information to ensure that each pod applies only
-the changes required for the node where it runs, according to the node
-selectors in the `PinnedImageSet` custom resources.
+the `MachineConfigPool` when CRI-O fails to pull the image.
 
 When all the images have been successfully pinned and pulled in all the matching
-nodes the `PinnedImageSetController` will set the `Ready` condition to `True`:
+nodes the `MachineConfigDaemon` will use `nodeWriter.SetDone()` to communicate to
+the pool that work is done.
 
-```yaml
-status:
-  conditions:
-  - type: Ready
-    status: "True"
-  - type: Failed
-    status: "False"
-```
+Today the `Updated` status reflects only the rendered worker config. The message
+will need a small change to also report when the image prefetch has succeeded.
 
-If something fails the `Failed` condition will be set to `True`, and the
-details of the error will be in the message. For example if `node12` fails to
-pull an image:
-
-```yaml
-status:
-  pinnedImages:
-  - quay.io/openshift-release-dev/ocp-release@sha256:...
-  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
-  conditions:
-  - type: Ready
-    status: "False"
-  - type: Failed
-    status: "True"
-    message: Node 'node12' failed to pull image `quay.io/...` because ...
 ```
-
-We will consider using the new _machine-config-operator_ state reporting
-mechanism introduced [https://github.com/openshift/api/pull/1596](here) to
-report additional details about the progress or issues of each node.
-
-Additional information may be needed inside the `status` field of the
-`PinnedImageSet` custom resources in order to implement the mechanisms described
-above. For example, it may be necessary to have conditions per node, something
-like this:
-
-```yaml
-status:
-  node0:
-  - type: Ready
-    status: "False"
-  - type: Failed
+  - lastTransitionTime: "2024-03-14T20:31:32Z"
+    message: All nodes are updated with MachineConfig rendered-worker-be6914dc246ca0b222fc6b4e2181b6da and image prefetch complete
+    reason: ""
     status: "True"
-  node1:
-  - type: Ready
-    status: "True"
-  - type: Failed
-    status: "True"
-  ...
+    type: Updated
 ```
 
-Those details are explicitly left out of this document, because they are mostly
-implementation details, and not relevant for the user of the API.
+**Note:** Because the API is introduced as `v1alpha1` under `TechPreview`, this
+will not directly affect customer upgrades. Using `Node` annotations is not a
+`v1` goal; this accommodation gives `MachineConfigNode` time to mature
+before it is adopted. The goal is to use `MachineConfigNode` directly
+once it has graduated to `v1`.
 
-### Risks and Mitigations
+### Risks and Mitigation
 
 Pre-loading disk images can consume a large amount of disk space. For example,
 pre-loading all the images required for an upgrade can consume more than 32 GiB.
@@ -336,11 +298,20 @@ garbage collection. To ensure disk space issues are reported and that garbage
 collection doesn't interfere we will introduce new mechanisms:
 
 1. Disk space will be checked before trying to pre-load images, and if there are
-issues they will be reported explicitly via the status of the `PinnedImageSet`
-status.
+issues they will be reported explicitly via the status of the
+`MachineConfigPool`, setting the pool as `Degraded`. A `MachineConfigPool`
+in `Degraded` state will block upgrades, and this is by design: `PinnedImageSet`
+is a dependency of a cluster upgrade.
+
+1. There is a possibility that storage may become fully exhausted by unforeseen
+factors while images are being downloaded. Therefore, the configuration for
+`pinned_images` in CRI-O and the subsequent reloading of the CRI-O service are
+executed as the final steps, and only after all images have been successfully
+pulled. This approach ensures that, in an emergency situation where space needs
+to be freed up, the kubelet can remove these images to clear storage space.
 
 1. If disk space is exhausted while pulling the images, then the issue will be
-reported via the status of the `PinnedImageSet` as well. Typically these issues
+reported via the status of the `Node` as well. Typically these issues
 are reported as failures to pull images in the status of pods, but in this case
 there are no pods pulling the images. CRI-O will still report these issues in
 the log (if there is space for that), like for any other image pull.