From d01c6410bbb09ccd294b60f04cc8274257142352 Mon Sep 17 00:00:00 2001
From: Flavio Castelli
Date: Fri, 21 Dec 2018 18:49:39 +0100
Subject: [PATCH] WIP: RFC about container image update with 4.0

This is a WIP RFC that describes how, starting from 4.0, images should be
tagged and how we should deal with their update.
---
 2018/004-image-update-caasp-v4.md | 660 ++++++++++++++++++++++++++++++
 1 file changed, 660 insertions(+)
 create mode 100644 2018/004-image-update-caasp-v4.md

diff --git a/2018/004-image-update-caasp-v4.md b/2018/004-image-update-caasp-v4.md
new file mode 100644
index 0000000..a4cd638
--- /dev/null
+++ b/2018/004-image-update-caasp-v4.md
@@ -0,0 +1,660 @@

# Image tagging and update strategy for CaaSP 4.0

| Field  | Value      |
|:-------|:-----------|
| Status | Draft      |
| Date   | YYYY-MM-DD |

## Introduction

> This section targets end users, operators, PM, SEs and anyone else that might
> need a quick explanation of your proposed change.

CaaSP has always been running some of its components using containers. In the
past, container images have always been delivered via RPM packages. On
CaaSP all packages are managed by the `transactional-update` tool. This
technology requires nodes to be rebooted every time a new package is installed.
The update mechanism of CaaSP, both for RPMs and for container images, has been
designed around that.

Starting from v4 all the container images are going to be distributed using the
SUSE registry. That means RPM packages are no longer going to be used to
install and update containers.

Due to this change we have to revisit the update story for container images.

Before we proceed to the details of this RFC, it's important to outline how
our images are tagged.

All our images are built inside of the build service and have the following
tags associated with them:

 * **short:** `<version>`; this is something like `2.6.1`.
 * **intermediate:** `<version>-<revision>`; this is something like
   `2.6.1-1`. The revision is hard-coded into the KIWI image definition; image
   owners are going to bump it manually.
 * **long:** `<version>-<revision>-<buildID>`. The `buildID` is a
   value automatically generated by the Open Build Service. This tag would be
   something like `2.6.1-1.100`. Note well: the `buildID` changes after each
   build and is specific to the project where the image is being built.
   To be more explicit: the same image built inside of the `Devel:CaaSP`
   project will have different numbers compared to the very same one built
   inside of `SUSE:CaaSP`.

### Problem description

> Why do we want this change and what problem are we trying to address?

Starting with v4, rebooting a node to get newer container images is no longer
needed: images can be pulled from the registry on demand.

How should CaaSP images be updated? We think some kinds of updates could be
done in a fully automated way (no user interaction), while others would still
need some level of coordination; hence, they should not be done in an automated
fashion.

These are the possible scenarios that lead to the update of a container image:

 1. The development team updates the *"core"* application of the image (eg:
    `MariaDB` going from version `1.2.3` to `1.3.0`). All the tags of the
    image would change: short, intermediate and long.
 2. The development team keeps the *"core"* application at the same version
    but fixes an image specific bug (a misconfiguration, a bug inside of the
    entrypoint, ...). The short tag would not change. The intermediate and
    long ones would.
 3. The maintenance team publishes an automatic rebuild of the image: the
    *"core"* application isn't changing, one or more of its dependencies are
    rebuilt (eg: a `libopenssl` CVE issue). Only the long tag would change.
 4. The maintainer of the *"core"* application releases a regular SLE update
    of the application. The container image is automatically rebuilt. This
    case should be handled carefully: the image tags won't match the core
    application version. We expect only patch releases of the *"core"*
    application to be released in this way. In theory this kind of upgrade
    should not break the usage of the container image.

Finally, this RFC intends to set rules about how to reference images inside of
CaaSP.

With v3 the container images are referenced in two ways:

 * Images running on the admin node: their exact tag is calculated on every
   boot, based on the latest tag that is available on the node.
 * Images running on the CaaSP cluster: the exact tag is hard-coded into the
   `kubernetes-salt` package. A reboot of the admin node is needed to get the
   manifest files with the new numbers; a whole cluster update is needed to
   have salt propagate the new manifest files and trigger the usage of the
   new container images.

With v4 we are still going to ship the kubernetes manifest files via two
regular RPMs. The only difference compared to v3 is that we are going to
hard-code the tags of all the images, including the ones running on the admin
node.

We want to reference images using a version instead of a fixed tag like `latest`
or `stable`. This makes it easier for a customer to roll back to a previous
release if something goes wrong. It also makes it easier to understand which
versions of the images are being used across the whole cluster
(`latest`/`stable` can have different meanings across nodes).

To summarize, these are the questions we want to address with this RFC:

 * How should CaaSP images be updated with the first release of v4? We think
   some kinds of updates could be done in a fully automated way (no user
   interaction), while others would still need some level of coordination.
 * How should we reference images: via their short, intermediate
   or long tags?

### Proposed change

> A brief summary of the proposed change - the 10,000 ft view on what it will
> change once this change is implemented.

We are proposing the following upgrade policies (please look at the previous
section for a detailed explanation of the different cases):

 * Case #1 - core application upgrade: we want to be fully in charge of that.
   New salt states and kubernetes manifests must be available on the admin node
   before we can use these images. This cannot be automated.
 * Case #2 - same core application version, only fixes to the image: in some
   corner cases we could still need changes to the salt states and/or the
   kubernetes manifests. Hence, this cannot be automated either.
 * Case #3 - maintenance update: no changes are needed to either the salt
   states or the kubernetes manifests. These updates can be automated.
 * Case #4 - update of the core application via an SLE maintainer: we think
   the best solution for this scenario is to have the maintenance QA block this
   automatic rebuild of the image from being published. They should notify the
   development team about this core application upgrade so that everything can
   fall back to case #1.

It's important to point out that we are not looking for the perfect solution;
we are looking for something that allows us to release CaaSP 4.0 on time,
without introducing technical debt, without making the whole upgrade story
worse than the one of v3 and without having to write that much code.

V3 already has some limitations; we are fine with keeping them around for 4.0:

 * Nodes have to be rebooted to get the latest images.
 * Admin node images always use the latest version available on the node.
 * Nodes are rebooted even when not actually needed: images are delivered via
   RPMs, so when a new velum image is published all the nodes of the cluster
   are going to be rebooted even if the only one that actually needs this
   package is the admin node.
 * Kubernetes add-ons (dex, tiller, kubedns) are currently referenced by their
   short tag.
 * Kubernetes daemon-sets (flannel) are referenced by their short tag.
 * Static pods on the Kubernetes nodes (haproxy) are referenced by their
   short tag.
 * The short tag consists only of the version of the core application shipped
   inside of the image. If the image is rebuilt due to a bug or a security fix
   the short tag won't change. The usage of short tags can lead customers to
   run unpatched or buggy images without being aware of that.

The proposed solution for 4.0 is going to keep some of these limitations
(reboots) but is going to improve the security aspects.

## Detailed RFC

> In this section of the document the target audience is the dev team. Upon
> reading this section each engineer should have a rather clear picture of what
> needs to be done in order to implement the described feature.

### Changes for image building

Images are going to have 3 tags:

 * **short:** this is the core application version. For example, given `2.6.2`
   is the version of the `docker-registry` application, the short tag of its
   image is going to be `2.6.2`.
 * **intermediate:** this is the core application version plus the image
   revision number. The revision number is written into the KIWI spec file and
   is updated manually by the developers every time a fix is made to the image.
   The intermediate tag for the first iteration of the docker-registry image
   would be `2.6.2-rev1`.
 * **long:** intermediate tag + build numbers generated by the Open Build
   Service. The long tag for the first iteration of the docker-registry image
   would be something like `2.6.2-rev1+5.1`, where `5.1` are numbers that
   are automatically incremented by the Open Build Service every time the
   image is built.

Images are going to be pushed to the SUSE registry; they are not going to be
wrapped into RPMs anymore.

We will keep using a post-build hook on the build service to create an RPM file
per image: `<image-name>-metadata`. This RPM file will have a simple
text file inside of it with the metadata of the image (like all its tags).

The contents of the file are not actually useful; the purpose of this RPM is
to allow us to keep using the current upgrade mechanism of CaaSP v3, which
relies on the assumption that container image upgrades are delivered via RPMs.

Everything is going to be explained in the remaining parts of the document.

To summarize, given the `velum` image, the Open Build Service will create the
following artifacts:

 * Native container image: pushed to the SUSE registry.
 * `velum-container-image-metadata` RPM package: contains the simple text file
   described above; it's pushed to the usual update channel of CaaSP.
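
To make the tagging scheme concrete, the three tags described earlier for the
`docker-registry` image would relate to each other as in the following sketch.
The registry path is an assumption used only for illustration; the tag values
follow the examples given above.

```yaml
# Illustrative only: the registry path is an assumption, the tag values follow
# the docker-registry example above. All three tags point at the same build.
docker-registry:
  short: registry.suse.com/caasp/v4/docker-registry:2.6.2              # core application version
  intermediate: registry.suse.com/caasp/v4/docker-registry:2.6.2-rev1  # version + image revision
  long: registry.suse.com/caasp/v4/docker-registry:2.6.2-rev1+5.1      # version + revision + OBS build counter
```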

### Changes for the `caasp-manifest` package

This is the package that includes the manifest files of all the pods that are
running on the admin node. The files are installed under
`/usr/share/caasp-container-manifests`. The package also provides a service
that runs at boot time.

The service reads the template file shipped by the package and writes the
rendered output to `/etc/kubernetes/manifests`. This location is monitored by
kubelet, which will take care of starting the pods.

The `caasp-manifest` package will be changed to:

 * No longer provide template files.
 * Install all the manifest files under `/etc/kubernetes/manifests`.
 * All the images mentioned inside of the manifest files will have a specific
   tag already hard-coded.
 * All the images are going to be referenced by using their intermediate tag.
 * All the manifests will force kubelet to use the "always pull" policy.
 * The service that previously rendered the template files and wrote them
   under `/etc/` is no longer needed: it must be dropped from the package.

### Changes for the `kubernetes-salt` package

All the Kubernetes add-ons (dex, tiller, kubedns), "static" pods (haproxy)
and daemon sets (flannel) are shipped via the `kubernetes-salt` package.

All these manifest files must be changed in the following way:

 * Reference all the images using their intermediate tag.
 * Use the "always pull" policy.

### Changes for the base operating system

This section describes the changes that have to be done to the underlying
operating system. We basically have to alter its package selection.
That will require changes to our KIWI templates (the ones used to create
ISOs, qcows, ...) and to our package patterns.

The following changes have to be made:

 * Remove `container-feeder`.
 * Remove all the RPMs that are providing the pre-built container images.
 * Add all the RPMs that are providing the container image metadata files
   (`velum-container-image-metadata`, `dex-container-image-metadata`, ...).

## Upgrade scenarios on the admin node

This section will describe all the possible upgrade scenarios involving
container images that can happen on the admin node.

For each one of them we will illustrate what is going to happen and any
limitations/risks.

### New image is released

This happens with one of the following cases:

 * The core application version changes: a new set of tags is created (short,
   intermediate, long).
 * The core application version doesn't change, but a fix is made to the image
   (eg: its entrypoint is fixed): the short tag doesn't change, the
   intermediate and long tags change.

Behind the scenes:

 * [developers] Update the KIWI definition of the image.
 * [developers] Update the `caasp-manifest` package: the manifest file that
   uses the image has to reference the new intermediate tag (which is constant
   and always predictable).
 * [QA] Test the fresh image. The stability and predictability of the
   intermediate tags make their life easier and reduce the chances of
   mistakes.
 * Image is published to the SUSE registry.
 * `caasp-manifest` and `<image-name>-metadata` packages are published to the
   CaaSP Updates channel.

Considerations about the images:

 * The old image can still be pulled via its intermediate and long tags.
 * Using the short tag will pull the new image.
 * The new image can also be pulled using its intermediate and long tags.
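
As a concrete illustration of the `caasp-manifest` changes described earlier,
the updated admin node manifest pinned to the new intermediate tag might look
like the following sketch. The pod name, registry path and tag value are
assumptions used only for illustration; the relevant parts are the hard-coded
intermediate tag and the "always pull" policy.

```yaml
# Minimal sketch of a static pod manifest installed by caasp-manifest under
# /etc/kubernetes/manifests. Names, registry path and tag are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: velum
  namespace: kube-system
spec:
  containers:
    - name: velum
      # Intermediate tag, hard-coded inside the package (no `latest`/`stable`).
      image: registry.suse.com/caasp/v4/velum:3.0.0-rev2
      # The "always pull" policy: the engine checks the registry on every start.
      imagePullPolicy: Always
```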

What happens before `transactional-update` is run:

 * If the pod running the container image dies:
   * kubelet will automatically restart the pod.
   * The manifest file is still referencing the old image: the intermediate
     tag used is the one of the old image.
   * The "always pull" policy will force the container engine to look for a
     newer version of the old image.
   * The pod starts immediately using the old image.
 * If the admin node is rebooted: same outcome as the scenario above.

When `transactional-update` is run:

 * `transactional-update` notices there are package updates
   available on all the nodes of the cluster.
 * `transactional-update` creates a new snapshot with the updated packages
   installed and sets the "has updates" grain on each node of the cluster.
 * Velum notices the value of the grain on the admin node: the
   *"update admin node"* button is shown.

What happens if the old container pod dies before the admin node is rebooted:

 * kubelet will automatically restart the pod.
 * The manifest file is still referencing the old image: the intermediate tag
   used is the one of the old image.
 * The "always pull" policy will force the container engine to look for a
   newer version of the old image.
 * The pod starts immediately using the old image.

When the admin node is rebooted:

 * The new manifest files are in place; they reference the intermediate tag of
   the new image.
 * The new image is pulled from the registry.
 * Velum will allow the user to press the "update nodes" button. This update
   is not needed by the nodes of the cluster; it's just a waste of time. This
   is a limitation that has affected CaaSP since v1.

### Automatic image rebuild

This happens when the image is rebuilt because one of its dependencies got
updated (eg: a `libopenssl` issue).

Behind the scenes:

 * [developers] Nothing is done.
 * [QA] Test the fresh image.
 * Image is published to the SUSE registry.
 * `<image-name>-metadata` package is published to the CaaSP Updates channel.

Considerations about the images:

 * The old image can still be pulled via its long tag.
 * Using the short or intermediate tag will pull the new image.
 * The new image can also be pulled using its long tag.

What happens before `transactional-update` is run:

 * If the pod running the container image dies:
   * kubelet will automatically restart the pod.
   * The "always pull" policy will force the container engine to look for a
     newer version of the old image.
   * The container engine pulls the new image because it has the same
     intermediate tag as the old one.
   * The pod starts using the new image as soon as the image pull is done.
 * If the admin node is rebooted: same outcome as the scenario above.

> **Warning:** this behaviour could lead to a potential service disruption. When
> the container dies it won't immediately be restarted; this will happen only
> after the new image is pulled from the registry. Pulling an image takes an
> amount of time that depends on the size of the image and the network bandwidth.
>
> The impact of this issue can be reduced by configuring all the nodes of the
> cluster to use a local mirror of the SUSE registry.

When `transactional-update` is run:

 * `transactional-update` notices there are package updates
   available on all the nodes of the cluster.
 * `transactional-update` creates a new snapshot with the updated packages
   installed and sets the "has updates" grain on each node of the cluster.
 * Velum notices the value of the grain on the admin node: the
   *"update admin node"* button is shown.

What happens if the old container pod dies before the admin node is rebooted:

 * kubelet will automatically restart the pod.
 * The "always pull" policy will force the container engine to look for a
   newer version of the old image.
 * The container engine pulls the new image because it has the same
   intermediate tag as the old one.
 * The pod starts using the new image as soon as the image pull is done.

**Note well:** this is the same issue illustrated above under *"what happens
before `transactional-update` is run if..."*. The same mitigation technique is
going to help with this scenario as well.

When the node is rebooted:

 * The "always pull" policy will force the container engine to look for a
   newer version of the old image.
 * The container engine pulls the new image because it has the same
   intermediate tag as the old one.
 * The pod starts using the new image as soon as the image pull is done.

The whole node is rebooting, so no unexpected service disruption happens.

## Upgrade scenarios for kubernetes-related container images

This section will describe all the possible upgrade scenarios involving
container images that are running on the kubernetes master and worker nodes.

There are three types of container workloads to consider:

 * Containers deployed as native Kubernetes deployment objects: these include
   services like dex, kubedns, tiller and many others.
 * Containers deployed as native Kubernetes daemon sets: this - at the time of
   writing - includes only the cilium and flannel services.
 * Containers deployed via manifest files: these are files managed by salt;
   at the time of writing only HAProxy is deployed in that way on all
   the nodes of the cluster.

Upgrading kubernetes daemon sets and static pods poses the same challenges. On
the other hand, the update of kubernetes deployment workloads has different
implications. Because of that, the next sections will focus on these two kinds
of upgrades.

For each one of them we will illustrate what is going to happen and any
limitations/risks.

### Upgrade scenarios: kubernetes add-ons

All these workloads are deployed using native *"kubernetes deployment"* objects.

#### New image is released

This happens with one of the following cases:

 * The core application version changes: a new set of tags is created (short,
   intermediate, long).
 * The core application version doesn't change, but a fix is made to the image
   (eg: its entrypoint is fixed): the short tag doesn't change, the
   intermediate and long tags change.

Behind the scenes:

 * [developers] Update the KIWI definition of the image.
 * [developers] Update the `kubernetes-salt` package: the manifest file that
   uses the image has to reference the new intermediate tag (which is constant
   and always predictable).
 * [QA] Test the fresh image. The stability and predictability of the
   intermediate tags make their life easier and reduce the chances of
   mistakes.
 * Image is published to the SUSE registry.
 * `kubernetes-salt` and `<image-name>-metadata` packages are published to the
   CaaSP Updates channel.

Considerations about the images:

 * The old image can still be pulled via its intermediate and long tags.
 * Using the short tag will pull the new image.
 * The new image can also be pulled using its intermediate and long tags.
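
As with the admin node manifests, the deployment objects shipped via
`kubernetes-salt` reference the intermediate tag and force the always-pull
policy. A minimal sketch follows; the add-on name, registry path, tag value and
replica count are assumptions used only for illustration.

```yaml
# Minimal sketch of an add-on deployment shipped via kubernetes-salt.
# Names, registry path, tag and replica count are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dex
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dex
  template:
    metadata:
      labels:
        app: dex
    spec:
      containers:
        - name: dex
          # Intermediate tag, hard-coded inside the kubernetes-salt manifest.
          image: registry.suse.com/caasp/v4/caasp-dex:2.7.1-rev1
          # The "always pull" policy.
          imagePullPolicy: Always
```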

What happens when `transactional-update` is run:

 * `transactional-update` notices there are package updates
   available on all the nodes of the cluster.
 * `transactional-update` creates a new snapshot with the updated packages
   installed and sets the "has updates" grain on each node of the cluster.
 * Velum notices the value of the grain on the admin node: the
   *"update admin node"* button is shown.
 * Once the admin node is rebooted, Velum allows the user to press the
   "update nodes" button.

When the update cluster orchestration is done:

 * salt gives Kubernetes a new deployment definition.
 * The desired state is different from the current one: the image tag
   referenced by the deployment object is a different one.
 * Kubernetes will roll out the update. This is not going to cause downtime.

What happens if the old container pod dies before the "update cluster"
orchestration is run:

 * Kubernetes starts a new pod somewhere in the cluster.
 * The deployment description didn't change yet: the intermediate tag being
   referenced is still the old one.
 * A new pod is started using the old container image.


#### Automatic rebuild

This happens when the image is rebuilt because one of its dependencies got
updated (eg: a `libopenssl` issue).

Behind the scenes:

 * [developers] Nothing is done.
 * [QA] Test the fresh image.
 * Image is published to the SUSE registry.
 * `<image-name>-metadata` package is published to the CaaSP Updates channel.

Considerations about the images:

 * The old image can still be pulled via its long tag.
 * Using the short or intermediate tag will pull the new image.
 * The new image can also be pulled using its long tag.

What happens when `transactional-update` is run:

 * `transactional-update` notices there are package updates
   available on all the nodes of the cluster.
 * `transactional-update` creates a new snapshot with the updated packages
   installed and sets the "has updates" grain on each node of the cluster.
 * Velum notices the value of the grain on the admin node: the
   *"update admin node"* button is shown.
 * Once the admin node is rebooted, Velum allows the user to press the
   "update nodes" button.

What happens if the old container pod dies before the "update cluster"
orchestration is run:

 * Kubernetes starts a new pod somewhere in the cluster.
 * The "always pull" policy forces the container engine to check if a new
   version of the image exists.
 * The new image is pulled from the registry because the intermediate tag
   didn't change.
 * The pod is started as soon as the image pull is completed.

No downtime is going to be faced by the user even if the container engine has
to pull the image from scratch. That happens because all our deployments are
made of more than one replica; hence the requests are automatically routed to
the other pods providing the service.

When the update cluster orchestration is done:

 * The deployment definition will **not** change: the intermediate tag didn't
   change.
 * The nodes of the cluster are going to be rebooted due to the update of the
   `<image-name>-metadata` package.
 * The reboot of the nodes will lead kubernetes to schedule the pod somewhere
   else on the cluster.
 * The "always pull" policy will cause the new pod to start using the new image.

The nodes are rebooted in a controlled fashion by our salt orchestration. Nodes
are drained, hence no downtime is going to be experienced by the user.
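
For reference, the zero-downtime behaviour described in this section relies on
the deployments running more than one replica and on Kubernetes replacing pods
gradually during a rollout. A sketch of the relevant deployment fields follows;
the exact values shipped by `kubernetes-salt` are an assumption here.

```yaml
# Sketch of the deployment spec fields the zero-downtime claims rely on.
# Values are assumptions, not necessarily what kubernetes-salt ships.
spec:
  replicas: 3                # requests keep being served by the other replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # at most one replica is taken down at a time
      maxSurge: 1            # one extra pod may be created during the rollout
```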

### Upgrade scenarios: static pods and daemon sets

This section focuses on the update of the static pods (eg: HAProxy) and
kubernetes daemon sets (eg: flannel and cilium).

#### New image is released

This happens with one of the following cases:

 * The core application version changes: a new set of tags is created (short,
   intermediate, long).
 * The core application version doesn't change, but a fix is made to the image
   (eg: its entrypoint is fixed): the short tag doesn't change, the
   intermediate and long tags change.

Behind the scenes:

 * [developers] Update the KIWI definition of the image.
 * [developers] Update the `kubernetes-salt` package: the manifest file that
   uses the image has to reference the new intermediate tag (which is constant
   and always predictable).
 * [QA] Test the fresh image. The stability and predictability of the
   intermediate tags make their life easier and reduce the chances of
   mistakes.
 * Image is published to the SUSE registry.
 * `kubernetes-salt` and `<image-name>-metadata` packages are published to the
   CaaSP Updates channel.

Considerations about the images:

 * The old image can still be pulled via its intermediate and long tags.
 * Using the short tag will pull the new image.
 * The new image can also be pulled using its intermediate and long tags.

What happens when `transactional-update` is run:

 * `transactional-update` notices there are package updates
   available on all the nodes of the cluster.
 * `transactional-update` creates a new snapshot with the updated packages
   installed and sets the "has updates" grain on each node of the cluster.
 * Velum notices the value of the grain on the admin node: the
   *"update admin node"* button is shown.
 * Once the admin node is rebooted, Velum allows the user to press the
   "update nodes" button.

**TODO:** answer these questions:

 * How does the update work? When is the daemon set updated? If we update
   the daemon set on a non-drained node we will immediately pull the new
   container image. That could result in downtime for nodes using flannel
   (the whole node network stack is not working while the flannel pod is not
   running).
 * How is a daemon set update rolled out? Does k8s pull the new image before
   killing the running pod? I doubt it...
 * When are we updating the static pods? Are we doing that after the node is
   drained?

The whole point is: we have to drain nodes before touching daemon sets (esp. CNI)
and static pods. That could be a problem for v3 as well.

#### Automatic rebuild

This happens when the image is rebuilt because one of its dependencies got
updated (eg: a `libopenssl` issue).

Behind the scenes:

 * [developers] Nothing is done.
 * [QA] Test the fresh image.
 * Image is published to the SUSE registry.
 * `<image-name>-metadata` package is published to the CaaSP Updates channel.

Considerations about the images:

 * The old image can still be pulled via its long tag.
 * Using the short or intermediate tag will pull the new image.
 * The new image can also be pulled using its long tag.

What happens when `transactional-update` is run:

 * `transactional-update` notices there are package updates
   available on all the nodes of the cluster.
 * `transactional-update` creates a new snapshot with the updated packages
   installed and sets the "has updates" grain on each node of the cluster.
 * Velum notices the value of the grain on the admin node: the
   *"update admin node"* button is shown.
 * Once the admin node is rebooted, Velum allows the user to press the
   "update nodes" button.


If a running container pod dies before the "update cluster" orchestration is run:

 * kubelet automatically restarts the pod on the same node.
 * The "always pull" policy forces the container engine to check if a new
   version of the image exists.
 * The new image is pulled from the registry because the intermediate tag
   didn't change.
 * The pod is started as soon as the image pull is completed.

> **Warning:** this can lead to a temporary downtime of the node. The length of
> the downtime depends on the size of the image and the network bandwidth.

This can be mitigated by using an on-premise mirror of the SUSE registry.

After the update cluster orchestration is done:

 * The daemon set and static pod definitions will **not** change: the
   intermediate tag didn't change.
 * The nodes of the cluster are going to be rebooted due to the update of
   the `<image-name>-metadata` package.
 * At boot time the always-pull policy will cause the new pod to start using
   the new image.

The nodes are rebooted in a controlled fashion by our salt orchestration.
Nodes are drained, hence no downtime is going to be experienced by the user.


### Dependencies

> Highlight how the change may affect the rest of the product (new components,
> modifications in other areas), or other teams/products.

### Concerns and Unresolved Questions

> List any concerns, unknowns, and generally unresolved questions etc.

## Alternatives

> List any alternatives considered, and the reasons for choosing this option
> over them.

## Revision History

| Date       | Comment       |
|:-----------|:--------------|
| YYYY-MM-DD | Initial Draft |