consider making config changes truly transactional on RHCOS #1190

Closed
cgwalters opened this issue Oct 18, 2019 · 22 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@cgwalters
Member

With ostree, each deployment (bootable target) has its own copy of /etc.

Today, the MCD writes into the current /etc. This has weird side effects; for example, it means that we may be affecting running static pods for kubelet. We may rewrite the pull secret.

It also means we don't have rollback at the ostree level.

It'd be fairly easy to change the MCD to (on RHCOS) create a new deployment for pure config changes, and write the new changes to /etc there - leaving the booted system untouched.

This would mean that config changes would be fully transactional and offline, the same way OS updates are.
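
A rough way to picture the proposal is the toy model below. It is only a sketch: the directory layout, the `stage_config_change` helper, and the file names are hypothetical, and a real implementation would go through ostree rather than copying directories by hand. What it illustrates is that the new config lands in a separate copy of /etc, so the booted copy never changes.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Mapping, Tuple


def stage_config_change(deployments: Path, booted: str, new: str,
                        files: Mapping[str, str]) -> Path:
    """Copy the booted deployment's etc/ into a new deployment and write
    the changed files there. The booted deployment is never modified."""
    new_etc = deployments / new / "etc"
    shutil.copytree(deployments / booted / "etc", new_etc)
    for relative_path, content in files.items():
        target = new_etc / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
    return new_etc


def demo() -> Tuple[str, str]:
    """Stage a static pod manifest change; return (booted, staged) content."""
    with tempfile.TemporaryDirectory() as tmp:
        deployments = Path(tmp)
        manifest = Path("kubernetes/manifests/etcd.yaml")  # hypothetical path
        booted_file = deployments / "booted" / "etc" / manifest
        booted_file.parent.mkdir(parents=True)
        booted_file.write_text("old manifest\n")

        new_etc = stage_config_change(
            deployments, "booted", "staged", {str(manifest): "new manifest\n"})
        return booted_file.read_text(), (new_etc / manifest).read_text()


if __name__ == "__main__":
    print(demo())
```

Rolling back in this model just means continuing to boot the old deployment, whose /etc was never touched.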

@cgwalters
Member Author

Another way to look at this is, it'd enforce that config changes require a reboot.

We could drop the hacks around using journald for which config we're in, because /etc becomes transactional. (And much closer to being immutable - see also coreos/rpm-ostree#702 )

@cgwalters
Member Author

Also a prerequisite is moving the pull secret from /var into /etc - and in general, all MachineConfig changes.

@mrunalp
Member

mrunalp commented Oct 20, 2019

👍 IIUC, this would prevent new static pod manifests from taking effect until we reboot.

@sinnykumari
Contributor

This proposal looks interesting!
One question though - will this apply on firstboot of node as well? I ask because today the MCD available on the host reads the MachineConfig and writes /etc/ignition-machine-config-encapsulated.json during firstboot, which the MCD later uses to apply kargs in firstboot.

@cgwalters
Member Author

> One question though - will this apply on firstboot of node as well?

I don't see a blocker to doing this on firstboot too.

@sinnykumari
Contributor

> > One question though - will this apply on firstboot of node as well?
>
> I don't see a blocker to doing this on firstboot too.

Looks like I might have misunderstood a little bit what making /etc/ changes transactional means.

What will happen with this proposal in place during firstboot? Most likely during firstboot we will have config changes (like a new file to be added, e.g. /etc/ignition-machine-config-encapsulated.json) and also a new machine-os-content. Today, we create a new deployment for machine-os-content changes, which gets applied after reboot. What will happen to files like /etc/ignition-machine-config-encapsulated.json, will it be written to current deployment or in new deployment?

@cgwalters
Member Author

The motivation for filing this issue was the etcd upgrade bug.

However, we're also now seeing this for kubelet config.

And see also a related podman config bug.

One comment I had on the podman bug was:

> The problem is that doing the naive thing and always doing an OS update before config changes would force double reboots for every upgrade.
> We could try to detect "major version" upgrades and force doing the OS update first there...wouldn't be terribly hard, but would add noticeable latency to upgrades.

@cgwalters
Member Author

> What will happen with this proposal in place during firstboot?
> ...
> What will happen to files like /etc/ignition-machine-config-encapsulated.json, will it be written to current deployment or in new deployment?

That's a good question. Note that today, it's Ignition which writes the initial files, not the MCD. So unless we changed how Ignition works too, the answer would be that the files are written in the current root.

And in fact, we need this to happen because we need the pull secret /var/lib/kubelet/config.json in the current root so that podman can pull the oscontainer. (In general, we need other configs for that too, such as any image content source policy, certificates, proxy config, etc.)

I think it's OK if we only do this "transactional /etc" for MCD upgrades, because that's where we're actually doing an upgrade, and we have a workload running on the cluster.

@cgwalters
Member Author

> Also a prerequisite is moving the pull secret from /var into /etc - and in general, all MachineConfig changes.

Eh, thinking about this more it's not a hard dependency - we can just make changes to /etc transactional, and say that things outside of it aren't. In practice, the pull secret doesn't change often or in an incompatible way. What we need is transactional changes for the static pods etc. which should be in /etc.

@sinnykumari
Contributor

> That's a good question. Note that today, it's Ignition which writes the initial files, not the MCD. So unless we changed how Ignition works too, the answer would be that the files are written in the current root.

Ah right, makes sense to me now. Thanks for the explanation.

wking added a commit to wking/cincinnati-graph-data that referenced this issue Feb 20, 2020
The machine-config operator had a bug where MachineConfig entries led
the machine-config daemon (MCD) to lay down a storage.conf that
exactly matched the content installed by the containers-common RPM.
On update, the RHCOS machine pivots to a new OSTree image (defined in
the machine-os-content image referenced from the release image).
Seeing storage.conf content that matched the old OSTree image,
libostree replaced storage.conf with the version defined in the new
OSTree image [1].  Then, when the MCD comes back up post-pivot, it
sees the divergent storage.conf content and freaks out with logs like
[2]:

  E1210 16:15:51.105286   11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:

and the machine-config operator goes Degraded=True with
RequiredPoolsFailed "nodes are reporting degraded status on sync" [3].
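
The failure mode described above can be modeled per file. This is a simplified sketch of the /etc merge decision, not libostree's actual code: the `merge_etc_file` helper and the sample file contents are made up for illustration. A file in /etc that is byte-identical to the old image's default looks unmodified, so it is replaced by the new image's default; a file that differs in any way is treated as a local change and kept.

```python
from typing import Optional


def merge_etc_file(old_default: Optional[bytes],
                   new_default: Optional[bytes],
                   current: Optional[bytes]) -> Optional[bytes]:
    """Return the content one /etc file ends up with after a pivot.

    None stands for "file absent". This models only the decision described
    in the text above; the real merge handles more cases.
    """
    if current == old_default:
        # Indistinguishable from the shipped default: take the new default.
        return new_default
    # Locally modified: the local version survives the pivot.
    return current


# Hypothetical contents, for illustration only.
OLD_DEFAULT = b'[storage]\ndriver = "overlay"\n'
NEW_DEFAULT = b'[storage]\ndriver = "overlay"\n# changed in the new image\n'
ANNOTATED = b"# written by the MCD\n" + OLD_DEFAULT

if __name__ == "__main__":
    # MCD wrote exactly the old default: the pivot swaps the file out.
    print(merge_etc_file(OLD_DEFAULT, NEW_DEFAULT, OLD_DEFAULT) == NEW_DEFAULT)
    # MCD wrote content that diverges by one comment: the file is kept.
    print(merge_etc_file(OLD_DEFAULT, NEW_DEFAULT, ANNOTATED) == ANNOTATED)
```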

The narrow machine-config fix was to annotate the storage.conf that it
writes, so libostree doesn't touch the file on pivot [4].  This
addresses the storage.conf case, but leaves the MCD vulnerable to
other instances of "MCD writes exactly the OSTree contents to $FILE
and expects it to remain untouched during an OSTree pivot that bumps
the file".  I'm not aware of a generic fix at the moment, although [5]
might be related.  You can guard a cluster against the narrow bug by
setting a MachineConfig [6] or higher level object such as a
ContainerRuntimeConfig [7] that will cause the MCD to write a
storage.conf that diverges (even just by a comment or whitespace) from
the OSTree original.

Tracking the narrow fix through the various z streams:

The 4.1 machine-config bug was introduced in d2c44d7 [8], which landed
before 4.1.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.0-rc.0 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-server                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
  $ git --no-pager log --oneline --first-parent de9998eb37 | grep d2c44d7
  d2c44d7c Merge pull request openshift#330 from umohnani8/runtime

The 4.1 machine-config fix was [9], landed in 1301934 [10], which is
new in 4.1.34:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.34-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-server                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.31-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-server                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
  $ git --no-pager log --oneline --first-parent -2 f56d736e74a
  f56d736e (origin/release-4.1) Merge pull request openshift#1147 from openshift-cherrypick-robot/cherry-pick-1114-to-release-4.1
  1301934a Merge pull request openshift#1382 from vrutkovs/4.1-containers-conf-generated

The 4.2 machine-config fix was [2], landed in bd358bb [11], which is new
in 4.2.18:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       31fed93186c9f84708f5cdfd0227ffe4f79b31cd
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       9366460085b2a24d825380759f554769ec5ab4f9
  $ git --no-pager log --oneline --first-parent -2 9366460085
  93664600 Merge pull request openshift#1362 from rphillips/fixes/1787581_4.2
  bd358bb7 Merge pull request openshift#1323 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.2

The 4.3 machine-config fix was [12], landed in 9fd53bd [13], which
landed early enough for 4.3.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       23a6e6fb37e73501bc3216183ef5e6ebb15efc7a
  $ git --no-pager log --oneline --first-parent -8 23a6e6fb37
  23a6e6fb Merge pull request openshift#1348 from openshift-cherrypick-robot/cherry-pick-1285-to-release-4.3
  80c8aed7 Merge pull request openshift#1343 from retroflexer/cherry-pick-backup-restore-kube-static-resources
  269990a3 Merge pull request openshift#1344 from openshift-cherrypick-robot/cherry-pick-1296-to-release-4.3
  fd3ca395 Merge pull request openshift#1338 from runcom/fix-go-mod
  ba304dbb Merge pull request openshift#1333 from openshift-cherrypick-robot/cherry-pick-1278-to-release-4.3
  787f3fa9 Merge pull request openshift#1332 from runcom/reserved-cpus-4.3
  2b85d6ba Merge pull request openshift#1329 from openshift-cherrypick-robot/cherry-pick-1314-to-release-4.3
  9fd53bd5 Merge pull request openshift#1322 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.3

The 4.4 machine-config fix was [3] which has landed before any 4.4 RCs
have been cut.  Even in 4.4, the generated note was the first content
touch to this template:

  $ git --no-pager log --oneline --follow origin/release-4.4 -- templates/common/_base/files/container-storage.yaml
  46c4e27a (origin/pr/1320) templates/container-storage: Add a "this is generated" note
  47a6321c templates: Move container-storage.yaml into common/
  74ae3b31 (origin/pr/330) Add ContainerRuntime CRD and Controller

(47a6321c was a pure rename).

So the MCD has been annotating storage.conf since 4.1.34, 4.2.18, and
all 4.3 and later releases.  When has the RPM-installed storage.conf
changed?  Figuring this part out is a bit awkward, because we need to
drill down machine-os-content -> RHCOS -> RPM -> file.  For example,
from 4.2.16 -> 4.2.18 [14]:

  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64) | jq -r .config.config.Labels.version
  42.81.20200114.0
  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64) | jq -r .config.config.Labels.version
  42.81.20200203.1
  $ ./differ.py --first-endpoint art --first-version 42.81.20200114.0 --second-endpoint art --second-version 42.81.20200203.1 | jq -r '.diff | keys | sort[]'
  cri-o
  ignition
  libarchive
  machine-config-daemon
  openshift-clients
  openshift-hyperkube
  sqlite-libs

storage.conf is managed by the containers-common RPM, so no change
from 4.2.16 to 4.2.18, and that update will safely pull in the fixed
MCD without a surprising pivot change.  Here are our changes to the
RPM across the various z streams:

  $ for OCP in 4.1.1 4.1.23 4.1.24 4.1.31-x86_64 4.1.34-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.1/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  410.8.20190606.0 0.1.32 4.1.1
  410.8.20191030.0 0.1.32 4.1.23
  410.81.20191112.2 0.1.37 4.1.24
  410.81.20200114.0 0.1.37 4.1.31-x86_64
  410.81.20200204.1 0.1.40 4.1.34-x86_64
  $ for OCP in 4.2.0-rc.0 4.2.2 4.2.4 4.2.16-x86_64 4.2.18-x86_64 4.2.19-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.2/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  42.80.20190930.1 0.1.32 4.2.0-rc.0
  42.80.20191022.0 0.1.32 4.2.2
  42.81.20191107.0 0.1.37 4.2.4
  42.81.20200114.0 0.1.37 4.2.16-x86_64
  42.81.20200203.1 0.1.37 4.2.18-x86_64
  42.81.20200210.0 0.1.40 4.2.19-x86_64
  $ for OCP in 4.3.0-rc.0-x86_64 4.3.3-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  43.81.202001072253.0 0.1.40 4.3.0-rc.0-x86_64
  43.81.202002170853.0 0.1.40 4.3.3-x86_64
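
For readers without jq, the filter used in the loops above can be mirrored in a few lines of Python. This is a sketch: `package_version` is a hypothetical helper, and `SAMPLE_COMMITMETA` is a made-up, trimmed-down stand-in for a real commitmeta.json. Only the package name and the 0.1.37 version come from the output above; the other fields are placeholders. Like the jq expression, it matches index 0 of each pkglist entry against the package name and returns index 2.

```python
from typing import Any, Mapping, Optional

# Hypothetical sample data; see the note above about which fields are real.
SAMPLE_COMMITMETA = {
    "rpmostree.rpmdb.pkglist": [
        ["example-package", "0", "1.0", "1.el8", "x86_64"],
        ["containers-common", "1", "0.1.37", "1.el8", "x86_64"],
    ]
}


def package_version(commitmeta: Mapping[str, Any], name: str) -> Optional[str]:
    """Equivalent of:
    .["rpmostree.rpmdb.pkglist"][] | select(.[0] == NAME) | .[2]
    Returns None when the package is not listed."""
    for entry in commitmeta.get("rpmostree.rpmdb.pkglist", []):
        if entry[0] == name:
            return entry[2]
    return None


if __name__ == "__main__":
    print(package_version(SAMPLE_COMMITMETA, "containers-common"))
```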

Fetching a source RPM for containers-common, e.g. from [15,16] shows
the source packages coming from skopeo.  Checking [17]:

  $ git --no-pager log --follow --oneline --stat=200 -M50% -- vendor/github.com/containers/storage/storage.conf
  afaa9e7f Bump github.com/containers/storage from 1.15.1 to 1.15.2
   vendor/github.com/containers/storage/storage.conf | 3 ---
   1 file changed, 3 deletions(-)
  39ff039b Image encryption/decryption support in skopeo
   vendor/github.com/containers/storage/storage.conf | 44 +++++++++++++++++++++++++-------------------
   1 file changed, 25 insertions(+), 19 deletions(-)
  05ae513b Bump github.com/containers/buildah from 1.8.4 to 1.11.4
   vendor/github.com/containers/storage/storage.conf | 7 -------
   1 file changed, 7 deletions(-)
  700b3102 update github.com/containers/{image,storage}
   vendor/github.com/containers/storage/storage.conf | 8 ++++++++
   1 file changed, 8 insertions(+)
  033b2902 migrate to go modules
   vendor/github.com/containers/storage/storage.conf | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 130 insertions(+)
  $ git --no-pager log --follow --oneline --stat=200 -M50% 033b2902^ -- contrib/storage.conf
  fe259105 add storage.conf and manpage in contrib/
   contrib/storage.conf | 28 ++++++++++++++++++++++++++++
   1 file changed, 28 insertions(+)
  $ for HASH in fe259105 033b2902 700b3102 05ae513b 39ff039b afaa9e7f; do git describe --contains "${HASH}"; done
  v0.1.29~3^2
  v0.1.38~14^2~2
  v0.1.39~1
  v0.1.41~25^2
  v0.1.41~21^2
  v0.1.41~12^2

So changes may have been made in 0.1.29 (when the file landed for the
first time, likely from wherever we store post-Git patches), and were
likely made in 0.1.38, 0.1.39, and 0.1.41.
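
The step from the `git describe --contains` output to that list of versions is mechanical: everything before the first `~` or `^` names the tag that contains the commit. A small sketch, with `containing_tag` as a hypothetical helper name and the inputs copied from the output above:

```python
import re
from typing import List

# Output of the `git describe --contains` loop above, in order.
DESCRIBE_OUTPUT = [
    "v0.1.29~3^2",
    "v0.1.38~14^2~2",
    "v0.1.39~1",
    "v0.1.41~25^2",
    "v0.1.41~21^2",
    "v0.1.41~12^2",
]


def containing_tag(describe: str) -> str:
    """Strip the ~N / ^N ancestry suffix, leaving only the tag name."""
    return re.split(r"[~^]", describe, maxsplit=1)[0]


def tags_with_changes(lines: List[str]) -> List[str]:
    """Tags containing at least one storage.conf commit, without repeats."""
    return list(dict.fromkeys(containing_tag(line) for line in lines))


if __name__ == "__main__":
    print(tags_with_changes(DESCRIBE_OUTPUT))
```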

Comparing with our machine-os-content, that means vulnerable
transitions are:

* 4.1.* -> 4.1.34, since 4.1.31 -> 4.1.34 takes containers-common from
  0.1.37 to 0.1.40, picking up the v0.1.38~14^2~2 and v0.1.39~1 bumps.
  There may be no safe way to get to 4.1.34.

* 4.1.* -> 4.2...  FIXME

* 4.2.16 and earlier -> 4.2.19, since 4.2.18 -> 4.2.19 takes
  containers-common from 0.1.37 to 0.1.40, picking up the
  v0.1.38~14^2~2 and v0.1.39~1 bumps.  4.2.16 and earlier -> 4.2.18 is
  fine, because there were no RPM-induced storage.conf bumps.  4.2.18
  -> 4.2.* is fine, because 4.2.18 has the patched machine-config
  source.

* 4.2.16 and earlier -> 4.3, since 4.2.18 -> 4.3 takes
  containers-common from 0.1.37 to 0.1.40, picking up the
  v0.1.38~14^2~2 and v0.1.39~1 bumps.  4.2.18 -> 4.3 is fine, because
  4.2.18 has the patched machine-config source.

* 4.3 -> 4.3 are fine, since they all have the patched machine-config
  source.
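
The reasoning behind this list can be written down as a small check. It is a sketch under stated assumptions: the `Release` tuple and `edge_is_vulnerable` are hypothetical names, the release data is copied from the tables above, and the check takes the worst case in which an unpatched MCD wrote a storage.conf identical to the RPM's copy. An edge is flagged when the source MCD is unpatched and the containers-common bump crosses a version where storage.conf likely changed.

```python
from typing import NamedTuple, Tuple

Version = Tuple[int, ...]


def parse(version: str) -> Version:
    return tuple(int(part) for part in version.split("."))


# containers-common versions where storage.conf likely changed (see above).
STORAGE_CONF_BUMPS = [parse("0.1.38"), parse("0.1.39"), parse("0.1.41")]


class Release(NamedTuple):
    name: str
    mcd_patched: bool           # does the MCD annotate storage.conf?
    containers_common: Version  # RPM version in machine-os-content


RELEASES = {r.name: r for r in [
    Release("4.1.31", False, parse("0.1.37")),
    Release("4.1.34", True, parse("0.1.40")),
    Release("4.2.16", False, parse("0.1.37")),
    Release("4.2.18", True, parse("0.1.37")),
    Release("4.2.19", True, parse("0.1.40")),
    Release("4.3.0-rc.0", True, parse("0.1.40")),
]}


def edge_is_vulnerable(source: Release, target: Release) -> bool:
    """True if updating source -> target can trip the storage.conf bug."""
    if source.mcd_patched:
        return False  # annotated file diverges from the RPM copy already
    return any(source.containers_common < bump <= target.containers_common
               for bump in STORAGE_CONF_BUMPS)


if __name__ == "__main__":
    for src, dst in [("4.1.31", "4.1.34"), ("4.2.16", "4.2.18"),
                     ("4.2.16", "4.2.19"), ("4.2.18", "4.2.19"),
                     ("4.2.16", "4.3.0-rc.0")]:
        print(src, "->", dst, edge_is_vulnerable(RELEASES[src], RELEASES[dst]))
```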

So ideally this pull would block edges from 4.2.16 and earlier into
4.3.  But because blocked-edges requires explicit to, I've just added
the 4.3.0 blocker (other 4.3.z releases either already blocked 4.2.*
or only give 4.2.18+ as update sources).  I've also dropped 4.2.16
from the *-4.3 channels with a comment about this bug.  There
shouldn't be much pushback on pulling the edge, because users can
still move from 4.2 to 4.3 via 4.2.19 -> 4.3.2.

Also simplify the wording on the GCP bug 1793635, which remains
unfixed.

[1]: openshift/machine-config-operator#1320 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1782152#c5
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1781708#c0
[4]: https://github.com/openshift/machine-config-operator/pull/1320/files
[5]: openshift/machine-config-operator#1190
[6]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/MachineConfiguration.md
[7]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/ContainerRuntimeConfigDesign.md
[8]: openshift/machine-config-operator#330 (comment)
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1782153
[10]: openshift/machine-config-operator#1382 (comment)
[11]: openshift/machine-config-operator#1323 (comment)
[12]: https://bugzilla.redhat.com/show_bug.cgi?id=1782149
[13]: openshift/machine-config-operator#1322 (comment)
[14]: https://gitlab.cee.redhat.com/coretools/differ
      Internal link, sorry :/  But you can also browse the history at:
      https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.2&release=42.81.20200114.0 etc.
[15]: https://access.redhat.com/downloads/content/290/ver=4.2/rhel---8/4.2.0/x86_64/packages
[16]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8841/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
[17]: https://github.com/containers/skopeo/
wking added a commit to wking/cincinnati-graph-data that referenced this issue Feb 20, 2020
The machine-config operator had a bug where MachineConfig entries led
the machine-config daemon (MCD) to lay down a storage.conf that
exactly matched the content installed by the containers-common RPM.
On update, the RHCOS machine pivots to a new OSTree image (defined in
the machine-os-content image referenced from the release image).
Seeing storage.conf content that matched the old OSTree image,
libostree replaced storage.conf with the version defined in the new
OSTree image [1].  Then, when the MCD comes back up post-pivot, it
sees the divergent storage.conf content and freaks out with logs like
[2]:

  E1210 16:15:51.105286   11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:

and the machine-config operator goes Degraded=True with
RequiredPoolsFailed "nodes are reporting degraded status on sync" [3].

The narrow machine-config fix was to annotate the storage.conf that it
writes, so libostree doesn't touch the file on pivot [4].  This
addresses the storage.conf case, but leaves the MCD vulnerable to
other instances of "MCD writes exactly the OSTree contents to $FILE
and expects it to remain untouched during an OSTree pivot that bumps
the file".  I'm not aware of a generic fix at the moment, although [5]
might be related.  You can guard a cluster against the narrow bug by
setting a MachineConfig [6] or higher level object such as a
ContainerRuntimeConfig [7] that will cause the MCD to write a
storage.conf that diverges (even just by a comment or whitespace) from
the OSTree original.

Tracking the narrow fix through the various z streams:

The 4.1 machine-config bug was introduced in d2c44d7 [8], which landed
before 4.1.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.0-rc.0 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-server                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
  $ git --no-pager log --oneline --first-parent de9998eb37 | grep d2c44d7
  d2c44d7c Merge pull request openshift#330 from umohnani8/runtime

The 4.1 machine-config fix was [9], landed in 1301934 [10], which is
new in 4.1.34:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.34-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-server                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.31-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-server                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
  $ git --no-pager log --oneline --first-parent -2 f56d736e74a
  f56d736e (origin/release-4.1) Merge pull request openshift#1147 from openshift-cherrypick-robot/cherry-pick-1114-to-release-4.1
  1301934a Merge pull request openshift#1382 from vrutkovs/4.1-containers-conf-generated

The 4.2 machine-config fix was [2], landed in bd358bb [11], which is new
in 4.2.18:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       31fed93186c9f84708f5cdfd0227ffe4f79b31cd
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       9366460085b2a24d825380759f554769ec5ab4f9
  $ git --no-pager log --oneline --first-parent -2 9366460085
  93664600 Merge pull request openshift#1362 from rphillips/fixes/1787581_4.2
  bd358bb7 Merge pull request openshift#1323 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.2

The 4.3 machine-config fix was [12], landed in 9fd53bd [13], which
landed early enough for 4.3.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       23a6e6fb37e73501bc3216183ef5e6ebb15efc7a
  $ git --no-pager log --oneline --first-parent -8 23a6e6fb37
  23a6e6fb Merge pull request openshift#1348 from openshift-cherrypick-robot/cherry-pick-1285-to-release-4.3
  80c8aed7 Merge pull request openshift#1343 from retroflexer/cherry-pick-backup-restore-kube-static-resources
  269990a3 Merge pull request openshift#1344 from openshift-cherrypick-robot/cherry-pick-1296-to-release-4.3
  fd3ca395 Merge pull request openshift#1338 from runcom/fix-go-mod
  ba304dbb Merge pull request openshift#1333 from openshift-cherrypick-robot/cherry-pick-1278-to-release-4.3
  787f3fa9 Merge pull request openshift#1332 from runcom/reserved-cpus-4.3
  2b85d6ba Merge pull request openshift#1329 from openshift-cherrypick-robot/cherry-pick-1314-to-release-4.3
  9fd53bd5 Merge pull request openshift#1322 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.3

The 4.4 machine-config fix was [3] which has landed before any 4.4 RCs
have been cut.  Even in 4.4, the generated note was the first content
touch to this template:

  $ git --no-pager log --oneline --follow origin/release-4.4 -- templates/common/_base/files/container-storage.yaml
  46c4e27a (origin/pr/1320) templates/container-storage: Add a "this is generated" note
  47a6321c templates: Move container-storage.yaml into common/
  74ae3b31 (origin/pr/330) Add ContainerRuntime CRD and Controller

(47a6321c was a pure rename).

So the MCD has been annotating storage.conf since 4.1.34, 4.2.18, and
all 4.3 and later releases.  When has the RPM-installed storage.conf
changed?  Figuring this part out is a bit awkward, because we need to
drill down machine-os-content -> RHCOS -> RPM -> file.  For example,
from 4.2.16 -> 4.2.18 [14]:

  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64) | jq -r .config.config.Labels.version
  42.81.20200114.0
  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64) | jq -r .config.config.Labels.version
  42.81.20200203.1
  $ ./differ.py --first-endpoint art --first-version 42.81.20200114.0 --second-endpoint art --second-version 42.81.20200203.1 | jq -r '.diff | keys | sort[]'
  cri-o
  ignition
  libarchive
  machine-config-daemon
  openshift-clients
  openshift-hyperkube
  sqlite-libs

storage.conf is managed by the containers-common RPM, so no change
from 4.2.16 to 4.2.18, and that update will safely pull in the fixed
MCD without a surprising pivot change.  Here are our changes to the
RPM across the various z streams:

  $ for OCP in 4.1.1 4.1.23 4.1.24 4.1.31-x86_64 4.1.34-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.1/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  410.8.20190606.0 0.1.32 4.1.1
  410.8.20191030.0 0.1.32 4.1.23
  410.81.20191112.2 0.1.37 4.1.24
  410.81.20200114.0 0.1.37 4.1.31-x86_64
  410.81.20200204.1 0.1.40 4.1.34-x86_64
  $ for OCP in 4.2.0-rc.0 4.2.2 4.2.4 4.2.16-x86_64 4.2.18-x86_64 4.2.19-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.2/${RHCOS}/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  42.80.20190930.1 0.1.32 4.2.0-rc.0
  42.80.20191022.0 0.1.32 4.2.2
  42.81.20191107.0 0.1.37 4.2.4
  42.81.20200114.0 0.1.37 4.2.16-x86_64
  42.81.20200203.1 0.1.37 4.2.18-x86_64
  42.81.20200210.0 0.1.40 4.2.19-x86_64
  $ for OCP in 4.3.0-rc.0-x86_64 4.3.3-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | jq -r '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .[2]')"; echo "${RHCOS} ${COMMON} ${OCP}"; done
  43.81.202001072253.0 0.1.40 4.3.0-rc.0-x86_64
  43.81.202002170853.0 0.1.40 4.3.3-x86_64
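
The lookups above boil down to spotting where the containers-common
column changes between neighboring releases.  A minimal sketch of that
scan, fed with the 4.2 rows copied from the output above.  The bumps
helper name is hypothetical, and a version bump only flags a
candidate; it does not by itself prove that storage.conf changed:

```shell
# bumps: read "RHCOS containers-common OCP" rows on stdin and print each
# pair of adjacent releases whose containers-common version differs.
bumps() {
  awk 'NR > 1 && $2 != prev_rpm { print prev_ocp " -> " $3 ": " prev_rpm " -> " $2 }
       { prev_rpm = $2; prev_ocp = $3 }'
}

# Rows copied from the 4.2 loop output above.
bumps <<'EOF'
42.80.20190930.1 0.1.32 4.2.0-rc.0
42.80.20191022.0 0.1.32 4.2.2
42.81.20191107.0 0.1.37 4.2.4
42.81.20200114.0 0.1.37 4.2.16-x86_64
42.81.20200203.1 0.1.37 4.2.18-x86_64
42.81.20200210.0 0.1.40 4.2.19-x86_64
EOF
```

For the 4.2 rows this should flag 4.2.2 -> 4.2.4 and 4.2.18 -> 4.2.19,
which are the two places the RPM version moves in that stream.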

Fetching a source RPM for containers-common, e.g. from [15,16], shows
that the source package is skopeo.  Checking [17]:

  $ git --no-pager log --follow --oneline --stat=200 -M50% -- vendor/github.com/containers/storage/storage.conf
  afaa9e7f Bump github.com/containers/storage from 1.15.1 to 1.15.2
   vendor/github.com/containers/storage/storage.conf | 3 ---
   1 file changed, 3 deletions(-)
  39ff039b Image encryption/decryption support in skopeo
   vendor/github.com/containers/storage/storage.conf | 44 +++++++++++++++++++++++++-------------------
   1 file changed, 25 insertions(+), 19 deletions(-)
  05ae513b Bump github.com/containers/buildah from 1.8.4 to 1.11.4
   vendor/github.com/containers/storage/storage.conf | 7 -------
   1 file changed, 7 deletions(-)
  700b3102 update github.com/containers/{image,storage}
   vendor/github.com/containers/storage/storage.conf | 8 ++++++++
   1 file changed, 8 insertions(+)
  033b2902 migrate to go modules
   vendor/github.com/containers/storage/storage.conf | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 130 insertions(+)
  $ git --no-pager log --follow --oneline --stat=200 -M50% 033b2902^ -- contrib/storage.conf
  fe259105 add storage.conf and manpage in contrib/
   contrib/storage.conf | 28 ++++++++++++++++++++++++++++
   1 file changed, 28 insertions(+)
  $ for HASH in fe259105 033b2902 700b3102 05ae513b 39ff039b afaa9e7f; do git describe --contains "${HASH}"; done
  v0.1.29~3^2
  v0.1.38~14^2~2
  v0.1.39~1
  v0.1.41~25^2
  v0.1.41~21^2
  v0.1.41~12^2

So changes may have been made in 0.1.29 (when the file landed for the
first time, likely from wherever we store post-Git patches), and were
likely made in 0.1.38, 0.1.39, and 0.1.41.  However, the skopeo and
derivative containers-common RPMs may have had patched versions of the
file tracked in dist-git [18].  Comparing the dist-git 4.1 tip with
the machine-config template:

  $ git -C containers/skopeo remote -v | grep 'dist-git.*fetch'
  dist-git git://pkgs.devel.redhat.com/rpms/skopeo.git (fetch)
  $ git --no-pager -C containers/skopeo log --date=short --format='%ad %h %s' -2 dist-git/rhaos-4.1-rhel-8 -- storage.conf
  2018-07-18 3757b210 add statx to seccomp.json to containers-config add seccomp.json to containers-config
  2017-11-08 284f9024 Force storage.conf to default to overlay
  $ git --no-pager -C containers/skopeo grep '^Version:' 3757b210
  3757b210:skopeo.spec:Version: 0.1.31
  $ diff -U3 <(git -C containers/skopeo cat-file -p 3757b210:storage.conf) <(sed 's/^    //' openshift/machine-config-operator/templates/common/_base/files/container-storage.yaml)
  --- /dev/fd/63 2020-02-20 01:13:48.073704685 -0800
  +++ /dev/fd/62	 2020-02-20 01:13:48.073704685 -0800
  @@ -1,3 +1,10 @@
  +filesystem: "root"
  +mode: 0644
  +path: "/etc/containers/storage.conf"
  +contents:
  +  inline: |
  +# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
  +#
   # storage.conf is the configuration file for all tools
   # that share the containers/storage libraries
   # See man 5 containers-storage.conf for more information

So the machine-config master (5ed0aee72c) only differs from the old
0.1.31 RPM storage.conf by the "file is generated" marker.
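
The diff above also reports the YAML envelope (filesystem, mode, path,
contents) because sed only strips the indentation.  A minimal sketch
of a comparison helper that drops the envelope first, so that only the
file content is compared.  The template_body name is hypothetical,
and the sample template is trimmed to the lines visible in the diff:

```shell
# template_body: read a file template on stdin and print only the inline
# file content, with the four-space YAML indentation removed.
template_body() {
  # Drop everything up to and including the "inline: |" line, then
  # strip the indentation from the remaining content lines.
  sed -e '1,/inline: |/d' -e 's/^    //'
}

template_body <<'EOF'
filesystem: "root"
mode: 0644
path: "/etc/containers/storage.conf"
contents:
  inline: |
    # This file is generated by the Machine Config Operator's containerruntimeconfig controller.
    #
    # storage.conf is the configuration file for all tools
    # that share the containers/storage libraries
EOF
```

Feeding that output to diff against the RPM's storage.conf should leave
only the two marker lines as differences.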

There does not seem to be any 4.2-specific content in dist-git.
Presumably 4.2 uses the same rhaos-4.1-rhel-8 RPMs.  4.3 has some
changes:

  $ git --no-pager log --date=short --format='%ad %h %s' -2 --stat=80 dist-git/rhaos-4.3-rhel-8 -- storage.conf
  2019-12-09 4a131916 skopeo-0.1.40-2.el8

   storage.conf | 39 +++++++++++++++++++++++++++++----------
   1 file changed, 29 insertions(+), 10 deletions(-)
  2019-10-08 13a4ce10 skopeo-1:0.1.40-0.1.gitf72e39f

   storage.conf | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 114 insertions(+)

So it looks like we can ignore the dev skopeo repository, focus on the
dist-git skopeo repository, and say that before 0.1.40-2.el8 we had a
version of storage.conf in the RPMs that matched the unpatched
machine-config templates, and with 0.1.40-2.el8 and later the RPMs had
different content.  Sanity checking via [19,20]:

  $ diff -U3 <(rpm2cpio containers-common-0.1.32-5.git1715c90.el8.x86_64.rpm | cpio -i --to-stdout ./etc/containers/storage.conf 2>/dev/null) <(sed 's/^    //' templates/common/_base/files/container-storage.yaml)
  --- /dev/fd/63				2020-02-20 01:36:23.031918968 -0800
  +++ /dev/fd/62				2020-02-20 01:36:23.031918968 -0800
  @@ -1,3 +1,10 @@
  +filesystem: "root"
  +mode: 0644
  +path: "/etc/containers/storage.conf"
  +contents:
  +  inline: |
  +# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
  +#
   # storage.conf is the configuration file for all tools
   # that share the containers/storage libraries
   # See man 5 containers-storage.conf for more information

but I'm not clear on why the product pages are claiming
containers-common-0.1.32 for 4.1.34 [19,20], when the commitmeta.json
lookup above reports 0.1.40 for that release.
FIXME

Comparing with our machine-os-content, that means vulnerable
transitions are:

* 4.1.* -> 4.1.34, since 4.1.31 -> 4.1.34 takes containers-common from
  0.1.37 to 0.1.40, picking up the v0.1.38~14^2~2 and v0.1.39~1 bumps.
  There may be no safe way to get to 4.1.34.

* 4.1.* -> 4.2...  FIXME

* 4.2.16 and earlier -> 4.2.19, since 4.2.18 -> 4.2.19 takes
  containers-common from 0.1.37 to 0.1.40, picking up the
  v0.1.38~14^2~2 and v0.1.39~1 bumps.  4.2.16 and earlier -> 4.2.18 is
  fine, because there were no RPM-induced storage.conf bumps.  4.2.18
  -> 4.2.* is fine, because 4.2.18 has the patched machine-config
  source.

* 4.2.16 and earlier -> 4.3, since 4.2.18 -> 4.3 takes
  containers-common from 0.1.37 to 0.1.40, picking up the
  v0.1.38~14^2~2 and v0.1.39~1 bumps.  4.2.18 -> 4.3 is fine, because
  4.2.18 has the patched machine-config source.

* 4.3 -> 4.3 updates are fine, since all 4.3 releases have the patched
  machine-config source.

So ideally this pull would block edges from 4.2.16 and earlier into
4.3.  But because blocked-edges requires an explicit "to", I've just added
the 4.3.0 blocker (other 4.3.z releases either already blocked 4.2.*
or only give 4.2.18+ as update sources).  I've also dropped 4.2.16
from the *-4.3 channels with a comment about this bug.  There
shouldn't be much pushback on pulling the edge, because users can
still move from 4.2 to 4.3 via 4.2.19 -> 4.3.2.
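
The reasoning behind the vulnerable transitions can be written down as
a small predicate: an update is exposed when the source release's MCD
does not yet annotate storage.conf and the pivot changes the
RPM-shipped storage.conf.  This is only a sketch.  The function names
are hypothetical, and treating every 0.1.40 build as carrying the new
content is an assumption, since the table above records the version
but not the RPM release:

```shell
# conf_flavor VERSION: which storage.conf content the RPM is assumed to ship.
# Assumption: 0.1.40 builds carry the changed file, older versions the old one.
conf_flavor() {
  case "$1" in
    0.1.40*) echo new ;;
    *)       echo old ;;
  esac
}

# vulnerable SRC_PATCHED SRC_RPM DST_RPM
# SRC_PATCHED is "yes" when the source release's MCD annotates storage.conf.
# Exits 0 when the update is expected to hit the content mismatch.
vulnerable() {
  [ "$1" != yes ] && [ "$(conf_flavor "$2")" != "$(conf_flavor "$3")" ]
}

# check LABEL SRC_PATCHED SRC_RPM DST_RPM: print a verdict for one edge.
check() {
  if vulnerable "$2" "$3" "$4"; then
    echo "$1: vulnerable"
  else
    echo "$1: fine"
  fi
}

check '4.2.16 -> 4.2.18' no  0.1.37 0.1.37
check '4.2.16 -> 4.2.19' no  0.1.37 0.1.40
check '4.2.18 -> 4.2.19' yes 0.1.37 0.1.40
check '4.1.31 -> 4.1.34' no  0.1.37 0.1.40
```

With the versions from the tables above, this reports the second and
fourth edges as vulnerable and the other two as fine, matching the
list of transitions.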

Also simplify the wording on the GCP bug 1793635, which remains
unfixed.

[1]: openshift/machine-config-operator#1320 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1782152#c5
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1781708#c0
[4]: https://github.com/openshift/machine-config-operator/pull/1320/files
[5]: openshift/machine-config-operator#1190
[6]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/MachineConfiguration.md
[7]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/ContainerRuntimeConfigDesign.md
[8]: openshift/machine-config-operator#330 (comment)
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1782153
[10]: openshift/machine-config-operator#1382 (comment)
[11]: openshift/machine-config-operator#1323 (comment)
[12]: https://bugzilla.redhat.com/show_bug.cgi?id=1782149
[13]: openshift/machine-config-operator#1322 (comment)
[14]: https://gitlab.cee.redhat.com/coretools/differ
      Internal link, sorry :/  But you can also browse the history at:
      https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.2&release=42.81.20200114.0 etc.
[15]: https://access.redhat.com/downloads/content/290/ver=4.2/rhel---8/4.2.0/x86_64/packages
[16]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8841/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
[17]: https://github.com/containers/skopeo/
[18]: http://pkgs.devel.redhat.com/cgit/rpms/skopeo/
[19]: https://access.redhat.com/downloads/content/290/ver=4.1/rhel---8/4.1.34/x86_64/packages
[20]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8384/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
wking added a commit to wking/cincinnati-graph-data that referenced this issue Feb 20, 2020
wking added a commit to wking/cincinnati-graph-data that referenced this issue Feb 21, 2020
The machine-config operator had a bug where MachineConfig entries led
the machine-config daemon (MCD) to lay down a storage.conf that
exactly matched the content installed by the containers-common RPM.
On update, the RHCOS machine pivots to a new OSTree image (defined in
the machine-os-content image referenced from the release image).
Seeing storage.conf content that matched the old OSTree image,
libostree replaced storage.conf with the version defined in the new
OSTree image [1].  Then, when the MCD comes back up post-pivot, it
sees the divergent storage.conf content and freaks out with logs like
[2]:

  E1210 16:15:51.105286   11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:

and the machine-config operator goes Degraded=True with
RequiredPoolsFailed "nodes are reporting degraded status on sync" [3].

The narrow machine-config fix was to annotate the storage.conf that the
MCD writes, so that libostree doesn't touch the file on pivot [4].  This
addresses the storage.conf case, but leaves the MCD vulnerable to
other instances of "MCD writes exactly the OSTree contents to $FILE
and expects it to remain untouched during an OSTree pivot that bumps
the file".  I'm not aware of a generic fix at the moment, although [5]
might be related.  You can guard a cluster against the narrow bug by
setting a MachineConfig [6] or higher level object such as a
ContainerRuntimeConfig [7] that will cause the MCD to write a
storage.conf that diverges (even just by a comment or whitespace) from
the OSTree original.
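
To make the failure mode concrete, here is a toy model of the per-file
decision described above: a file in /etc that still matches the old
image's default is replaced by the new image's default, while a file
that diverges is kept.  This is a sketch of the rule as described
here, not libostree's actual code.  The merge_etc name and the
storage.conf contents are made up for illustration:

```shell
# merge_etc OLD_DEFAULT CURRENT NEW_DEFAULT
# Print the path of the file whose content survives the pivot.
merge_etc() {
  if cmp -s "$1" "$2"; then
    echo "$3"   # current /etc copy matches the old default: take the new default
  else
    echo "$2"   # current /etc copy was modified locally: keep it
  fi
}

tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# Toy defaults shipped by the old and new OSTree images.
printf '# storage.conf\ndriver = "overlay"\n' > "$tmp/old-default"
printf '# storage.conf\ndriver = "overlay"\n# new option\n' > "$tmp/new-default"

# Unpatched MCD: writes exactly the old default.
cp "$tmp/old-default" "$tmp/mcd-unpatched"

# Patched MCD: same content behind a "generated" marker.
{
  printf "# This file is generated by the Machine Config Operator's containerruntimeconfig controller.\n#\n"
  cat "$tmp/old-default"
} > "$tmp/mcd-patched"

basename "$(merge_etc "$tmp/old-default" "$tmp/mcd-unpatched" "$tmp/new-default")"
basename "$(merge_etc "$tmp/old-default" "$tmp/mcd-patched" "$tmp/new-default")"
```

The first call selects new-default, which is the content the MCD then
rejects as a mismatch.  The second selects mcd-patched, which is why
the marker (or any other divergence, such as a comment) is enough to
protect the file.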

Tracking the narrow fix through the various z streams:

The 4.1 machine-config bug was introduced in d2c44d7 [8], which landed
before 4.1.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.0-rc.0 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    machine-config-server                         https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       de9998eb37e90b3ee2fcfdbb3eda7ba26870ab6e
  $ git --no-pager log --oneline --first-parent de9998eb37 | grep d2c44d7
  d2c44d7c Merge pull request openshift#330 from umohnani8/runtime

The 4.1 machine-config fix was [9], landed in 1301934 [10], which is
new in 4.1.34:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.34-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    machine-config-server                         https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       f56d736e74af8fb0dc85c4b1ee3cc8d1d1f6600b
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.1.31-x86_64 | grep machine-config
    machine-config-controller                     https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-daemon                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    machine-config-server                         https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
    setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       b38afe6e5b79a3e11e881429dc4c7c70e8784e84
  $ git --no-pager log --oneline --first-parent -2 f56d736e74a
  f56d736e (origin/release-4.1) Merge pull request openshift#1147 from openshift-cherrypick-robot/cherry-pick-1114-to-release-4.1
  1301934a Merge pull request openshift#1382 from vrutkovs/4.1-containers-conf-generated

The 4.2 machine-config fix was [2], landed in bd358bb [11], which is new
in 4.2.18:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       31fed93186c9f84708f5cdfd0227ffe4f79b31cd
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       9366460085b2a24d825380759f554769ec5ab4f9
  $ git --no-pager log --oneline --first-parent -2 9366460085
  93664600 Merge pull request openshift#1362 from rphillips/fixes/1787581_4.2
  bd358bb7 Merge pull request openshift#1323 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.2

The 4.3 machine-config fix was [12], landed in 9fd53bd [13], which
landed early enough for 4.3.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64 | grep machine-config
    machine-config-operator                       https://github.com/openshift/machine-config-operator                       23a6e6fb37e73501bc3216183ef5e6ebb15efc7a
  $ git --no-pager log --oneline --first-parent -8 23a6e6fb37
  23a6e6fb Merge pull request openshift#1348 from openshift-cherrypick-robot/cherry-pick-1285-to-release-4.3
  80c8aed7 Merge pull request openshift#1343 from retroflexer/cherry-pick-backup-restore-kube-static-resources
  269990a3 Merge pull request openshift#1344 from openshift-cherrypick-robot/cherry-pick-1296-to-release-4.3
  fd3ca395 Merge pull request openshift#1338 from runcom/fix-go-mod
  ba304dbb Merge pull request openshift#1333 from openshift-cherrypick-robot/cherry-pick-1278-to-release-4.3
  787f3fa9 Merge pull request openshift#1332 from runcom/reserved-cpus-4.3
  2b85d6ba Merge pull request openshift#1329 from openshift-cherrypick-robot/cherry-pick-1314-to-release-4.3
  9fd53bd5 Merge pull request openshift#1322 from openshift-cherrypick-robot/cherry-pick-1320-to-release-4.3

The 4.4 machine-config fix was [3], which landed before any 4.4 RCs
were cut.  Even in 4.4, the generated note was the first content
touch to this template:

  $ git --no-pager log --oneline --follow origin/release-4.4 -- templates/common/_base/files/container-storage.yaml
  46c4e27a (origin/pr/1320) templates/container-storage: Add a "this is generated" note
  47a6321c templates: Move container-storage.yaml into common/
  74ae3b31 (origin/pr/330) Add ContainerRuntime CRD and Controller

(47a6321c was a pure rename).

So the MCD has been annotating storage.conf since 4.1.34, 4.2.18, and
all 4.3 and later releases.  When has the RPM-installed storage.conf
changed?  Figuring this part out is a bit awkward, because we need to
drill down machine-os-content -> RHCOS -> RPM -> file.  For example,
from 4.2.16 -> 4.2.18 [14]:

  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.16-x86_64) | jq -r .config.config.Labels.version
  42.81.20200114.0
  $ oc image info --output json $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64) | jq -r .config.config.Labels.version
  42.81.20200203.1
  $ ./differ.py --first-endpoint art --first-version 42.81.20200114.0 --second-endpoint art --second-version 42.81.20200203.1 | jq -r '.diff | keys | sort[]'
  cri-o
  ignition
  libarchive
  machine-config-daemon
  openshift-clients
  openshift-hyperkube
  sqlite-libs

storage.conf is managed by the containers-common RPM, which is absent
from that diff, so the file did not change from 4.2.16 to 4.2.18, and
that update will safely pull in the fixed MCD without a surprising
pivot change.  Here are our changes to the RPM across the various z
streams:

  $ for OCP in 4.1.1 4.1.16 4.1.17 4.1.23 4.1.24 4.1.28 4.1.29 4.1.31-x86_64 4.1.34-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.1/${RHCOS}/commitmeta.json" | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .')"; echo "${COMMON} ${RHCOS} ${OCP}"; done
  ["containers-common","1","0.1.32","4.git1715c90.el8","x86_64"] 410.8.20190606.0 4.1.1
  ["containers-common","1","0.1.32","4.git1715c90.el8","x86_64"] 410.8.20190910.1 4.1.16
  ["containers-common","1","0.1.32","5.git1715c90.el8","x86_64"] 410.8.20190918.0 4.1.17
  ["containers-common","1","0.1.32","5.git1715c90.el8","x86_64"] 410.8.20191030.0 4.1.23
  ["containers-common","1","0.1.37","5.module+el8.1.0+4240+893c1ab8","x86_64"] 410.81.20191112.2 4.1.24
  ["containers-common","1","0.1.37","5.module+el8.1.0+4240+893c1ab8","x86_64"] 410.81.20191210.0 4.1.28
  ["containers-common","1","0.1.37","6.module+el8.1.0+4876+e678a192","x86_64"] 410.81.20191223.0 4.1.29
  ["containers-common","1","0.1.37","6.module+el8.1.0+4876+e678a192","x86_64"] 410.81.20200114.0 4.1.31-x86_64
  ["containers-common","1","0.1.40","8.module+el8.1.1+5351+506397b0","x86_64"] 410.81.20200204.1 4.1.34-x86_64
  $ for OCP in 4.2.0-rc.0 4.2.2 4.2.4 4.2.12 4.2.13 4.2.18-x86_64 4.2.19-x86_64 4.2.20-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.2/${RHCOS}/commitmeta.json" | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .')"; echo "${COMMON} ${RHCOS} ${OCP}"; done
  ["containers-common","1","0.1.32","5.git1715c90.el8","x86_64"] 42.80.20190930.1 4.2.0-rc.0
  ["containers-common","1","0.1.32","5.git1715c90.el8","x86_64"] 42.80.20191022.0 4.2.2
  ["containers-common","1","0.1.37","5.module+el8.1.0+4240+893c1ab8","x86_64"] 42.81.20191107.0 4.2.4
  ["containers-common","1","0.1.37","5.module+el8.1.0+4240+893c1ab8","x86_64"] 42.81.20191210.1 4.2.12
  ["containers-common","1","0.1.37","6.module+el8.1.0+4876+e678a192","x86_64"] 42.81.20191223.0 4.2.13
  ["containers-common","1","0.1.37","6.module+el8.1.0+4876+e678a192","x86_64"] 42.81.20200203.1 4.2.18-x86_64
  ["containers-common","1","0.1.40","8.module+el8.1.1+5351+506397b0","x86_64"] 42.81.20200210.0 4.2.19-x86_64
  ["containers-common","1","0.1.40","8.module+el8.1.1+5351+506397b0","x86_64"] 42.81.20200217.0 4.2.20-x86_64
  $ for OCP in 4.3.0-rc.0-x86_64 4.3.0-x86_64 4.3.1-x86_64 4.3.2-x86_64 4.3.3-x86_64; do RHCOS="$(oc image info --output json $(oc adm release info --image-for=machine-os-content "quay.io/openshift-release-dev/ocp-release:${OCP}") | jq -r .config.config.Labels.version)"; COMMON="$(curl -s "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "containers-common") | .')"; echo "${COMMON} ${RHCOS} ${OCP}"; done
  ["containers-common","1","0.1.40","2.el8","x86_64"] 43.81.202001072253.0 4.3.0-rc.0-x86_64
  ["containers-common","1","0.1.40","2.el8","x86_64"] 43.81.202001142154.0 4.3.0-x86_64
  ["containers-common","1","0.1.40","3.rhaos.el8","x86_64"] 43.81.202002032142.0 4.3.1-x86_64
  ["containers-common","1","0.1.40","8.module+el8.1.1+5351+506397b0","x86_64"] 43.81.202002110953.0 4.3.2-x86_64
  ["containers-common","1","0.1.40","8.module+el8.1.1+5351+506397b0","x86_64"] 43.81.202002170853.0 4.3.3-x86_64

Fetching a source RPM for containers-common, e.g. from [15,16] shows
the source packages coming from skopeo.  Checking [17]:

  $ git --no-pager log --follow --oneline --stat=200 -M50% -- vendor/github.com/containers/storage/storage.conf
  afaa9e7f Bump github.com/containers/storage from 1.15.1 to 1.15.2
   vendor/github.com/containers/storage/storage.conf | 3 ---
   1 file changed, 3 deletions(-)
  39ff039b Image encryption/decryption support in skopeo
   vendor/github.com/containers/storage/storage.conf | 44 +++++++++++++++++++++++++-------------------
   1 file changed, 25 insertions(+), 19 deletions(-)
  05ae513b Bump github.com/containers/buildah from 1.8.4 to 1.11.4
   vendor/github.com/containers/storage/storage.conf | 7 -------
   1 file changed, 7 deletions(-)
  700b3102 update github.com/containers/{image,storage}
   vendor/github.com/containers/storage/storage.conf | 8 ++++++++
   1 file changed, 8 insertions(+)
  033b2902 migrate to go modules
   vendor/github.com/containers/storage/storage.conf | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 130 insertions(+)
  $ git --no-pager log --follow --oneline --stat=200 -M50% 033b2902^ -- contrib/storage.conf
  fe259105 add storage.conf and manpage in contrib/
   contrib/storage.conf | 28 ++++++++++++++++++++++++++++
   1 file changed, 28 insertions(+)
  $ for HASH in fe259105 033b2902 700b3102 05ae513b 39ff039b afaa9e7f; do git describe --contains "${HASH}"; done
  v0.1.29~3^2
  v0.1.38~14^2~2
  v0.1.39~1
  v0.1.41~25^2
  v0.1.41~21^2
  v0.1.41~12^2

So changes may have been made in 0.1.29 (when the file landed for the
first time, likely from wherever we store post-Git patches), and were
likely made in 0.1.38, 0.1.39, and 0.1.41.  However, the skopeo and
derivative containers-common RPMs may have had patched versions of the
file tracked in dist-git [18].  Comparing the dist-git 4.1 tip with
the machine-config template:

  $ git -C containers/skopeo remote -v | grep 'dist-git.*fetch'
  dist-git git://pkgs.devel.redhat.com/rpms/skopeo.git (fetch)
  $ git --no-pager -C containers/skopeo log --date=short --format='%ad %h %s' -2 dist-git/rhaos-4.1-rhel-8 -- storage.conf
  2018-07-18 3757b210 add statx to seccomp.json to containers-config add seccomp.json to containers-config
  2017-11-08 284f9024 Force storage.conf to default to overlay
  $ git --no-pager -C containers/skopeo grep '^Version:' 3757b210
  3757b210:skopeo.spec:Version: 0.1.31
  $ diff -U3 <(git -C containers/skopeo cat-file -p 3757b210:storage.conf) <(sed 's/^    //' openshift/machine-config-operator/templates/common/_base/files/container-storage.yaml)
  --- /dev/fd/63 2020-02-20 01:13:48.073704685 -0800
  +++ /dev/fd/62	 2020-02-20 01:13:48.073704685 -0800
  @@ -1,3 +1,10 @@
  +filesystem: "root"
  +mode: 0644
  +path: "/etc/containers/storage.conf"
  +contents:
  +  inline: |
  +# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
  +#
   # storage.conf is the configuration file for all tools
   # that share the containers/storage libraries
   # See man 5 containers-storage.conf for more information

So the machine-config master (5ed0aee72c) only differs from the old
0.1.31 RPM storage.conf by the "file is generated" marker.

There does not seem to be any 4.2-specific content.  Presumably
they're using the same rhaos-4.1-rhel-8 RPMs.  4.3 has some changes:

  $ git --no-pager log --date=short --format='%ad %h %s' -2 --stat=80 dist-git/rhaos-4.3-rhel-8 -- storage.conf
  2019-12-09 4a131916 skopeo-0.1.40-2.el8

   storage.conf | 39 +++++++++++++++++++++++++++++----------
   1 file changed, 29 insertions(+), 10 deletions(-)
  2019-10-08 13a4ce10 skopeo-1:0.1.40-0.1.gitf72e39f

   storage.conf | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   1 file changed, 114 insertions(+)

So it looks like we can ignore the dev skopeo repository, focus on the
dist-git skopeo repository, and say that before 0.1.40-2.el8 we had a
version of storage.conf in the RPMs that matched the unpatched
machine-config templates, and with 0.1.40-2.el8 and later the RPMs had
different content.  Can we check the RPMs to confirm?

The product pages are claiming containers-common-0.1.32 for 4.1.34
[19,20].  Those product pages are fed from RPM Errata reports, and ART
builds those Errata by sweeping RPM repositories in the vicinity of
the RHCOS builds.  So there's a potential for races like:

1. RPM Errata sweep fires and grabs RPM A v1.
2. New RPM A v2 pushed to the repository.
3. RHCOS build hits repositories and grabs RPM A v2.

The RPMs referenced by releases-rhcos-art.cloud are reliable, but
actually tracking down the referenced RPMs to download them is
complicated (especially for module builds like containers-common).

But here are two RPM-lookup procedures that seem more reliable:

A. From [21]:

   1. On [21], find the matching skopeo package,
      e.g. skopeo-0.1.40-2.el8.  Click through to the Advisory,
      e.g. [22].

   2. On [22], find the matching skopeo package, expand the CDN RPMs
      section to see the containers-common RPM link, e.g. [23].

   3. Click through to /etc/containers/storage.conf, e.g. [24].

   4. See the sha256, e.g. a6423cca39d0cde0d6ee82163630d288e8876ab7d39d2678f6d86d804bf61044.

B. From [25].  This works better for module builds.

   1. Search for the skopeo package from [25], e.g. [26], which leads
      to [27].

   2. Find the matching package,
      e.g. skopeo-0.1.37-5.module+el8.1.0+4240+893c1ab8, and click
      through to [28].

   3. Find the x86_64 containers-common RPM, and click through to info
      [29].  Continue from step A.3.

Summarizing storage.conf digests for the various RPMs:

* containers-common-1:0.1.32-4.git1715c90.el8.x86_64
  Used for 4.1.1 through 4.1.16.
  ee7daca89532d5a80da391fc358776ec11eff256c497652c49505acc70b96822 [30]
* containers-common-1:0.1.32-5.git1715c90.el8.x86_64
  Used for 4.1.17 through 4.1.23, 4.2.0-rc.0 through 4.2.2.
  ee7daca89532d5a80da391fc358776ec11eff256c497652c49505acc70b96822 [31]
* containers-common-1:0.1.37-5.module+el8.1.0+4240+893c1ab8.x86_64
  Used for 4.1.24 through 4.1.28, 4.2.4 through 4.2.12.
  ee7daca89532d5a80da391fc358776ec11eff256c497652c49505acc70b96822 [32]
* containers-common-1:0.1.37-6.module+el8.1.0+4876+e678a192.x86_64
  Used for 4.1.29 through 4.1.31, 4.2.13 through 4.2.18.
  ee7daca89532d5a80da391fc358776ec11eff256c497652c49505acc70b96822 [33]
* containers-common-1:0.1.40-2.el8.x86_64
  Used for 4.3.0-rc.0 through 4.3.0.
  a6423cca39d0cde0d6ee82163630d288e8876ab7d39d2678f6d86d804bf61044 [24]
* containers-common-1:0.1.40-3.rhaos.el8.x86_64
  Used for 4.3.1.
  a6423cca39d0cde0d6ee82163630d288e8876ab7d39d2678f6d86d804bf61044 [34]
* containers-common-1:0.1.40-8.module+el8.1.1+5351+506397b0.x86_64
  Used for 4.2.19, 4.2.20, and 4.1.34.
  a6423cca39d0cde0d6ee82163630d288e8876ab7d39d2678f6d86d804bf61044 [35]

So there are only two versions in the RPMs, ee7daca895 used for all
4.1 and 4.2, and a6423cca39 used for all 4.3.  That means that the
vulnerable transitions are 4.2.16 and earlier going into 4.3.  It also
means that there's a potential for future trouble in transitions from
4.1.31 and earlier to a future 4.1 or 4.2 where the RPM-installed
content is different, and from 4.2.16 and earlier to a future 4.2
where the RPM-installed content is different, but that we have no such
4.1 or 4.2 changes at the moment.
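
The digest list above can be turned into a quick check for whether an
update changes the RPM-installed storage.conf.  This is a minimal
sketch in POSIX shell, not part of the original tooling: the function
names storage_conf_digest and storage_conf_changed are illustrative,
the keys are containers-common version-release strings without epoch
or architecture, and the digests are copied from the list above.

```shell
# Print the sha256 of the RPM-installed /etc/containers/storage.conf for a
# containers-common version-release.  Values are copied from the list above.
# Returns 1 for a build that is not in the list.
storage_conf_digest() {
  case "$1" in
    0.1.32-4.git1715c90.el8 | 0.1.32-5.git1715c90.el8 | \
    0.1.37-5.module+el8.1.0+4240+893c1ab8 | \
    0.1.37-6.module+el8.1.0+4876+e678a192)
      echo ee7daca89532d5a80da391fc358776ec11eff256c497652c49505acc70b96822 ;;
    0.1.40-2.el8 | 0.1.40-3.rhaos.el8 | \
    0.1.40-8.module+el8.1.1+5351+506397b0)
      echo a6423cca39d0cde0d6ee82163630d288e8876ab7d39d2678f6d86d804bf61044 ;;
    *)
      return 1 ;;
  esac
}

# storage_conf_changed FROM TO
# Exit 0 when the two builds ship different storage.conf content, 1 when the
# content is identical, and 2 when either build is unknown.
storage_conf_changed() {
  from_digest="$(storage_conf_digest "$1")" || return 2
  to_digest="$(storage_conf_digest "$2")" || return 2
  [ "${from_digest}" != "${to_digest}" ]
}

# Example: per the list above, 4.2.16 ships 0.1.37-6 and 4.3.0 ships 0.1.40-2.
if storage_conf_changed 0.1.37-6.module+el8.1.0+4876+e678a192 0.1.40-2.el8; then
  echo "storage.conf content differs across this update"
fi
```

Keeping the mapping keyed by RPM build rather than by OpenShift release
means the table only needs a new entry when a new containers-common
build ships.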

So ideally this pull would block edges from 4.2.16 and earlier into
4.3.  This commit drops 4.2.16 from the *-4.3 channels with a comment
about this bug.  This also explicitly blocks edges from 4.2 into
4.3.0, because 4.3.0 is the only 4.3 release which recommends 4.2.16
or earlier as an update edge.

  $ for i in $(seq 0 3); do echo -n "$i "; oc adm release info "quay.io/openshift-release-dev/ocp-release:4.3.$i-x86_64" | grep Upgrades; done
  0   Upgrades: 4.2.16, 4.3.0-rc.0, 4.3.0-rc.1, 4.3.0-rc.2, 4.3.0-rc.3
  1   Upgrades: 4.2.18, 4.3.0-rc.0, 4.3.0-rc.3, 4.3.0
  2   Upgrades: 4.2.19, 4.3.0, 4.3.1
  3   Upgrades: 4.2.20, 4.3.0, 4.3.1, 4.3.2

There shouldn't be much pushback on pulling the edge, because users
can still move from 4.2 to 4.3 via 4.2.18 -> 4.3.1, both of which are
already in fast-4.3.
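
The blocking rule described above (no edges from 4.2.16 or earlier
into any 4.3 release) can be expressed as a small predicate.  This is
a sketch under that stated rule only; the function name edge_blocked
is illustrative and is not how the channel files are actually
maintained.

```shell
# edge_blocked FROM TO
# Exit 0 when FROM is a 4.2 release at or before 4.2.16 and TO is a 4.3
# release; exit non-zero for every other edge.
edge_blocked() {
  case "$2" in
    4.3.*) ;;
    *) return 1 ;;
  esac
  case "$1" in
    4.2.*) ;;
    *) return 1 ;;
  esac
  # Extract the patch number: drop the "4.2." prefix, then any suffix such
  # as "-rc.0" or "-x86_64".
  patch="${1#4.2.}"
  patch="${patch%%-*}"
  [ "${patch}" -le 16 ]
}

# Classify a few of the edges into 4.3.0 and 4.3.1 from the Upgrades output.
for edge in "4.2.16 4.3.0" "4.3.0-rc.0 4.3.0" "4.2.18 4.3.1"; do
  set -- ${edge}
  if edge_blocked "$1" "$2"; then
    echo "block $1 -> $2"
  else
    echo "keep  $1 -> $2"
  fi
done
```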

Also simplify the wording on the GCP bug 1793635, which remains
unfixed.

[1]: openshift/machine-config-operator#1320 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1782152#c5
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1781708#c0
[4]: https://github.com/openshift/machine-config-operator/pull/1320/files
[5]: openshift/machine-config-operator#1190
[6]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/MachineConfiguration.md
[7]: https://github.com/openshift/machine-config-operator/blob/13f0dda734262c3edbd23c007e42b7704125e88f/docs/ContainerRuntimeConfigDesign.md
[8]: openshift/machine-config-operator#330 (comment)
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1782153
[10]: openshift/machine-config-operator#1382 (comment)
[11]: openshift/machine-config-operator#1323 (comment)
[12]: https://bugzilla.redhat.com/show_bug.cgi?id=1782149
[13]: openshift/machine-config-operator#1322 (comment)
[14]: https://gitlab.cee.redhat.com/coretools/differ
      Internal link, sorry :/  But you can also browse the history at:
      https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.2&release=42.81.20200114.0 etc.
[15]: https://access.redhat.com/downloads/content/290/ver=4.2/rhel---8/4.2.0/x86_64/packages
[16]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8841/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
[17]: https://github.com/containers/skopeo/
[18]: http://pkgs.devel.redhat.com/cgit/rpms/skopeo/
[19]: https://access.redhat.com/downloads/content/290/ver=4.1/rhel---8/4.1.34/x86_64/packages
[20]: https://access.redhat.com/downloads/content/rhel---8/x86_64/8384/containers-common/0.1.32-5.git1715c90.el8/x86_64/fd431d51/package
[21]: https://errata.devel.redhat.com/package/show/skopeo
[22]: https://errata.devel.redhat.com/errata/content/46255
[23]: https://brewweb.engineering.redhat.com/brew/rpminfo?rpmID=7604818
[24]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=7604818&filename=/etc/containers/storage.conf
[25]: https://brewweb.engineering.redhat.com/brew/search
[26]: https://brewweb.engineering.redhat.com/brew/search?match=glob&type=package&terms=skopeo
[27]: https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=58395
[28]: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=971200
[29]: https://brewweb.engineering.redhat.com/brew/rpminfo?rpmID=7349205
[30]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=6958325&filename=/etc/containers/storage.conf
[31]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=7334504&filename=/etc/containers/storage.conf
[32]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=7349205&filename=/etc/containers/storage.conf
[33]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=7550403&filename=/etc/containers/storage.conf
[34]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=7727297&filename=/etc/containers/storage.conf
[35]: https://brewweb.engineering.redhat.com/brew/fileinfo?rpmID=7656074&filename=/etc/containers/storage.conf
@cgwalters
Member Author

Dropping this here for lack of a better place: A much bigger path we could take would be to have the MCO build a derived OSTree commit from the rendered MachineConfig (like as a build process) and serve that to other nodes in the cluster. This means we're not going into each node and changing config files (or things like the kernel-rt overrides) and also leads more strongly towards having /etc mounted read-only by default.

But going this path takes us entirely away from supporting traditional RHEL - so we'd also need to do #1592

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/lifecycle frozen

@cgwalters cgwalters reopened this Feb 5, 2021
@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 5, 2021
@cgwalters
Member Author

xref ostreedev/ostree#2220

@cgwalters
Member Author

Also of note, rpm-ostree is getting closer to stabilizing our "apply-live" code, xref https://lists.fedoraproject.org/archives/list/[email protected]/thread/MQWBKRFCYH2GB3CW5CG722RGQAEPHHAN/ - once that happens we can also support e.g. live-applying a subset of the changes to /etc - and as well potentially things like "just cherry pick the new kubelet package" and take the kernel update downtime later.

@cgwalters
Member Author

In recent discussion it was realized that a slightly hacky but totally viable way to do this would be for the MCO to take the Ignition config content under filesystem: and synthesize an RPM from it, then rpm-ostree install ./machine-config.rpm.

We've thought about this some in the context of non-RPM content; some discussion related to that in coreos/rpm-ostree#2326
And various prep work for it has landed, but it's not going to be soon. Whereas layering RPMs works today.

@cgwalters
Member Author

One thing the MCO could do today is: Before initiating any node level changes to /etc, run ostree admin deploy --not-as-default <current commit> (and we could wrap this in rpm-ostree too, but the ostree API exists today).

The effect of this is that we snapshot the current /etc before the MCO mutates it, and it would help ensure that rpm-ostree rollback (or just picking the previous bootloader entry) has a consistent state.

A downside of this approach is that by snapshotting the current /etc, it re-introduces race conditions with any processes besides the MCO that are live-mutating /etc. (Which apparently includes various daemonsets, there's also LVM which writes state in /etc).
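
A rough sketch of that snapshot step, for illustration only. The `ostree admin deploy --not-as-default` invocation is the one named above; the function name `snapshot_etc` and the `OSTREE` override variable (a hook for dry runs and testing) are assumptions, not anything the MCO does today.

```shell
# snapshot_etc COMMIT
# Deploy COMMIT (expected to be the currently booted ostree commit) as an
# additional, non-default deployment, so that the current /etc is captured
# before anything mutates it.
# OSTREE is a hypothetical override for the ostree binary; setting
# OSTREE=echo prints the arguments instead of running the deploy.
snapshot_etc() {
  commit="${1:-}"
  if [ -z "${commit}" ]; then
    echo "usage: snapshot_etc <current commit>" >&2
    return 1
  fi
  ${OSTREE:-ostree} admin deploy --not-as-default "${commit}"
}

# Dry run: show what would be executed for a placeholder commit.
OSTREE=echo snapshot_etc 0123abcd
```

On a real node the booted commit would have to be looked up first, for example from the `checksum` field of the booted entry in `rpm-ostree status --json`; treat that lookup as an assumption to verify against the rpm-ostree version in use.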

@sinnykumari
Contributor

> A downside of this approach is that by snapshotting the current /etc, it re-introduces race conditions with any processes besides the MCO that are live-mutating /etc. (Which apparently includes various daemonsets, there's also LVM which writes state in /etc).

Right. Another example is the SRIOV operator, which makes direct changes to the node when it performs per-node config changes.

@cgwalters
Member Author

Closing this as a dup of #3137 which we'll hopefully do with layering.
