Add concept of state overlays #3120

jlebon · 2023-12-14T21:50:24Z

In the OSTree model, executables go in /usr, state in /var and configuration in /etc. Software that lives in /opt however messes this up because it often mixes code and state, making it harder to manage.

More generally, it's sometimes useful to have the OSTree commit contain code under a certain path, but still allow that path to be writable by software and the sysadmin at runtime (/usr/local is another instance).

Add the concept of state overlays. A state overlay is an overlayfs mount whose upper directory, which contains unmanaged state, is carried forward on top of a lower directory, containing OSTree-managed files.
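To make the layering concrete, here is a minimal Python sketch that assembles the arguments for such an overlayfs mount, with the managed path itself acting as the lower directory. The helper name and the `upper`/`work` layout under a per-overlay state directory are assumptions for illustration only; the PR's implementation is in C and may differ.

```python
import os


def overlay_mount_args(mountpoint: str, state_dir: str) -> list[str]:
    """Build a mount(8) command line for an overlay mounted over `mountpoint`.

    `state_dir` is a hypothetical per-overlay directory. overlayfs needs the
    writable `upper` layer and its `work` scratch directory to live on the
    same filesystem, so both are placed under it.
    """
    upper = os.path.join(state_dir, "upper")
    work = os.path.join(state_dir, "work")
    # The OSTree-managed content at `mountpoint` is the read-only lower layer.
    options = f"lowerdir={mountpoint},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, mountpoint]
```

The function only builds the command; actually running it requires root and existing `upper` and `work` directories.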

In the example of /usr/local, OSTree commits can ship content there, all while allowing users to e.g. add scripts in /usr/local/bin when booted into that commit.

Some reconciliation logic is executed whenever the base is updated so that newer files in the base are never shadowed by a copied up version in the upper directory. This matches RPM semantics when upgrading packages whose files may have been modified.
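The reconciliation idea can be sketched roughly as follows. This is a simplified Python illustration, not the PR's C code: it assumes that after a base update every upper entry that would shadow a lower entry is dropped, and that a directory in the upper is removed wholesale when the lower now has a non-directory at that path. The function name is hypothetical.

```python
import os
import shutil
import stat


def prune_upper(lower: str, upper: str) -> None:
    """Remove entries in `upper` that would shadow entries in `lower`."""
    for name in os.listdir(upper):
        upper_path = os.path.join(upper, name)
        lower_path = os.path.join(lower, name)
        try:
            lower_st = os.lstat(lower_path)
        except FileNotFoundError:
            continue  # exists only in the upper: local state, keep it
        upper_is_dir = stat.S_ISDIR(os.lstat(upper_path).st_mode)
        lower_is_dir = stat.S_ISDIR(lower_st.st_mode)
        if upper_is_dir and lower_is_dir:
            # Both sides are directories: reconcile their contents.
            prune_upper(lower_path, upper_path)
        elif upper_is_dir:
            # Lower now has a file or symlink here: the upper directory
            # (including any local files inside it) is dropped.
            shutil.rmtree(upper_path)
        else:
            # Copied-up file (or whiteout) shadowing the base: base wins.
            os.unlink(upper_path)
```

Files that exist only in the upper directory survive, which is what lets locally added scripts persist across updates.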

For ease of integration, this is exposed as a systemd template unit which any downstream distro/user can enable. The instance name is the mountpath in escaped systemd path notation (e.g. [email protected]).
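For reference, the instance name for a given mount path can be derived with the same escaping that `systemd-escape --path` applies. The Python sketch below reimplements the common case of that escaping; the function name is made up, and it does not handle `.` or `..` path components the way systemd's own path simplification does.

```python
import string

# Bytes systemd leaves unescaped in unit names (letters, digits, ":", "_", ".").
_SAFE = set((string.ascii_letters + string.digits + ":_.").encode())


def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for an absolute mount path."""
    parts = [p for p in path.split("/") if p]  # drop leading/trailing/duplicate "/"
    if not parts:
        return "-"  # the root directory escapes to a single dash
    raw = "/".join(parts).encode("utf-8")
    out = []
    for i, byte in enumerate(raw):
        if byte == ord("/"):
            out.append("-")
        elif byte in _SAFE and not (i == 0 and byte == ord(".")):
            out.append(chr(byte))
        else:
            # Everything else, including a literal "-", becomes \xXX.
            out.append(f"\\x{byte:02x}")
    return "".join(out)
```

For example, `systemd_escape_path("/usr/local")` yields `usr-local`, which is the part that goes between the `@` and the unit suffix.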

See discussions in #3113 for more details.

cgwalters · 2023-12-14T22:54:21Z

There's a lot of appeal to this for sure. It does open up the question: if we were to do this, what would it look like to do it for everything, e.g. having it also take over handling of /etc and /var?

We had a realtime chat on this, but for the record: The thing that I fear a bit is basically that while I agree it will largely Just Work for things like RPMs/debs that install in /opt and want to write extra data there...the flip side is anywhere in these directories you can "leak state". Of course, state can be leaked in /etc and /var too today! But a big part of the idea is that admins only have those two places to save/restore. With this, there's potentially new important persistent state in these mount points.

However there are some other things:

> Some reconciliation logic is executed whenever the base is updated so that newer files in the base are never shadowed by a copied up version in the upper directory. This matches RPM semantics when upgrading packages whose files may have been modified.

There are a lot of potentially subtle corner cases in this. For example, what happens when the lower replaces a directory with a symlink? (A classic RPM problem). It looks like your logic unconditionally deletes all content in the upper, which could include extra "local" state files too. That isn't an incorrect behavior, but it's one case which could result in some data loss, depending on the scenario. To be clear there isn't a magic solution here; the crude hammer of root.transient just forces the "data loss" to always happen on every OS update, instead of just sometimes. (Really it forces admins to manually choose what persists by symlinking/mounting those files to /var).

Note that the ostree /etc semantic handles it differently; the "upper" always wins, meaning you can instead silently not get updates in this scenario. Now, changing the "type" of an entry in /etc between dir and !dir is IME exceedingly rare, i.e. I don't know of a time it has happened. But while it is rare, it definitely happens to change between dir and !dir in /usr.

Now, if I'm understanding things right, state overlays are orthogonal to root.transient, which would be great. I could imagine for example wanting to enable root.transient in an OS build, but then also add a state overlay for some specific /opt/someapp.

Visualization/debuggability: overlayfs is a pretty raw kernel feature; but with e.g. container tooling there are tools to introspect the "layers" (that usually map to overlayfs dirs). I am sure we'd need to have something like ostree state-overlay diff, like we have ostree admin config-diff, in the future.

cgwalters · 2023-12-14T23:11:32Z

I think my biggest concern in a nutshell is the "unknown unknowns" of this at least to me. The implementation of root.transient is extremely simple, and its semantics very well known and easy to understand and explain due to its prevalence since the creation of docker.

This is a more complex approach, and its semantics required me to think carefully and read the code. I wonder if anyone else is using this approach for anything? I can't imagine it's actually novel, but it'd really help me to see something else doing this for some part of a system and how they document it.

I think https://kubic.opensuse.org/documentation/man-pages/transactional-update.8.html might use something like this for /etc?

May see if I can reach out on some social media.

jlebon · 2023-12-15T04:43:46Z

> There's a lot of appeal to this for sure. It does open up the question: if we were to do this, what would it look like to do it for everything, e.g. having it also take over handling of /etc and /var?

Yeah, definitely could be interesting. Obviously, it's got a much larger impact (and e.g. the fact that overlayfs isn't fully POSIX compliant has a higher chance to rear its head if all OS state is on it).

> We had a realtime chat on this, but for the record: The thing that I fear a bit is basically that while I agree it will largely Just Work for things like RPMs/debs that install in /opt and want to write extra data there...the flip side is anywhere in these directories you can "leak state". Of course, state can be leaked in /etc and /var too today! But a big part of the idea is that admins only have those two places to save/restore. With this, there's potentially new important persistent state in these mount points.

The nice thing though is that the upperdir is on /var and because of the beauty of overlayfs, the dir is quite usable from a backup perspective. There are the whiteout character devices which are weird, but a backup service could pretty much ignore those. Backing up the mountpoint is OK too, but would pick up base content too.
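As a rough illustration of the backup point, the Python sketch below lists what a backup tool might pick up from an upper directory while skipping whiteouts, which overlayfs represents as character devices with device number 0/0. The function names are hypothetical, and the sketch ignores empty directories and opaque-directory xattrs.

```python
import os
import stat


def is_whiteout(st) -> bool:
    """overlayfs marks a deleted lower entry with a 0/0 character device."""
    return stat.S_ISCHR(st.st_mode) and st.st_rdev == 0


def backup_candidates(upperdir: str, rel: str = ""):
    """Yield paths relative to `upperdir` that are worth backing up."""
    with os.scandir(os.path.join(upperdir, rel)) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            entry_rel = os.path.join(rel, entry.name)
            if entry.is_dir(follow_symlinks=False):
                # Recurse into real directories only, never through symlinks.
                yield from backup_candidates(upperdir, entry_rel)
            elif not is_whiteout(entry.stat(follow_symlinks=False)):
                yield entry_rel
```

Because only the upper directory is walked, base content shipped in the OSTree commit never appears in the result.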

> There are a lot of potentially subtle corner cases in this. For example, what happens when the lower replaces a directory with a symlink? (A classic RPM problem). It looks like your logic unconditionally deletes all content in the upper, which could include extra "local" state files too. That isn't an incorrect behavior, but it's one case which could result in some data loss, depending on the scenario. To be clear there isn't a magic solution here; the crude hammer of root.transient just forces the "data loss" to always happen on every OS update, instead of just sometimes. (Really it forces admins to manually choose what persists by symlinking/mounting those files to /var).
>
> Note that the ostree /etc semantic handles it differently; the "upper" always wins, meaning you can instead silently not get updates in this scenario. Now, changing the "type" of an entry in /etc between dir and !dir is IME exceedingly rare, i.e. I don't know of a time it has happened. But while it is rare, it definitely happens to change between dir and !dir in /usr.

Indeed. I'll note though that the condition here would have to be that it's an entry in e.g. /opt or /usr/local which changes from dir to !dir and that it's in the upper dir (i.e. users/apps are expected to write to that dir). Given that RPM itself errors out, I wonder how likely this is. Of course, users could do whatever they want and write to places they shouldn't, so definitely something to be aware of. Instead of deleting, we could also rename the directory out of the way and warn.

> Now, if I'm understanding things right, state overlays are orthogonal to root.transient, which would be great. I could imagine for example wanting to enable root.transient in an OS build, but then also add a state overlay for some specific /opt/someapp.

Correct!

> Visualization/debuggability: overlayfs is a pretty raw kernel feature; but with e.g. container tooling there are tools to introspect the "layers" (that usually map to overlayfs dirs). I am sure we'd need to have something like ostree state-overlay diff, like we have ostree admin config-diff, in the future.

Could be interesting down the line. In the vanilla additive case, tree /var/ostree/state-overlays/$overlay/upper gives you that, but definitely whiteouts and opaque dirs could benefit from better visualization.
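To give a feel for what such a diff could report, here is a hypothetical Python sketch that classifies upper-directory entries against the lower directory as added, modified, or deleted. No such `ostree` subcommand is confirmed by this thread; the function name and the `A`/`M`/`D` codes are invented for illustration, and opaque directories (an xattr that typically needs privileges to read) are not detected.

```python
import os
import stat


def overlay_diff(lower: str, upper: str, rel: str = ""):
    """Yield (status, path) pairs: A = added, M = modified, D = deleted."""
    with os.scandir(os.path.join(upper, rel)) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            entry_rel = os.path.join(rel, entry.name)
            st = entry.stat(follow_symlinks=False)
            if stat.S_ISDIR(st.st_mode):
                # Directories are reported through their contents only.
                yield from overlay_diff(lower, upper, entry_rel)
            elif stat.S_ISCHR(st.st_mode) and st.st_rdev == 0:
                yield ("D", entry_rel)  # whiteout hiding a lower entry
            elif os.path.lexists(os.path.join(lower, entry_rel)):
                yield ("M", entry_rel)  # copied-up version of a base file
            else:
                yield ("A", entry_rel)  # exists only in the upper
```

In the purely additive case this prints the same set of files as `tree` on the upper directory, just with an explicit status per path.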

> I think my biggest concern in a nutshell is the "unknown unknowns" of this at least to me. The implementation of root.transient is extremely simple, and its semantics very well known and easy to understand and explain due to its prevalence since the creation of docker.
>
> This is a more complex approach, and its semantics required me to think carefully and read the code.

I think this is worth teasing apart a bit more. Semantics-wise, I would describe it as "package manager-like", which is of course familiar to a lot of people in this space. If you have to describe it to an end-user, that would suffice. Where it's more complex is in its implementation, which yes, if you hit some corner case you might have to peek under the covers and understand it. We certainly should have docs for that, but the goal is certainly that most people will not need to care.

Agreed re. "unknown unknowns". I'm optimistic about this, but we really need people testing it in realistic scenarios (this is the motivation behind the last commit in coreos/rpm-ostree#4728). Then we can see how well this works and whether we should move forward trying to polish and stabilize it.

> I wonder if anyone else is using this approach for anything? I can't imagine it's actually novel, but it'd really help me to see something else doing this for some part of a system and how they document it.

Yeah, hard to find prior art on this specifically. Definitely also interested to hear from overlayfs SMEs. Maybe @rhvgoyal?

cgwalters · 2023-12-15T13:25:11Z

@ericcurtin I was just referencing that project as potential prior art to look at for a small implementation detail, not proposing depending on it.

cgwalters · 2023-12-15T15:47:37Z

src/ostree/ot-builtin-admin.c

@@ -42,6 +42,8 @@ static OstreeCommand admin_subcommands[] = {
    "Change the finalization locking state of the staged deployment" },
  { "boot-complete", OSTREE_BUILTIN_FLAG_NO_REPO | OSTREE_BUILTIN_FLAG_HIDDEN,
    ot_admin_builtin_boot_complete, "Internal command to run at boot after an update was applied" },
+  { "state-overlay", OSTREE_BUILTIN_FLAG_NO_REPO | OSTREE_BUILTIN_FLAG_HIDDEN,


The CLI is hidden, but the systemd unit is not; so this would become insta-stable; is that the intention? I'm OK with that, just checking.

cgwalters

I'm OK to merge this to just make it easier to experiment with. We'll definitely want a man page for this though.

I'm not totally sure if this should be marked as experimental or not.

jlebon · 2024-01-09T02:48:55Z

> I'm OK to merge this to just make it easier to experiment with. We'll definitely want a man page for this though.

Sure, can add one.

> The CLI is hidden, but the systemd unit is not; so this would become insta-stable; is that the intention? I'm OK with that, just checking.
> ...
> I'm not totally sure if this should be marked as experimental or not.

My intent was to not declare it as stabilized yet, because I really feel it needs validation in the real world first. I didn't think about how to gate the systemd unit itself. I guess we could name it e.g. [email protected] and then rename it once it's stabilized? A bit awkward. We'd also have to carry the old unit name for a while.

In practice, I think people aren't going to know to turn this unit on without looking at the docs, where we can explicitly say this is still experimental. (And even more realistically, I think the primary user of this before stabilizing will be the environment client-side knob demo'ed in coreos/rpm-ostree#233 (comment), where it's more clear that it's experimental.)

jlebon · 2024-01-09T21:51:17Z

Updated!

Though this now requires https://gitlab.gnome.org/GNOME/libglnx/-/merge_requests/52.

Edit: Oh right, this still also needs some docs. Done!

This was referenced Dec 14, 2023

Support RPMs installing in /opt and /usr/local coreos/rpm-ostree#4728

Merged

Add support for full overlayfs for / #3113

Closed

jlebon force-pushed the pr/state-overlays branch from e389c21 to cb6e0e1 Compare December 14, 2023 21:58

jlebon mentioned this pull request Dec 14, 2023

Move content in /opt to /usr/lib/opt as part of ostree container commit if state overlays enabled ostreedev/ostree-rs-ext#573

Closed

jlebon force-pushed the pr/state-overlays branch 2 times, most recently from fb5050d to 3a8dc91 Compare December 14, 2023 22:17

cgwalters reviewed Dec 15, 2023

cgwalters approved these changes Dec 18, 2023

cgwalters mentioned this pull request Dec 20, 2023

prepare-root: Add support for root.transient #3114

Merged

jlebon force-pushed the pr/state-overlays branch from 3a8dc91 to 15d7496 Compare January 9, 2024 21:51

jlebon force-pushed the pr/state-overlays branch from 15d7496 to 30ec186 Compare January 9, 2024 21:56

jlebon force-pushed the pr/state-overlays branch from 30ec186 to 92b1a27 Compare January 10, 2024 04:20

build(deps): bump libglnx from aff1eea to b415d046

e233d02

Bumps libglnx from `aff1eea` to `b415d046`. For https://gitlab.gnome.org/GNOME/libglnx/-/merge_requests/52. Update submodule: libglnx

jlebon force-pushed the pr/state-overlays branch from 9ffae90 to e233d02 Compare January 10, 2024 20:41

cgwalters merged commit 6031f1c into ostreedev:main Jan 11, 2024
24 checks passed

cgwalters mentioned this pull request Jan 17, 2024

New image type: Fedora iot-bootable-container osbuild/images#361

Merged

jlebon mentioned this pull request Jan 19, 2024

Determine support path for RPMs that install into /opt, /usr/local etc coreos/rpm-ostree#233

Closed

cgwalters mentioned this pull request Jan 19, 2024

Release 2024.1 #3141

Merged
