[RFC] Hide /sysroot in a private mount namespace #3358

ruihe774 · 2024-12-18T14:09:26Z

Related: #3211

This PR implements a new option sysroot.invisible in prepare-root.conf to hide /sysroot in a private mount namespace /run/ostree/.private/sysroot-ns to prevent the rest of the system from accessing it, while ostree admin commands can still operate on it inside a mount namespace.

This PR also add a new admin command "nsenter" that runs program in the mount namespace where /sysroot is present. This can wrap tools that currently do not support invisible sysroot. As an example, below is an drop-in override for rpm-ostreed.service:

[Unit]
ConditionPathExists=
ConditionKernelCommandLine=ostree

[Service]
ExecStart=
ExecStart=+/usr/bin/ostree admin nsenter --exec /usr/bin/rpm-ostree start-daemon

I know this is a big change, and there may be corner cases that I haven't considered. However, I have tested it on my local machine, and basic functionalities (booting, deploying, integration with rpm-ostree) work fine. And my code is ready for review.

openshift-ci · 2024-12-18T14:09:38Z

Hi @ruihe774. Thanks for your PR.

I'm waiting for a ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

src/libostree/ostree-impl-system-generator.c

ruihe774 · 2024-12-18T15:35:42Z

The test failed because only very new glibc has wrappers for open_tree and move_mount. I'm going to switch to use raw syscalls instead.

cgwalters

Thanks so much for working on this!

One struggle we have right now is that the prepare-root code is not unit or integration tested well; I wouldn't block this on that, but adding more features there compounds that issue. I may find some time to work on that...

On a different note, in theory this code could be implemented as a service that runs After=ostree-prepare-root.service right? I wonder if it'd be helpful for us to structure it that way. But eh, just a thought.

src/switchroot/ostree-prepare-root.c

src/libostree/ostree-sysroot.c

cgwalters · 2024-12-18T19:01:20Z

The test failed because only very new glibc has wrappers for open_tree and move_mount. I'm going to switch to use raw syscalls instead.

I'm also OK just not having this feature work on older glibc for now.

(Though it's also of course worth noting that in containers/bootc#919 I used the super nice rustix bindings for this which worked fine on C9S even though glibc doesn't have the headers there...and this touches on the ongoing thought of how we might start to use more Rust in libostree in general...or maybe put some of this in bootc to start...)

ruihe774 · 2024-12-19T05:50:06Z

The test failed because only very new glibc has wrappers for open_tree and move_mount. I'm going to switch to use raw syscalls instead.

I'm also OK just not having this feature work on older glibc for now.

(Though it's also of course worth noting that in containers/bootc#919 I used the super nice rustix bindings for this which worked fine on C9S even though glibc doesn't have the headers there...and this touches on the ongoing thought of how we might start to use more Rust in libostree in general...or maybe put some of this in bootc to start...)

I think in first step we can have individual binaries in switchroot rewritten in Rust. Currently we have to maintain two ostree-prepare-root, one dynamically linked and one static. Using Rust we can easily merge them and always generate static builds.

I'll try to make some POC later if I have spare time.

cgwalters · 2024-12-19T13:36:55Z

I think in first step we can have individual binaries in switchroot rewritten in Rust. Currently we have to maintain two ostree-prepare-root, one dynamically linked and one static. Using Rust we can easily merge them and always generate static builds.

We do have a composefs-rs project going on but honestly...I would like to try to remove the static ostree-prepare-root in favor of UKIs basically.

ruihe774 · 2024-12-19T15:46:08Z

I have read some docs about socket activation, fd store, and unix socket. I find it's possible to write a service that uploads the fd obtained by open_tree to systemd and serves it over unix socket on demand using socket activation. Meanwhile, SELinux restrictions can also apply to the path of unix socket. I now consider this a better solution and it does not pollute the global mountinfo.

We can keep current part that hides /sysroot inside sysroot-ns in ostree-prepare-root, and write a service outside initramfs to open_tree, upload, and unmount it. Socket activation can be used to start the service on-daemon to serve the fd. Clients using libostree can obtain the fd through socket. The current part in libostree can be kept as a fallback in case the holder service is not functioning.

However, the problem is that I'm not familiar with systemd and C socket programming. It is difficult for me to implement the aforementioned. I wonder if you could give me some help. Thank you.

cgwalters · 2024-12-19T19:47:33Z

A totally different approach:

Today systemd when using volatile root sets up a special symlink that points to the root block device.

On Linux, it's possible to mount a block device multiple times. So instead of having /sysroot mounted at all, we just create that symlink to the real backing block device (which would be generally useful) and then anything which wants to operate on the physical root (including ostree/bootc) just...mounts it on their own, wherever they want (which may be /sysroot, which we'd leave as an empty directory suitable for this purpose, but again it doesn't need to be there at all).

I have only spent 3 minutes and 5 seconds thinking about this and maybe there's something I'm missing but it would be dramatically simpler than anything discussed so far.

EDIT: (another 20 seconds of thought) A big bonus here is that there's already the ability to have LSM policies which can apply to accessing block devices, so it'd be easy to restrict the set of things accessing the physical root (which would be a special case of general block device access, which really few things should have in general).

cgwalters · 2024-12-19T20:01:33Z

And to be clear I think we should create that symlink unconditionally (should be a separate PR).

Then this PR is just about whether or not to leave /sysroot mounted effectively.

ruihe774 · 2024-12-19T22:27:06Z

A totally different approach:

Today systemd when using volatile root sets up a special symlink that points to the root block device.

On Linux, it's possible to mount a block device multiple times. So instead of having /sysroot mounted at all, we just create that symlink to the real backing block device (which would be generally useful) and then anything which wants to operate on the physical root (including ostree/bootc) just...mounts it on their own, wherever they want (which may be /sysroot, which we'd leave as an empty directory suitable for this purpose, but again it doesn't need to be there at all).

I have only spent 3 minutes and 5 seconds thinking about this and maybe there's something I'm missing but it would be dramatically simpler than anything discussed so far.

EDIT: (another 20 seconds of thought) A big bonus here is that there's already the ability to have LSM policies which can apply to accessing block devices, so it'd be easy to restrict the set of things accessing the physical root (which would be a special case of general block device access, which really few things should have in general).

I don't think this approach can work with Btrfs subvolumes. Sometimes system roots are Btrfs subvolumes, not at the top of a block device. I know OpenSUSE (i.e. snapper) has such configurations. And what if the sysroot is not backed by a block device at all? For example, through network?

I believe there are many corner cases if we assume sysroot is always a root of a block device.

cgwalters · 2024-12-20T00:54:12Z

I don't think this approach can work with Btrfs subvolumes. Sometimes system roots are Btrfs subvolumes, not at the top of a block device.

Yes, a valid point; it is also the setup for Fedora btrfs (at least with Anaconda). The general way to handle this would be to emit a .mount unit that describes how to mount the sysroot.

Actually...we already should have a sysroot.mount from the initramfs, we can just copy it to the final root.

(This intersects a bit with containers/bootc#972 - for the btrfs and other cases we could attempt to re-synthesize the necessary mount information from kargs, but I think the most general fix is to honor however the root was set up in the initramfs, which would be sysroot.mount).

And what if the sysroot is not backed by a block device at all? For example, through network?

I don't know that there's a real use case for accessing the physical root in the network case though. Although it depends on what you mean by "network". iSCSI for example is still network, but is also a block device. NFS-as-root...I think is a bad idea. See also containers/bootc#898 which I think is better for those desiring diskless.

ruihe774 · 2024-12-20T12:24:33Z

I have compared several approaches to protect /sysroot, and find the approach used in this PR (hiding it inside a mount namespace) is relatively advantageous.

Use systemd config InaccessiblePaths=/sysroot
- Pros:
  - Do not require any change to existing code.
- Cons:
  - It's opt-in: users (or distro maintainers) need to add this config to every service that does not access /sysroot.
    - Giving that ostree is not very popular, it is little chance that upstream developers add this config.
  - InaccessiblePaths implicitly creates a private mount namespace, which may break some services.
Bind-mount /sysroot to /run/ostree/.private/sysroot, and unmount /sysroot
- Pros:
  - Do not need to use setns and a private mount namespace when operating on /sysroot
- Cons:
  - Need to use a LSM to prevent unrelated processes from accessing /run/ostree/.private
Hold /sysroot fd in a holder process, unmount it, and serve it through a socket
- Pros:
  - Do not pollute global mount namespace.
- Cons:
  - Need to use a LSM to protect the socket.
  - Need a holder process, long running or activated through socket.
  - Privileged processes may steal the fd through /proc.
    - Need to use systemd config ProtectProc=, which implicitly creates a private mount namespace and is prone to break services.
Hide /sysroot inside a private mount namespace at /run/ostree/.private/sysroot-ns
- Pros:
  - Multiple methods to protect it:
    - use a LSM to prevent unrelated processes from accessing /run/ostree/.private, or
    - use systemd config RestrictNamespaces=:
      - It is a system call filter and does not creates a private mount namespace, less possible to break services.
      - It is likely that for services do not use namespaces, the upstream developers have already hardened their services using this config.
- Cons:
  - The bind mount of /run/ostree/.private/sysroot-ns still pollute the global mount namespace.

cgwalters · 2024-12-20T13:26:39Z

Bind-mount /sysroot to /run/ostree/.private/sysroot, and unmount /sysroot

I would call this: "Move /sysroot to /run/ostree/.private/sysroot" and list its advantages/disadvantages as basically "Makes it somewhat less likely for processes to find and traverse it, otherwise same as status quo".

Re option 3:

Privileged processes may steal the fd through /proc.

Sure but that's really unlikely to happen accidentally, which is about half of what we're trying to improve. A historical issue we've had with the /sysroot mount is that backup tools and things like setfiles for selinux relabeling traverse it by default. But tools like that know not to traverse /run or /proc. And as far as malicious access, well you have to already be uid 0 to get to it; and yes that's where process sandboxing and LSMs come in, there's not much more we can do here.

src/switchroot/ostree-prepare-root.c

src/libostree/ostree-sysroot.c

src/libotutil/ot-fs-utils.c

src/ostree/ot-admin-builtin-nsenter.c

cgwalters · 2024-12-20T15:20:11Z

src/ostree/ot-builtin-admin.c

@@ -121,11 +123,6 @@ ostree_builtin_admin (int argc, char **argv, OstreeCommandInvocation *invocation
            }
        }

-      else if (g_str_equal (argv[in], "--"))


Can you motivate this change?

We need to support command patterns like this:

$ ostree admin nsenter -- sh -c "echo hello"

In this command, if -- is omitted, -c is processed by ostree and it fails.

So, nsenter need to support -- processing. If -- is processed here, nsenter cannot get the arguments after --.

Maybe deleting the logic here may break things. We need to add a OSTREE_BUILTIN_FLAG_ to specify whether a command needs to process -- by itself. I'll do that later.

src/switchroot/ostree-prepare-root.c

…ensure_writable

ruihe774 · 2024-12-20T18:56:13Z

In a big commit 96e2fb8 I try to avoid entering a mount namespace and keep using dir fd for sysroot and boot. The logic is somewhat complicated, especially when involved with remounting in _ostree_sysroot_ensure_writable. Now libostree only enters a mount namespace if OSTREE_ADMIN_BUILTIN_FLAG_ENTER_NS is specified. PTAL.

TODO:

Fix _ostree_sysroot_ensure_writable
Fix _ostree_sysroot_run_in_deployment
Fix ostree_sysroot_write_deployments_with_options
Fix full_system_sync

This reverts commit 1d4dc03.

ruihe774 · 2024-12-20T20:03:54Z

In a big commit 96e2fb8 I try to avoid entering a mount namespace and keep using dir fd for sysroot and boot. The logic is somewhat complicated, especially when involved with remounting in _ostree_sysroot_ensure_writable. Now libostree only enters a mount namespace if OSTREE_ADMIN_BUILTIN_FLAG_ENTER_NS is specified. PTAL.

Ok. Now it's not necessary to wrap rpm-ostree with nsenter anymore!

openshift-ci bot added the needs-ok-to-test label Dec 18, 2024

github-actions bot added the area/prepare-root Issue relates to ostree-prepare-root label Dec 18, 2024

ericcurtin reviewed Dec 18, 2024

View reviewed changes

src/libostree/ostree-impl-system-generator.c Show resolved Hide resolved

ruihe774 force-pushed the sysroot-invisible branch from feebf16 to f53b275 Compare December 18, 2024 15:05

ruihe774 force-pushed the sysroot-invisible branch from f53b275 to 1586d24 Compare December 18, 2024 15:55

cgwalters reviewed Dec 18, 2024

View reviewed changes

ruihe774 force-pushed the sysroot-invisible branch from 8b97cd2 to 160bf13 Compare December 18, 2024 19:02

ruihe774 force-pushed the sysroot-invisible branch 3 times, most recently from 7e427cc to 175a9e7 Compare December 19, 2024 08:24

ruihe774 marked this pull request as draft December 19, 2024 15:46

openshift-ci bot added the do-not-merge/work-in-progress label Dec 19, 2024

ruihe774 added 3 commits December 20, 2024 18:52

prepare-sysroot: Bind var under /run instead of inplace

1db98c0

prepare-root: Unmount temporary var mount after /var is mounted

bfb84a3

prepare-root: It's not necessary to make /var slave anymore

cb03e44

ruihe774 mentioned this pull request Dec 20, 2024

prepare-var: Do not remount stateroot var in-place, and unmount the temporary var mount after real /var is mounted #3363

Open

ruihe774 added 2 commits December 20, 2024 22:10

prepare-sysroot: Add sysroot.invisible

91dda71

remount: Skip remount /sysroot if invisible

412ec41

ruihe774 added 10 commits December 20, 2024 22:10

sysroot: Assert not in root mount namespace

83438a1

sysroot: Do not trust mount_namespace_in_use as it's per-thread

bf10b40

sysroot: Fix _ostree_sysroot_ensure_visible ns

fb8a0ce

prepare-root: Use sysroot.readonly=invisible

52d03d7

prepare-root: Add sysroot.protect as alias of sysroot.readonly

a5c64da

prepare-root: Make leftover /sysroot immutable

1d4dc03

prepare-root: Add some comments

6626f51

prepare-root: Fail if sysroot-ns already exists

3dc167e

sysroot: _ostree_sysroot_invisible & _ostree_in_root_mount_namespace

300d21c

sysroot: rework _ostree_sysroot_ensure_visible

646cc34

cgwalters reviewed Dec 20, 2024

View reviewed changes

ruihe774 added 2 commits December 21, 2024 02:43

sysroot: Do not require enter mount namespace for ensure_visible and …

96e2fb8

…ensure_writable

nsenter: Fix argument parsing

01de3ea

ruihe774 force-pushed the sysroot-invisible branch from cd06b32 to 01de3ea Compare December 20, 2024 18:47

ruihe774 added 4 commits December 21, 2024 03:05

prepare-root: Tidy code for root_upperdir and root_workdir

627b4f8

sysroot: Use glnx_fstatat_allow_noent, drop ot_path_exists

c7caee9

Revert "prepare-root: Make leftover /sysroot immutable"

124035b

This reverts commit 1d4dc03.

sysroot: Fix ostree in ostree nsenter

2bd95ea

ruihe774 added 4 commits December 21, 2024 14:14

sysroot: Use ostree/repo to check visibility

04d36fa

sysroot: Tidy code for _ostree_sysroot_ensure_writable

2960f0a

finalize-staged: Add a comment

e808a69

sysroot: Fix _ostree_sysroot_ensure_writable

b3851de

This comment was marked as resolved.

Sign in to view

ruihe774 added 4 commits December 23, 2024 23:37

sysroot-deploy: Fix _ostree_sysroot_run_in_deployment

e8bd540

sysroot-deploy: Fix ostree_sysroot_write_deployments_with_options

cc10e37

sysroot-deploy: Fix full_system_sync

979ac47

sysroot-deploy: Fix swap_bootloader

594d873

ruihe774 force-pushed the sysroot-invisible branch from 082605e to 594d873 Compare December 23, 2024 19:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Hide /sysroot in a private mount namespace #3358

[RFC] Hide /sysroot in a private mount namespace #3358

ruihe774 commented Dec 18, 2024 •

edited

Loading

openshift-ci bot commented Dec 18, 2024

ruihe774 commented Dec 18, 2024 •

edited

Loading

cgwalters left a comment

cgwalters commented Dec 18, 2024

ruihe774 commented Dec 19, 2024 •

edited

Loading

cgwalters commented Dec 19, 2024

ruihe774 commented Dec 19, 2024

cgwalters commented Dec 19, 2024 •

edited

Loading

cgwalters commented Dec 19, 2024

ruihe774 commented Dec 19, 2024 •

edited

Loading

cgwalters commented Dec 20, 2024

ruihe774 commented Dec 20, 2024

cgwalters commented Dec 20, 2024

cgwalters Dec 20, 2024

ruihe774 Dec 20, 2024

ruihe774 commented Dec 20, 2024 •

edited

Loading

ruihe774 commented Dec 20, 2024

This comment was marked as resolved.

[RFC] Hide /sysroot in a private mount namespace #3358

Are you sure you want to change the base?

[RFC] Hide /sysroot in a private mount namespace #3358

Conversation

ruihe774 commented Dec 18, 2024 • edited Loading

openshift-ci bot commented Dec 18, 2024

ruihe774 commented Dec 18, 2024 • edited Loading

cgwalters left a comment

Choose a reason for hiding this comment

cgwalters commented Dec 18, 2024

ruihe774 commented Dec 19, 2024 • edited Loading

cgwalters commented Dec 19, 2024

ruihe774 commented Dec 19, 2024

cgwalters commented Dec 19, 2024 • edited Loading

cgwalters commented Dec 19, 2024

ruihe774 commented Dec 19, 2024 • edited Loading

cgwalters commented Dec 20, 2024

ruihe774 commented Dec 20, 2024

cgwalters commented Dec 20, 2024

cgwalters Dec 20, 2024

Choose a reason for hiding this comment

ruihe774 Dec 20, 2024

Choose a reason for hiding this comment

ruihe774 commented Dec 20, 2024 • edited Loading

ruihe774 commented Dec 20, 2024

This comment was marked as resolved.

ruihe774 commented Dec 18, 2024 •

edited

Loading

ruihe774 commented Dec 18, 2024 •

edited

Loading

ruihe774 commented Dec 19, 2024 •

edited

Loading

cgwalters commented Dec 19, 2024 •

edited

Loading

ruihe774 commented Dec 19, 2024 •

edited

Loading

ruihe774 commented Dec 20, 2024 •

edited

Loading