Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Hide /sysroot in a private mount namespace #3358

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

ruihe774
Copy link
Contributor

@ruihe774 ruihe774 commented Dec 18, 2024

Related: #3211

This PR implements a new option sysroot.invisible in prepare-root.conf to hide /sysroot in a private mount namespace /run/ostree/.private/sysroot-ns to prevent the rest of the system from accessing it, while ostree admin commands can still operate on it inside a mount namespace.

This PR also add a new admin command "nsenter" that runs program in the mount namespace where /sysroot is present. This can wrap tools that currently do not support invisible sysroot. As an example, below is an drop-in override for rpm-ostreed.service:

[Unit]
ConditionPathExists=
ConditionKernelCommandLine=ostree

[Service]
ExecStart=
ExecStart=+/usr/bin/ostree admin nsenter --exec /usr/bin/rpm-ostree start-daemon

I know this is a big change, and there may be corner cases that I haven't considered. However, I have tested it on my local machine, and basic functionalities (booting, deploying, integration with rpm-ostree) work fine. And my code is ready for review.

Copy link

openshift-ci bot commented Dec 18, 2024

Hi @ruihe774. Thanks for your PR.

I'm waiting for a ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-actions github-actions bot added the area/prepare-root Issue relates to ostree-prepare-root label Dec 18, 2024
@ruihe774
Copy link
Contributor Author

ruihe774 commented Dec 18, 2024

The test failed because only very new glibc has wrappers for open_tree and move_mount. I'm going to switch to use raw syscalls instead.

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working on this!

One struggle we have right now is that the prepare-root code is not unit or integration tested well; I wouldn't block this on that, but adding more features there compounds that issue. I may find some time to work on that...


On a different note, in theory this code could be implemented as a service that runs After=ostree-prepare-root.service right? I wonder if it'd be helpful for us to structure it that way. But eh, just a thought.

src/switchroot/ostree-prepare-root.c Outdated Show resolved Hide resolved
src/switchroot/ostree-prepare-root.c Show resolved Hide resolved
src/switchroot/ostree-prepare-root.c Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
@cgwalters
Copy link
Member

The test failed because only very new glibc has wrappers for open_tree and move_mount. I'm going to switch to use raw syscalls instead.

I'm also OK just not having this feature work on older glibc for now.

(Though it's also of course worth noting that in containers/bootc#919 I used the super nice rustix bindings for this which worked fine on C9S even though glibc doesn't have the headers there...and this touches on the ongoing thought of how we might start to use more Rust in libostree in general...or maybe put some of this in bootc to start...)

@ruihe774
Copy link
Contributor Author

ruihe774 commented Dec 19, 2024

The test failed because only very new glibc has wrappers for open_tree and move_mount. I'm going to switch to use raw syscalls instead.

I'm also OK just not having this feature work on older glibc for now.

(Though it's also of course worth noting that in containers/bootc#919 I used the super nice rustix bindings for this which worked fine on C9S even though glibc doesn't have the headers there...and this touches on the ongoing thought of how we might start to use more Rust in libostree in general...or maybe put some of this in bootc to start...)

I think in first step we can have individual binaries in switchroot rewritten in Rust. Currently we have to maintain two ostree-prepare-root, one dynamically linked and one static. Using Rust we can easily merge them and always generate static builds.

I'll try to make some POC later if I have spare time.

@ruihe774 ruihe774 force-pushed the sysroot-invisible branch 3 times, most recently from 7e427cc to 175a9e7 Compare December 19, 2024 08:24
@cgwalters
Copy link
Member

I think in first step we can have individual binaries in switchroot rewritten in Rust. Currently we have to maintain two ostree-prepare-root, one dynamically linked and one static. Using Rust we can easily merge them and always generate static builds.

We do have a composefs-rs project going on but honestly...I would like to try to remove the static ostree-prepare-root in favor of UKIs basically.

@ruihe774
Copy link
Contributor Author

I have read some docs about socket activation, fd store, and unix socket. I find it's possible to write a service that uploads the fd obtained by open_tree to systemd and serves it over unix socket on demand using socket activation. Meanwhile, SELinux restrictions can also apply to the path of unix socket. I now consider this a better solution and it does not pollute the global mountinfo.

We can keep current part that hides /sysroot inside sysroot-ns in ostree-prepare-root, and write a service outside initramfs to open_tree, upload, and unmount it. Socket activation can be used to start the service on-daemon to serve the fd. Clients using libostree can obtain the fd through socket. The current part in libostree can be kept as a fallback in case the holder service is not functioning.

However, the problem is that I'm not familiar with systemd and C socket programming. It is difficult for me to implement the aforementioned. I wonder if you could give me some help. Thank you.

@ruihe774 ruihe774 marked this pull request as draft December 19, 2024 15:46
@cgwalters
Copy link
Member

cgwalters commented Dec 19, 2024

A totally different approach:

Today systemd when using volatile root sets up a special symlink that points to the root block device.

On Linux, it's possible to mount a block device multiple times. So instead of having /sysroot mounted at all, we just create that symlink to the real backing block device (which would be generally useful) and then anything which wants to operate on the physical root (including ostree/bootc) just...mounts it on their own, wherever they want (which may be /sysroot, which we'd leave as an empty directory suitable for this purpose, but again it doesn't need to be there at all).

I have only spent 3 minutes and 5 seconds thinking about this and maybe there's something I'm missing but it would be dramatically simpler than anything discussed so far.

EDIT: (another 20 seconds of thought) A big bonus here is that there's already the ability to have LSM policies which can apply to accessing block devices, so it'd be easy to restrict the set of things accessing the physical root (which would be a special case of general block device access, which really few things should have in general).

@cgwalters
Copy link
Member

And to be clear I think we should create that symlink unconditionally (should be a separate PR).

Then this PR is just about whether or not to leave /sysroot mounted effectively.

@ruihe774
Copy link
Contributor Author

ruihe774 commented Dec 19, 2024

A totally different approach:

Today systemd when using volatile root sets up a special symlink that points to the root block device.

On Linux, it's possible to mount a block device multiple times. So instead of having /sysroot mounted at all, we just create that symlink to the real backing block device (which would be generally useful) and then anything which wants to operate on the physical root (including ostree/bootc) just...mounts it on their own, wherever they want (which may be /sysroot, which we'd leave as an empty directory suitable for this purpose, but again it doesn't need to be there at all).

I have only spent 3 minutes and 5 seconds thinking about this and maybe there's something I'm missing but it would be dramatically simpler than anything discussed so far.

EDIT: (another 20 seconds of thought) A big bonus here is that there's already the ability to have LSM policies which can apply to accessing block devices, so it'd be easy to restrict the set of things accessing the physical root (which would be a special case of general block device access, which really few things should have in general).

I don't think this approach can work with Btrfs subvolumes. Sometimes system roots are Btrfs subvolumes, not at the top of a block device. I know OpenSUSE (i.e. snapper) has such configurations. And what if the sysroot is not backed by a block device at all? For example, through network?

I believe there are many corner cases if we assume sysroot is always a root of a block device.

@cgwalters
Copy link
Member

I don't think this approach can work with Btrfs subvolumes. Sometimes system roots are Btrfs subvolumes, not at the top of a block device.

Yes, a valid point; it is also the setup for Fedora btrfs (at least with Anaconda). The general way to handle this would be to emit a .mount unit that describes how to mount the sysroot.

Actually...we already should have a sysroot.mount from the initramfs, we can just copy it to the final root.

(This intersects a bit with containers/bootc#972 - for the btrfs and other cases we could attempt to re-synthesize the necessary mount information from kargs, but I think the most general fix is to honor however the root was set up in the initramfs, which would be sysroot.mount).

And what if the sysroot is not backed by a block device at all? For example, through network?

I don't know that there's a real use case for accessing the physical root in the network case though. Although it depends on what you mean by "network". iSCSI for example is still network, but is also a block device. NFS-as-root...I think is a bad idea. See also containers/bootc#898 which I think is better for those desiring diskless.

@ruihe774
Copy link
Contributor Author

I have compared several approaches to protect /sysroot, and find the approach used in this PR (hiding it inside a mount namespace) is relatively advantageous.

  1. Use systemd config InaccessiblePaths=/sysroot
    • Pros:
      • Do not require any change to existing code.
    • Cons:
      • It's opt-in: users (or distro maintainers) need to add this config to every service that does not access /sysroot.
        • Giving that ostree is not very popular, it is little chance that upstream developers add this config.
      • InaccessiblePaths implicitly creates a private mount namespace, which may break some services.
  2. Bind-mount /sysroot to /run/ostree/.private/sysroot, and unmount /sysroot
    • Pros:
      • Do not need to use setns and a private mount namespace when operating on /sysroot
    • Cons:
      • Need to use a LSM to prevent unrelated processes from accessing /run/ostree/.private
  3. Hold /sysroot fd in a holder process, unmount it, and serve it through a socket
    • Pros:
      • Do not pollute global mount namespace.
    • Cons:
      • Need to use a LSM to protect the socket.
      • Need a holder process, long running or activated through socket.
      • Privileged processes may steal the fd through /proc.
        • Need to use systemd config ProtectProc=, which implicitly creates a private mount namespace and is prone to break services.
  4. Hide /sysroot inside a private mount namespace at /run/ostree/.private/sysroot-ns
    • Pros:
      • Multiple methods to protect it:
        • use a LSM to prevent unrelated processes from accessing /run/ostree/.private, or
        • use systemd config RestrictNamespaces=:
          • It is a system call filter and does not creates a private mount namespace, less possible to break services.
          • It is likely that for services do not use namespaces, the upstream developers have already hardened their services using this config.
    • Cons:
      • The bind mount of /run/ostree/.private/sysroot-ns still pollute the global mount namespace.

@cgwalters
Copy link
Member

Bind-mount /sysroot to /run/ostree/.private/sysroot, and unmount /sysroot

I would call this: "Move /sysroot to /run/ostree/.private/sysroot" and list its advantages/disadvantages as basically "Makes it somewhat less likely for processes to find and traverse it, otherwise same as status quo".

Re option 3:

Privileged processes may steal the fd through /proc.

Sure but that's really unlikely to happen accidentally, which is about half of what we're trying to improve. A historical issue we've had with the /sysroot mount is that backup tools and things like setfiles for selinux relabeling traverse it by default. But tools like that know not to traverse /run or /proc. And as far as malicious access, well you have to already be uid 0 to get to it; and yes that's where process sandboxing and LSMs come in, there's not much more we can do here.

src/switchroot/ostree-prepare-root.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libostree/ostree-sysroot.c Outdated Show resolved Hide resolved
src/libotutil/ot-fs-utils.c Outdated Show resolved Hide resolved
src/ostree/ot-admin-builtin-nsenter.c Show resolved Hide resolved
src/ostree/ot-admin-builtin-nsenter.c Show resolved Hide resolved
@@ -121,11 +123,6 @@ ostree_builtin_admin (int argc, char **argv, OstreeCommandInvocation *invocation
}
}

else if (g_str_equal (argv[in], "--"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you motivate this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support command patterns like this:

$ ostree admin nsenter -- sh -c "echo hello"

In this command, if -- is omitted, -c is processed by ostree and it fails.

So, nsenter need to support -- processing. If -- is processed here, nsenter cannot get the arguments after --.

Maybe deleting the logic here may break things. We need to add a OSTREE_BUILTIN_FLAG_ to specify whether a command needs to process -- by itself. I'll do that later.

src/switchroot/ostree-prepare-root.c Outdated Show resolved Hide resolved
@ruihe774
Copy link
Contributor Author

ruihe774 commented Dec 20, 2024

In a big commit 96e2fb8 I try to avoid entering a mount namespace and keep using dir fd for sysroot and boot. The logic is somewhat complicated, especially when involved with remounting in _ostree_sysroot_ensure_writable. Now libostree only enters a mount namespace if OSTREE_ADMIN_BUILTIN_FLAG_ENTER_NS is specified. PTAL.

TODO:

  • Fix _ostree_sysroot_ensure_writable
  • Fix _ostree_sysroot_run_in_deployment
  • Fix ostree_sysroot_write_deployments_with_options
  • Fix full_system_sync

@ruihe774
Copy link
Contributor Author

In a big commit 96e2fb8 I try to avoid entering a mount namespace and keep using dir fd for sysroot and boot. The logic is somewhat complicated, especially when involved with remounting in _ostree_sysroot_ensure_writable. Now libostree only enters a mount namespace if OSTREE_ADMIN_BUILTIN_FLAG_ENTER_NS is specified. PTAL.

Ok. Now it's not necessary to wrap rpm-ostree with nsenter anymore!

@ruihe774

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants