-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
finalize-staged: Ensure /boot and /sysroot automounts don't expire #2544
Conversation
This passed the test suite for me locally but it's otherwise totally untested. |
Hmm, that looks likely to be related to this. |
|
de1a8bc
to
d1e8d2a
Compare
I rewrote this to use |
This is because I changed to to use |
d1e8d2a
to
0bfeb40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks generally good to me. Did you test this locally now?
At least our CI will cover this pretty well, but we just need to check any failures there versus the random flakes.
* FIXME: This overlaps with the mount namespace handling in | ||
* ostree_admin_option_context_parse. That should be factored out. | ||
*/ | ||
if (unshare (CLONE_NEWNS) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh, right. I think it would be cleaner is to have systemd do this for us via e.g.
MountFlags=slave
in the unit. See e.g.
https://github.com/coreos/rpm-ostree/blob/baa2eba2d77647204be2baff2b385e16220e0fac/src/daemon/rpm-ostreed.service.in#L11
(In rpm-ostree we always do everything via the daemon; unfortunately that's not true of ostree today, but in this case we know we're running in systemd and can use its features)
(Although I guess this direction conflicts with people who are trying to do ostree without systemd, but...I think the onus is going to be increasingly on them to at least parse a subset of unit files)
But, OK as is too - i.e. we can merge this and do followups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I'm following this right, because the unit has ProtectHome
and ReadOnlyPaths
, then systemd is already going to setup a mount namespace and there's no reason to create another one. Is that right? I sorta feel like you'd want to have code to check whether the process was in its own namespace or not.
Another thing I thought of while looking at this section. The unit talks about how only ProtectHome
can be used, but then the sysroot goes and remounts /boot
and /sysroot
readwrite. Furthermore, it talks about how it needs to remove /var/.updated
, but isn't that within the stateroot /var
? Oh, I see that there's some handling for /var
as a separate mount. Anways, this all looks like it could be handles with:
ProtectSystem=strict
ReadWritePaths=/var
with the mount handling. Maybe another day, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I'm following this right, because the unit has ProtectHome and ReadOnlyPaths, then systemd is already going to setup a mount namespace and there's no reason to create another one.
Ah yep, those already imply MountFlags=slave
indeed.
Is that right? I sorta feel like you'd want to have code to check whether the process was in its own namespace or not.
I think the simplest is to just error out if we're not running under systemd. The pattern I've used elsewhere is to check the INVOCATION_ID env var.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I've definitely called ostree admin finalize-staged
manually when debugging stuff. I could fake out the INVOCATION_ID
I guess, but maybe more elegant to do:
- if in systemd unit, don't unshare
- otherwise unshare or die trying
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fine by me.
But, one can also run systemctl stop ostree-finalize-staged.service
too - and if we bailed out if INVOCATION_ID
wasn't set, you'd probably know to do so.
I think the LGTM issue should be fixed by #2545. |
Yeah, thanks for the kola tests. I kinda wish I could get that environment going locally. Let me try to take this for a spin in my problematic VM with the |
0bfeb40
to
aad2d75
Compare
No changes in that last force push. Just wanted to rebase on the LGTM fix. |
This pull request introduces 1 alert when merging aad2d75 into 12cafbc - view on LGTM.com new alerts:
|
*/ | ||
if (unshare (CLONE_NEWNS) < 0) | ||
return glnx_throw_errno_prefix (error, "setting up mount namespace: unshare(CLONE_NEWNS)"); | ||
ostree_sysroot_set_mount_namespace_in_use (sysroot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this doesn't like that the sysroot is already loaded:
Feb 17 13:11:58 endless ostree[2472]: ostree_sysroot_set_mount_namespace_in_use: assertion 'self->loadstate < OSTREE_SYSROOT_LOAD_STATE_LOADED' failed
Since the pre-SIGTERM part just needs ostree_sysroot_get_fd
, I guess what would help is to only initialize but not load. I think that would need a new OSTREE_ADMIN_BUILTIN_FLAG_NO_LOAD
or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it helps we can move this entire thing into a ostree_cmd__private__ ()->ostree_finalize_staged
that can use all the internal APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good reference for this is ostree-system-generator.c
- the binary just calls into a private API from the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but the complication is in loading the sysroot appropriately, which currently always happens in the builtins (and ostree-system-generator doesn't do). I think there are advantages to moving the whole thing internal, but the downside is repeating all of the setup code in src/ostree
.
Fun times. At least in my VM, when systemd sets up a mount namespace (due to either of |
😢 So...a more complex solution is something like this:
The (There may be a bit more elegant way to do IPC between these two processes; this relates to https://lists.freedesktop.org/archives/systemd-devel/2021-February/046112.html ) Or, we could just drop the EDIT: To explain this more, notice the additional¹ ¹ 😉 |
This is pretty clever, but I don't think it would work. Once the initial process in the root namespace exits, then there won't be anything keeping I think the only way to do this is to handle the mount namespace in process as I was doing on my last attempt while ensuring it's run in the root namespace. Then the initial BTW, I now recreated this autofs bug as simple as possible in a rawhide VM and even made it use automount(8) rather than systemd to ensure it's really an autofs bug and not a systemd automount daemon bug. I'm going to file the kernel bug after I've finished gathering all the data. |
In my proposal, both processes stay active until they are both complete. |
Ah, I missed that. That would work and removes the complication of handling all the mounts manually. I'll try to give that a shot (although I probably have to switch on to other tasks for a while). I filed https://bugzilla.redhat.com/show_bug.cgi?id=2056090. |
Can't a service unit only have a single top-level process though?
And even in the
Hmm, I think another simpler approach would be having the fd holding happen in a separate unit without a mount namespace which runs |
Yeah, also SGTM. |
FWIW, we decided to punt on this and not use staged deployments when |
carbonOS is using a boot automount, but I don't necessarily see a reason to not just set Perhaps it should be documented somewhere that staged deployments are incompatible with /boot timing out (maybe in the API doc for creating a staged deployment?) until this fix lands? |
I moved away from this because we were going to EOL the product that this affects, but now that decision has been reversed. My plan here is:
Seem reasonable? I think another way would be to just add a separate |
Two systemd units as jlebon originally suggested with two distinct CLI verbs seems conceptually cleanest to me, it's just a bit more verbose to type out. You're right that there are a ton of options that will cause a mount namespace to be created, but we can just basically not use any unit options in the |
I'm going with no options in the unit and the explicit I'd like to add a test in |
A cool thing IMO about https://github.com/coreos/coreos-assembler is that the entire build and test tooling is one (large) container image you can run anywhere you like. Though it also requires doing a build in the corresponding userspace (e.g. f36). But it's basically using https://coreos.github.io/coreos-assembler/working/#using-overrides to drop in the new ostree binaries, plus |
aad2d75
to
6101724
Compare
That thing is really cool. I'm sure I wasn't driving it very well, but that was way faster than how I would have normally done something like that. I wish we had something like that for Endless. |
I think this is ready to review again. I ended up not touching the The added kola test was extremely helpful to make sure this was actually doing what I thought it would do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me as is, we can address anything else via followup commits too.
Thanks so much for working on this!
@@ -50,13 +65,57 @@ ot_admin_builtin_finalize_staged (int argc, char **argv, OstreeCommandInvocation | |||
|
|||
g_autoptr(GOptionContext) context = g_option_context_new (""); | |||
g_autoptr(OstreeSysroot) sysroot = NULL; | |||
|
|||
/* First parse the args without loading the sysroot to see what options are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with this but...I think we could also bypass the whole thing and just do
bool opt_hold = argc > 2 && strcmp (argv[1], "--hold") == 0;
or so? Or use an environment variable.
Dunno. I don't feel really strongly. My main concern is that someone later comes by and makes some sort of change that isn't ready for ostree_admin_option_context_parse
being invoked twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I'm personally not a fan of adding adhoc option handing. What if the first option is --verbose
or something else?
I think I can go back to the refactoring route and split out the sysroot opening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with the code as is too.
Though...hmm, wouldn't it work to use plain old g_option_context_parse()
at least? That would avoid any issues with invoking ostree_admin_option_context_parse
twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would except that you'd have to duplicate the handling of the --sysroot
option at least. I think the refactoring I added to initialize but not load the sysroot should work well. See the new prep commit I added.
It can be useful to parse the options and initialize the sysroot without actually loading it until later. Factor out the sysroot loading to a new `ostree_admin_sysroot_load` and add a new `OSTREE_ADMIN_BUILTIN_FLAG_NO_LOAD` flag to accommodate this.
6101724
to
531597d
Compare
If `/boot` is an automount, then the unit will be stopped as soon as the automount expires. That's would defeat the purpose of using systemd to delay finalizing the deployment until shutdown. This is not uncommon as `systemd-gpt-auto-generator` will create an automount unit for `/boot` when it's the EFI System Partition and there's no fstab entry. To ensure that systemd doesn't stop the service early when the `/boot` automount expires, introduce a new unit that holds `/boot` open until it's sent `SIGTERM`. This uses a new `--hold` option for `finalize-staged` that loads but doesn't lock the sysroot. A separate unit is used since we want the process to remain active throughout the finalization run in `ExecStop`. That wouldn't work if it was specified in `ExecStart` in the same unit since it would be killed before the `ExecStop` action was run. Fixes: ostreedev#2543
531597d
to
f3db79e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks so much for doing this!
gboolean running = TRUE; | ||
g_unix_signal_add (SIGTERM, sigterm_cb, &running); | ||
g_print ("Waiting for SIGTERM\n"); | ||
while (running) | ||
g_main_context_iteration (NULL, TRUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this as is, but in general I think it's even cleaner where possible to not install a SIGTERM
handler and let the kernel simply kill the process. That way there's one fewer context switch at shutdown time.
IOW, we could just do:
while (true)
g_main_context_iteration (NULL, TRUE);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very true. I guess I like the SIGTERM
handler so that the service exits gracefully instead of showing as failed, although in this case you'd probably never look at the status of the service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would show as failed?
[root@cosa-devsh ~]# cat /etc/systemd/system/testunit.service
[Service]
ExecStart=sleep infinity
[root@cosa-devsh ~]# systemctl start testunit
[root@cosa-devsh ~]# systemctl status testunit
● testunit.service
Loaded: loaded (/etc/systemd/system/testunit.service; static)
Active: active (running) since Tue 2022-08-30 20:21:23 UTC; 1s ago
Main PID: 1792 (sleep)
Tasks: 1 (limit: 1042)
Memory: 284.0K
CPU: 1ms
CGroup: /system.slice/testunit.service
└─ 1792 sleep infinity
Aug 30 20:21:23 cosa-devsh systemd[1]: Started testunit.service.
[root@cosa-devsh ~]# systemctl stop testunit
[root@cosa-devsh ~]# systemctl status testunit
○ testunit.service
Loaded: loaded (/etc/systemd/system/testunit.service; static)
Active: inactive (dead)
Aug 30 20:21:27 cosa-devsh systemd[1]: Stopping testunit.service...
Aug 30 20:21:27 cosa-devsh systemd[1]: testunit.service: Deactivated successfully.
Aug 30 20:21:27 cosa-devsh systemd[1]: Stopped testunit.service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR in #2704
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I was thinking that systemd would see the non-0 exit status and interpret it as failed. But it makes sense since it's the parent and can process the status appropriately.
Followup from discussion in ostreedev#2544 (comment) This is more efficient; no need to have the kernel context switch us in at shutdown time just so we can turn around and call `exit()`.
Followup from discussion in ostreedev#2544 (comment) This is more efficient; no need to have the kernel context switch us in at shutdown time just so we can turn around and call `exit()`.
Followup from discussion in ostreedev/ostree#2544 (comment) This is more efficient; no need to have the kernel context switch us in at shutdown time just so we can turn around and call `exit()`. (cherry picked from commit 683e4ef)
This reverts commit a19821a. OSTree has been fixed to support this use case by keeping `/boot` open in the root namespace until the staged deployment completes finalization. See ostreedev/ostree#2544 for details. https://phabricator.endlessm.com/T33775
If
/boot
or/sysroot
are automounts, then the unit will be stoppedas soon as the automounts expire. That's would defeat the purpose of
using systemd to delay finalizing the deployment until shutdown. This is
not uncommon as
systemd-gpt-auto-generator
will create an automountunit for
/boot
when it's the EFI System Partition and there's no fstabentry.
Instead of relying on systemd to run the command via
ExecStop
at theappropriate time, have
finalize-staged
open/boot
and/sysroot
andthen block on
SIGTERM
. Having the directories open will prevent theautomounts from expiring, and then we presume that systemd will send
SIGTERM
when it's time for the service to stop. Finalizing thedeployment still happens when the service is stopped. The difference is
that the process is already running.
In order to keep from blocking legitimate sysroot activity prior to
shutdown, the sysroot lock is only taken after the signal has been
received. Similarly, the sysroot is reloaded to ensure the state of the
deployments is current.
Fixes: #2543