unmounting/EINVAL flake #18831
Comments
This is likely the result of containers/storage#1607, cc @mtrmac. Also, this change is causing the podman upgrade tests to fail.
As I recall, making the storage private was because of leaked mounts when we were using Devicemapper. Not sure if this still applies, but that was a huge problem back in the early Docker days.
Yes. As I said there, I don’t really know what I’m doing and I wouldn’t be too surprised if ignoring unmount failures were intentional. But then again, there are multiple layers of metadata, reference counting, and extra checks of the tracked state vs. actual state (like https://github.com/mtrmac/storage/blob/da0df59a00e0a2cd2880c7cc82322989834e7bf4/drivers/counter.go#L54 ), so if we get here and are still trying to unmount something that isn’t mounted at all, that seems like a failure worthy of some attention. By someone who understands things better than I do, if possible…
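To make the tracked-state-vs-actual-state idea concrete, here is a minimal, hypothetical Go sketch (the type, field layout, and mountinfo check are illustrative inventions, not the c/storage counter.go code): a reference count records how many callers believe a path is mounted, and it is reconciled against what the kernel actually reports before being trusted.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"sync"
)

// refCounter tracks how many callers believe each path is mounted.
// This mirrors the general idea behind drivers/counter.go, not its code.
type refCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// isMounted consults the kernel's view by scanning /proc/self/mountinfo.
func isMounted(path string) bool {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return false
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Field 5 (index 4) of each mountinfo line is the mount point.
		if len(fields) > 4 && fields[4] == path {
			return true
		}
	}
	return false
}

// Decrement lowers the tracked count, but first reconciles it with reality:
// if the kernel says the path is not mounted at all, the stale count is reset.
// This is the same flavor of check as the counter.go line linked above.
func (c *refCounter) Decrement(path string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !isMounted(path) {
		c.counts[path] = 0
		return 0
	}
	if c.counts[path] > 0 {
		c.counts[path]--
	}
	return c.counts[path]
}

func main() {
	c := &refCounter{counts: map[string]int{"/var/lib/containers/storage/overlay": 1}}
	fmt.Println("remaining references:", c.Decrement("/var/lib/containers/storage/overlay"))
}
```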
(Just to be explicit, that should be possible to turn off with
This is now triggering in main (debian root).
Yes, I found that option: 68183b0
Is anyone working on fixing this?
for containers#18514: if we get a timeout in teardown(), run and show the output of podman system locks.
for containers#18831: if we hit unmount/EINVAL, nothing will ever work again, so signal all future tests to skip.

Signed-off-by: Ed Santiago <[email protected]>
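As a rough illustration of the "signal all future tests to skip" part of that commit, here is a hypothetical bash sketch, assuming a bats-style suite; the sentinel path and function names are invented and are not the actual helpers.bash code:

```bash
# Hypothetical sketch of a "poison pill" for the test suite: once the
# unmount/EINVAL flake is seen, drop a sentinel file so every later test
# bails out early instead of producing hundreds of misleading failures.

SENTINEL="${BATS_SUITE_TMPDIR:-/tmp}/storage_is_hosed"

mark_storage_hosed() {
    # Call this when test output matches "unmounting ... invalid argument".
    echo "unmount/EINVAL hit earlier; storage is unusable, skipping" >"$SENTINEL"
}

skip_if_storage_hosed() {
    # Call this at the top of setup() so later tests skip quickly.
    if [[ -e $SENTINEL ]]; then
        skip "$(<"$SENTINEL")"    # bats skip
    fi
}
```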
Ping! Hi everyone, this is still hitting us in CI, and in real PRs. Most recent was last night, in a PR that (if I git correctly) includes #18999 (vendor c/*).
A friendly reminder that this issue had no activity for 30 days.
This is still hitting us hard.
WOW! Every single
I just got this on my laptop, rootless non-remote, after ^C'ing from a test loop:
Unlike this issue, a subsequent
Haven't seen this one in the last 4-5 days, but now all of a sudden:
(in f38 remote root). To recap:
Still remote-only AFAICT.
I think:

- The original variant is pre-containers/storage#1607.
- The second is after containers/storage#1607: unmount failure is reported.
- The third is after containers/storage#1687: unmount failure is ignored again, but deleting is now a hard failure.

AFAICT none of this tinkering has substantially improved our understanding of the actual cause.
Seen just now in int f38 rootless non-remote:
Not quite the same -- this is
Hi, I know everyone's busy and that this is a hard one, but please don't forget about it.
Seen in: int/sys podman/remote fedora-37/fedora-38/fedora-38-aarch64/rawhide root/rootless host boltdb/sqlite
Here's a variant that I don't recognize (f37 remote root):
Curiously, this does not hose all subsequent tests. The error persists forevermore, but only in a place where it doesn't really matter. Subsequent tests actually pass.
Just seen in debian 13 but with yet another surprise symptom:
That looks suspiciously like #17042 ("catatonit et al not found", another hard-to-track-down flake). And, blam, everything after that is hosed.
Followup: possibly fixed by @giuseppe in containers/storage#1722.
Sigh..... no such luck.
Was the change already vendored? I will take another look, maybe it is a different instance of the same problem and needs a similar fix.
Yes, this is #17831, my pet PR on which I churn CI. I believe I correctly vendored c/storage, but it's always possible I did something wrong. PS: isn't it really, really late where you are??
I double-checked and think you vendored it correctly. But I am sure @giuseppe has picked up the right scent - there may be more such cases scattered in c/storage (or libpod?).
The good news is that the occurrence I looked at yesterday doesn't appear in the log, and that was an easy one since I managed to reproduce it every time in a VM. I am looking at the new error message right now.
I think the issue could be fixed with something like:

```diff
diff --git a/drivers/overlay/overlay.go b/drivers/overlay/overlay.go
index 0f6d74021..f6f0932e0 100644
--- a/drivers/overlay/overlay.go
+++ b/drivers/overlay/overlay.go
@@ -1891,12 +1891,7 @@ func (d *Driver) Put(id string) error {
 		}
 	}
-	if err := unix.Rmdir(mountpoint); err != nil && !os.IsNotExist(err) {
-		logrus.Debugf("Failed to remove mountpoint %s overlay: %s - %v", id, mountpoint, err)
-		return fmt.Errorf("removing mount point %q: %w", mountpoint, err)
-	}
-
-	return nil
+	return system.EnsureRemoveAll(mountpoint)
 }
```

but before I open a PR, I'd like to understand how we even get into that situation. @edsantiago do you think we could collect some more information with:

```diff
diff --git a/test/system/helpers.bash b/test/system/helpers.bash
index 3fcd69a60..f7dad187e 100644
--- a/test/system/helpers.bash
+++ b/test/system/helpers.bash
@@ -150,10 +150,13 @@ function basic_setup() {
             echo "# setup(): removing stray external container $1 ($2)" >&3
             run_podman '?' rm -f $1
             if [[ $status -ne 0 ]]; then
+                run_podman '?' --log-level debug rm -f $1
                 echo "# [setup] $_LOG_PROMPT podman rm -f $1" >&3
                 for errline in "${lines[@]}"; do
                     echo "# $errline" >&3
                 done
+                echo "# MOUNTINFO"
+                cat /proc/self/mountinfo | while read i; do echo "# $i"; done
                 # FIXME FIXME FIXME: temporary hack for #18831. If we see the
                 # unmount/EINVAL flake, nothing will ever work again.
                 if [[ $output =~ unmounting.*invalid ]]; then
```

I've opened a test PR based on yours: #20183. I'll let it run for a while and see if I am lucky enough to hit it.
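For context on why swapping the bare unix.Rmdir for system.EnsureRemoveAll in Put() is more forgiving: the general technique such a helper embodies is to retry removal and lazily unmount whatever is still holding the directory busy. The sketch below is written from memory with an invented function name; it illustrates the idea, not the c/storage implementation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// ensureRemoveAllSketch keeps trying to remove the tree and, when removal
// fails with EBUSY because something is still mounted inside it, lazily
// detaches that mount and retries.
func ensureRemoveAllSketch(dir string) error {
	for attempt := 0; attempt < 50; attempt++ {
		err := os.RemoveAll(dir)
		if err == nil || os.IsNotExist(err) {
			return nil
		}
		var pe *os.PathError
		if !errors.As(err, &pe) || !errors.Is(pe.Err, unix.EBUSY) {
			return err
		}
		// A mount is in the way; detach it lazily. EINVAL here means
		// "not a mount point (anymore)", which is fine to ignore.
		if uerr := unix.Unmount(pe.Path, unix.MNT_DETACH); uerr != nil && !errors.Is(uerr, unix.EINVAL) {
			return fmt.Errorf("unmounting %s: %w", pe.Path, uerr)
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("could not remove %s after repeated attempts", dir)
}

func main() {
	if err := ensureRemoveAllSketch("/tmp/example-layer-mountpoint"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```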
Move the execution of RecordWrite() before the graphDriver Cleanup(). This addresses a longstanding issue that occurs when the Podman cleanup process is forcibly terminated and, on some occasions, the termination happens after the Cleanup() but before the change is recorded. As a result, the next user is not notified about the change and will mount the container without the home directory below (the infamous /var/lib/containers/storage/overlay mount). Then, the next time the graphDriver is initialized, the home directory is mounted on top of the existing mounts, causing some containers to fail with ENOENT since all their files are hidden, and others cannot be cleaned up since their mount directory is covered by the home directory mount.

Closes: containers/podman#18831
Closes: containers/podman#17216
Closes: containers/podman#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
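The heart of that fix is ordering: persist the "storage changed" notification before tearing down the graph driver, so an interruption between the two steps cannot leave the next user unaware of the change. A minimal, hypothetical sketch of that shape, with interfaces and names invented for illustration (not the real c/storage types):

```go
package main

import "log"

// graphDriver and layerStore are stand-ins for the real c/storage types;
// only the two calls relevant to the fix are modeled here.
type graphDriver interface {
	Cleanup() error // unmounts the overlay home and per-layer mounts
}

type layerStore interface {
	RecordWrite() error // persists "the on-disk state changed" for other users
}

// shutdown mirrors the fixed ordering described in the commit message:
// record the write FIRST, then clean up the driver. If the process is killed
// between the two steps, the worst case is a recorded change nobody strictly
// needed, instead of an unrecorded unmount that the next user never hears
// about (which is what led to the stacked home-directory mounts).
func shutdown(ls layerStore, gd graphDriver) error {
	if err := ls.RecordWrite(); err != nil {
		return err
	}
	return gd.Cleanup()
}

// Trivial fakes so the sketch runs end to end.
type fakeStore struct{}

func (fakeStore) RecordWrite() error { return nil }

type fakeDriver struct{}

func (fakeDriver) Cleanup() error { return nil }

func main() {
	if err := shutdown(fakeStore{}, fakeDriver{}); err != nil {
		log.Fatal(err)
	}
	log.Println("recorded the write before tearing down the driver")
}
```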
Two new flakes seen in #13808 (buildah treadmill):
Once they happen, everything is hosed: all subsequent tests fail.
This is sooooooo close to #17216 (unlinkat/EBUSY) that I'm 99.99% convinced it's the same bug, just a different manifestation (presumably because of new containers/whatever). If that turns out to be the case, I'll close this issue and merge these flakes there.