unmounting/EINVAL flake #18831
Comments
This is likely the result of containers/storage#1607, cc @mtrmac. Also, this change is causing the podman upgrade tests to fail.
As I recall, making the storage private was because of leaked mounts when we were using Devicemapper. Not sure if this still applies, but that was a huge problem back in the early Docker days.
Yes. As I said there, I don’t really know what I’m doing and I wouldn’t be too surprised if ignoring unmount failures were intentional. But then again, there are multiple layers of metadata, reference counting, and extra checks of the tracked state vs. actual state (like https://github.com/mtrmac/storage/blob/da0df59a00e0a2cd2880c7cc82322989834e7bf4/drivers/counter.go#L54 ), so if we get here and are still trying to unmount something that isn’t mounted at all, that seems like a failure worthy of some attention. By someone who understands things better than I do, if possible…
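To make the tracked-state-vs-actual-state idea concrete, here is a minimal, hypothetical Go sketch (the type, field layout, and mountinfo check are illustrative inventions, not the c/storage counter.go code): a reference count records how many callers believe a path is mounted, and it is reconciled against what the kernel actually reports before being trusted.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"sync"
)

// refCounter tracks how many callers believe each path is mounted.
// This mirrors the general idea behind drivers/counter.go, not its code.
type refCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// isMounted consults the kernel's view by scanning /proc/self/mountinfo.
func isMounted(path string) bool {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return false
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Field 5 (index 4) of each mountinfo line is the mount point.
		if len(fields) > 4 && fields[4] == path {
			return true
		}
	}
	return false
}

// Decrement lowers the tracked count, but first reconciles it with reality:
// if the kernel says the path is not mounted at all, the stale count is reset.
// This is the same flavor of check as the counter.go line linked above.
func (c *refCounter) Decrement(path string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !isMounted(path) {
		c.counts[path] = 0
		return 0
	}
	if c.counts[path] > 0 {
		c.counts[path]--
	}
	return c.counts[path]
}

func main() {
	c := &refCounter{counts: map[string]int{"/var/lib/containers/storage/overlay": 1}}
	fmt.Println("remaining references:", c.Decrement("/var/lib/containers/storage/overlay"))
}
```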
(Just to be explicit, that should be possible to turn off with
This is now triggering in main (debian root).
Yes, I found that option: 68183b0
Is anyone working on fixing this?
for containers#18514: if we get a timeout in teardown(), run and show the output of podman system locks.
for containers#18831: if we hit unmount/EINVAL, nothing will ever work again, so signal all future tests to skip.

Signed-off-by: Ed Santiago <[email protected]>
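As a rough illustration of the "signal all future tests to skip" part of that commit, here is a hypothetical bash sketch, assuming a bats-style suite; the sentinel path and function names are invented and are not the actual helpers.bash code:

```bash
# Hypothetical sketch of a "poison pill" for the test suite: once the
# unmount/EINVAL flake is seen, drop a sentinel file so every later test
# bails out early instead of producing hundreds of misleading failures.

SENTINEL="${BATS_SUITE_TMPDIR:-/tmp}/storage_is_hosed"

mark_storage_hosed() {
    # Call this when test output matches "unmounting ... invalid argument".
    echo "unmount/EINVAL hit earlier; storage is unusable, skipping" >"$SENTINEL"
}

skip_if_storage_hosed() {
    # Call this at the top of setup() so later tests skip quickly.
    if [[ -e $SENTINEL ]]; then
        skip "$(<"$SENTINEL")"    # bats skip
    fi
}
```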
Ping! Hi everyone, this is still hitting us in CI, and in real PRs. Most recent was last night, in a PR that (if I git correctly) includes #18999 (vendor c/*).
A friendly reminder that this issue had no activity for 30 days.
This is still hitting us hard.
WOW! Every single
I just got this on my laptop, rootless non-remote, after ^C'ing from a test loop:
Unlike this issue, a subsequent
Haven't seen this one in the last 4-5 days, but now all of a sudden:
(in f38 remote root). To recap:
Still remote-only AFAICT.
I think:

- The original variant is pre-containers/storage#1607.
- The second is after containers/storage#1607: unmount failure is reported.
- The third is after containers/storage#1687: unmount failure is ignored again, but deleting is now a hard failure.

AFAICT none of this tinkering has substantially improved our understanding of the actual cause.
Seen just now in int f38 rootless non-remote:
Not quite the same -- this is
Hi, I know everyone's busy and that this is a hard one, but please don't forget about it.
Seen in: int/sys podman/remote fedora-37/fedora-38/fedora-38-aarch64/rawhide root/rootless host boltdb/sqlite
Here's a variant that I don't recognize (f37 remote root):
Curiously, this does not hose all subsequent tests. The error persists forevermore, but only in a place where it doesn't really matter. Subsequent tests actually pass.
Just seen in debian 13 but with yet another surprise symptom:
That looks suspiciously like #17042 ("catatonit et al not found", another hard-to-track-down flake). And, blam, everything after that is hosed.
Followup: possibly fixed by @giuseppe in containers/storage#1722.
Sigh..... no such luck.
Was the change already vendored? I will take another look, maybe it is a different instance of the same problem and needs a similar fix.
Yes, this is #17831, my pet PR on which I churn CI. I believe I correctly vendored c/storage, but it's always possible I did something wrong. PS: isn't it really, really late where you are??
I double-checked and think you vendored it correctly. But I am sure @giuseppe has picked up the right scent - there may be more such cases scattered in c/storage (or libpod?).
The good news is that the occurrence I looked at yesterday doesn't appear in the log, and that was an easy one since I managed to reproduce it every time in a VM. I am looking at the new error message right now.
I think the issue could be fixed with something like:

```diff
diff --git a/drivers/overlay/overlay.go b/drivers/overlay/overlay.go
index 0f6d74021..f6f0932e0 100644
--- a/drivers/overlay/overlay.go
+++ b/drivers/overlay/overlay.go
@@ -1891,12 +1891,7 @@ func (d *Driver) Put(id string) error {
 		}
 	}
-	if err := unix.Rmdir(mountpoint); err != nil && !os.IsNotExist(err) {
-		logrus.Debugf("Failed to remove mountpoint %s overlay: %s - %v", id, mountpoint, err)
-		return fmt.Errorf("removing mount point %q: %w", mountpoint, err)
-	}
-
-	return nil
+	return system.EnsureRemoveAll(mountpoint)
 }
```

but before I open a PR, I'd like to understand how we even get into that situation. @edsantiago do you think we could collect some more information with:

```diff
diff --git a/test/system/helpers.bash b/test/system/helpers.bash
index 3fcd69a60..f7dad187e 100644
--- a/test/system/helpers.bash
+++ b/test/system/helpers.bash
@@ -150,10 +150,13 @@ function basic_setup() {
             echo "# setup(): removing stray external container $1 ($2)" >&3
             run_podman '?' rm -f $1
             if [[ $status -ne 0 ]]; then
+                run_podman '?' --log-level debug rm -f $1
                 echo "# [setup] $_LOG_PROMPT podman rm -f $1" >&3
                 for errline in "${lines[@]}"; do
                     echo "# $errline" >&3
                 done
+                echo "# MOUNTINFO"
+                cat /proc/self/mountinfo | while read i; do echo "# $i"; done
                 # FIXME FIXME FIXME: temporary hack for #18831. If we see the
                 # unmount/EINVAL flake, nothing will ever work again.
                 if [[ $output =~ unmounting.*invalid ]]; then
```

I've opened a test PR based on yours: #20183. I'll let it run for a while and see if I am lucky enough to hit it.
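For context on why swapping the bare unix.Rmdir for system.EnsureRemoveAll in Put() is more forgiving: the general technique such a helper embodies is to retry removal and lazily unmount whatever is still holding the directory busy. The sketch below is written from memory with an invented function name; it illustrates the idea, not the c/storage implementation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// ensureRemoveAllSketch keeps trying to remove the tree and, when removal
// fails with EBUSY because something is still mounted inside it, lazily
// detaches that mount and retries.
func ensureRemoveAllSketch(dir string) error {
	for attempt := 0; attempt < 50; attempt++ {
		err := os.RemoveAll(dir)
		if err == nil || os.IsNotExist(err) {
			return nil
		}
		var pe *os.PathError
		if !errors.As(err, &pe) || !errors.Is(pe.Err, unix.EBUSY) {
			return err
		}
		// A mount is in the way; detach it lazily. EINVAL here means
		// "not a mount point (anymore)", which is fine to ignore.
		if uerr := unix.Unmount(pe.Path, unix.MNT_DETACH); uerr != nil && !errors.Is(uerr, unix.EINVAL) {
			return fmt.Errorf("unmounting %s: %w", pe.Path, uerr)
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("could not remove %s after repeated attempts", dir)
}

func main() {
	if err := ensureRemoveAllSketch("/tmp/example-layer-mountpoint"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```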
Move the execution of RecordWrite() before the graphDriver Cleanup(). This addresses a longstanding issue that occurs when the Podman cleanup process is forcibly terminated and, on some occasions, the termination happens after the Cleanup() but before the change is recorded. As a result, the next user is not notified about the change and will mount the container without the home directory below (the infamous /var/lib/containers/storage/overlay mount). Then, the next time the graphDriver is initialized, the home directory is mounted on top of the existing mounts, causing some containers to fail with ENOENT since all their files are hidden, and others cannot be cleaned up since their mount directory is covered by the home directory mount.

Closes: containers/podman#18831
Closes: containers/podman#17216
Closes: containers/podman#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
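The heart of that fix is ordering: persist the "storage changed" notification before tearing down the graph driver, so an interruption between the two steps cannot leave the next user unaware of the change. A minimal, hypothetical sketch of that shape, with interfaces and names invented for illustration (not the real c/storage types):

```go
package main

import "log"

// graphDriver and layerStore are stand-ins for the real c/storage types;
// only the two calls relevant to the fix are modeled here.
type graphDriver interface {
	Cleanup() error // unmounts the overlay home and per-layer mounts
}

type layerStore interface {
	RecordWrite() error // persists "the on-disk state changed" for other users
}

// shutdown mirrors the fixed ordering described in the commit message:
// record the write FIRST, then clean up the driver. If the process is killed
// between the two steps, the worst case is a recorded change nobody strictly
// needed, instead of an unrecorded unmount that the next user never hears
// about (which is what led to the stacked home-directory mounts).
func shutdown(ls layerStore, gd graphDriver) error {
	if err := ls.RecordWrite(); err != nil {
		return err
	}
	return gd.Cleanup()
}

// Trivial fakes so the sketch runs end to end.
type fakeStore struct{}

func (fakeStore) RecordWrite() error { return nil }

type fakeDriver struct{}

func (fakeDriver) Cleanup() error { return nil }

func main() {
	if err := shutdown(fakeStore{}, fakeDriver{}); err != nil {
		log.Fatal(err)
	}
	log.Println("recorded the write before tearing down the driver")
}
```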
Two new flakes seen in #13808 (buildah treadmill):
Once they happen, everything is hosed: all subsequent tests fail.
This is sooooooo close to #17216 (unlinkat/EBUSY) that I'm 99.99% convinced it's the same bug, just a different manifestation (presumably because of new containers/whatever). If that turns out to be the case, I'll close this issue and merge these flakes there.