Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executable file /catatonit [et al] not found in $PATH: No such file or directory #17042

Closed
edsantiago opened this issue Jan 9, 2023 · 33 comments · Fixed by containers/storage#1724 or #20299
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless

Comments

@edsantiago
Copy link
Member

edsantiago commented Jan 9, 2023

Issue Description

Weird new flake, seen twice so far, both times in the same test:

not ok 491 podman play --service-container
...
$ podman pod restart test_pod
time="2023-01-03T16:16:36-06:00" level=error msg="Storage for container 36a44967cfc50aa8815018d902e09ab5a55fb0a8da70a9180a96c64bf6ed0685 has been removed"
Error: crun: executable file `/catatonit` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
[ rc=125 (** EXPECTED 0 **) ]
@edsantiago edsantiago added flakes Flakes from Continuous Integration rootless kind/bug Categorizes issue or PR as related to a bug. labels Jan 9, 2023
@edsantiago
Copy link
Member Author

New symptom, but I think it's likely to be the same root cause:

not ok 510 podman kube play read-only
...
# podman-remote --url unix:/tmp/podman_tmp_ytPc kube play --replace /tmp/podman_bats.aJCFSg/test.yml
Pods stopped:
7e51a461b3a474bd8acbec733d0459345e49789c441343a31f26546ebc2513e1
Pods removed:
Pods removed:
7e51a461b3a474bd8acbec733d0459345e49789c441343a31f26546ebc2513e1
Volumes removed:
Pod:
85546bc1df8e723b92108d6fc2ece0a81e04cd4df5b22353349f9045a6e2aaa2
Containers:
e9e0e350b1c41527bca2c69c5b7e7732dc4366f51a9e8bba0f84a26652839933
4ffd5b7234aa686162bcb82f9d1f033becd430fd34b565bd84ea3501f9b02d5f
69342bbfc7f1f6cda933ec17586078359fc3060caafd9f9de7490c5662739e91

starting container e9e0e350b1c41527bca2c69c5b7e7732dc4366f51a9e8bba0f84a26652839933: crun: executable file `touch` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

Error: error tearing down workloads "json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error" after kube play error "failed to start 1 containers"

Worse, though, this is one of those "leave-everything-hosed" issues. Subsequent pod rm fails:

# podman-remote --url unix:/tmp/podman_tmp_ytPc pod rm -t 0 -f -a
Error: 1 error occurred:
	* removing container 28abdcb12349964aa232cd84f5a0e8f72ff0b393636860b53e848d50e3c6f978 from pod 85546bc1df8e723b92108d6fc2ece0a81e04cd4df5b22353349f9045a6e2aaa2: removing container 28abdcb12349964aa232cd84f5a0e8f72ff0b393636860b53e848d50e3c6f978 root filesystem: 2 errors occurred:
	* deleting layer "0ca1bc5840cef6ce57fb0a660343b09a3868aaad4373e5ff7d733e5e21b6bc5f": unlinkat /var/lib/containers/storage/overlay/0ca1bc5840cef6ce57fb0a660343b09a3868aaad4373e5ff7d733e5e21b6bc5f/merged: device or resource busy
	* deleting layer "8d0f0581195fce5cde3bf490a3bded794b036f5a31d66a079386534301dfbb27": unlinkat /var/lib/containers/storage/overlay/8d0f0581195fce5cde3bf490a3bded794b036f5a31d66a079386534301dfbb27/merged: device or resource busy

...then more:

# podman-remote --url unix:/tmp/podman_tmp_ytPc image list --format {{.ID}} {{.Repository}}
timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’

...and then (almost) all subsequent tests fail.

f37 remote root

@dzintars
Copy link

I got into this while trying to fix slow ZFS performace. I just installed fresh Fedora root on ZFS and got slow MySQL, podman build (COPY) and this one issue now. Still have no idea where it comes from, but i will leave this trail there while i'm there. First time i run Play, it went OK. Now it fails.

@edsantiago
Copy link
Member Author

Someone please pay attention here. See also #17216.

@edsantiago
Copy link
Member Author

Another one, f37 remote rootless:

not ok 511 podman kube play read-only
$ podman-remote --url unix:/tmp/podman_tmp_KN8w kube play --replace /tmp/podman_bats.cs7isZ/test.yml
Pods stopped:
942cfb497e3a3263e13489ad7d1fc34b29ac696c25c2af184ffd07d7d6319643
Pods removed:
942cfb497e3a3263e13489ad7d1fc34b29ac696c25c2af184ffd07d7d6319643
Volumes removed:
Pod:
118559121f4ff7f5f8abcfea3619dcbc74a28f91735518dca6d52f8d3e5affff
Containers:
2e1cc5a0f26563b17becb1d22ed68b1ea5eea35c3e67a4f732f98b594e92fb5d
3bfb0392a9fcb094c7f81b4fa482c3393aa0253f738ffa6471cca1c0ecdd85f8
4a94393f3bd916e6c696759c61534f48aad43cc007cdca7e46ca970d6f6a917e

starting container 4a94393f3bd916e6c696759c61534f48aad43cc007cdca7e46ca970d6f6a917e: crun: executable file `touch` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

Error: error tearing down workloads "json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error" after kube play error "failed to start 1 containers"

All subsequent tests fail.

@containers/podman-maintainers pretty please TAL

@edsantiago
Copy link
Member Author

Another one:

I just created a Jira issue for it, in hopes that it will get some attention. Unfortunately Jira disappears issues as soon as they're created, so I have no way of actually linking to it, sorry.

@vrothberg
Copy link
Member

@edsantiago you can add the created issue to the current sprint. Then it'll pop up on the board.

@edsantiago
Copy link
Member Author

@vrothberg what happened is, I created a Jira issue, there was a brief popup saying "Successfully created issue", and it disappeared, and now I don't know how to find it.

@vrothberg
Copy link
Member

@edsantiago click on "Issues" in the top bar. Then select "Reported by me". The issue you've created should show up there.

@edsantiago
Copy link
Member Author

edsantiago commented Feb 6, 2023

Thank you! Issue is: https://issues.redhat.com/browse/RUN-1773 (Sorry, I can find no way to add it to a sprint.)

@vrothberg
Copy link
Member

I moved it. Click on "Edit" then select the "Prioritization" field. Not intuitive but at some point I could recall.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Mar 15, 2023

@edsantiago still an issue?

@edsantiago
Copy link
Member Author

Yes, just today, f37 remote rootless, in a kube test:

[+1295s] not ok 526 podman kube play read-only
...
starting container f907f2aef743f7beb8fd611a95b0ae980ea8d663fa336fff70f40d544c933827: 
      crun: executable file `touch` not found in $PATH:
      No such file or directory:
      OCI runtime attempted to invoke a command that was not found

@edsantiago
Copy link
Member Author

Two yesterday, in the same CI run:

@vrothberg
Copy link
Member

OK, this is not only /catatonit but many other commands and tests that fail as well.

@giuseppe could you take at a look at https://api.cirrus-ci.com/v1/artifact/task/4732541031153664/html/int-podman-fedora-37-rootless-host-boltdb.log.html#t--Aardvark-Test-2--Two-containers--same-subnet--1?

@edsantiago
Copy link
Member Author

Plot twist: it just happened in debian with runc:

# podman-remote kube play --replace /tmp/podman_bats.MU5ITd/test.yml
starting container <sha>: runc: runc create failed: unable to start container process: exec: "touch":
      executable file not found in $PATH: OCI runtime attempted to invoke a command that was not found

Not only that, but we go into everything-hosed territory (#17216) after that failure.

@vrothberg
Copy link
Member

Maybe it's just another symptom of the "everything's hosed" issue?

@vrothberg
Copy link
Member

It seems to relate to exec in a way

@giuseppe
Copy link
Member

this error usually happens when there is no storage mounted at that path and the rootfs directory is empty.

@edsantiago
Copy link
Member Author

Another one today, f37 rootless.

So far we have this happening on debian, f36, f37. Root and rootless. Podman and podman-remote. boltdb and sqlite. int and sys tests.

@edsantiago
Copy link
Member Author

This continues to happen. Yesterday I saw an interesting variation in f37 remote aarch64 root:

not ok 537 podman kube play read-only
...
# podman-remote kube play --replace /tmp/podman_bats.7MRfgJ/test.yml
starting container <CID>: workdir "/home/podman" does not exist on container <CID>

From there, of course, it goes into the unlinkat/EBUSY/hosed pit of despair.

This is a different symptom -- the typical one is "executable not found" -- but this is too close to be coincidence. To me, it all points to a problem unpacking or mounting images. And this is starting to scare me because the behavior seems to manifest as "random files/directories missing from container", which in turn can lead to countless failure possibilities.

@edsantiago edsantiago changed the title Error: crun: executable file /catatonit not found in $PATH: No such file or directory executable file /catatonit [et al] not found in $PATH: No such file or directory Apr 27, 2023
@edsantiago
Copy link
Member Author

Updated title. Removed 'crun', because this happens with runc too. I'm pretty sure it has nothing to do with the runtime, or if it does, it's maybe a race between unpacking image and starting runtime.

@edsantiago
Copy link
Member Author

I am treating this as a weird variation of this bug:

not ok 242 podman kill - test signal handling in containers
...
$ podman run -d quay.io/libpod/testimage:20221018 sh -c for i in 1 2 3 4 5 6 8 10 12 13 14 15 16 20 21 22 23 24 25 26 64; do trap "echo got: $i" $i; done;
        echo READY;
        while ! test -e /stop; do sleep 0.1; done;
        echo DONE
e6e85baacc7c04c9312b3650aeb82bb877e4f8e17a32f8d3511996137fef1772
time="2023-05-01T12:00:53-05:00" level=error msg="Failed to get journal entry: failed to read journal entry: error resolving symbol \"sd_journal_enumerate_data\": libsystemd-journal.so.0: cannot open shared object file: No such file or directory"

@edsantiago
Copy link
Member Author

Here's another variation, f38 rootless, where a subsequent network rm gets the "FS magic" flake (#17903):

$ podman [options] exec aone dig +short atwo
Error: crun: executable file `dig` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
[repeat five more times]

$ podman [options] network rm -f Testd8dc524952
time="2023-05-02T06:23:22-05:00" level=error msg="getting rootless network namespace: unknown FS magic on \"/run/user/3031/netns/rootless-netns-ea85f07403a18951a2fc\": 1021994"
time="2023-05-02T06:23:22-05:00" level=error msg="Unable to clean up network for container 7a0d923968fcdd14098c5f5eadcdc73f93f37e219efeae14622af0e77caf597e: \"unmounting network namespace for container 7a0d923968fcdd14098c5f5eadcdc73f93f37e219efeae14622af0e77caf597e: failed to unmount NS: at /run/user/3031/netns/netns-e52d9c8b-fee2-7d11-a237-1df80a9c2157: invalid argument\""
time="2023-05-02T06:23:22-05:00" level=error msg="getting rootless network namespace: unknown FS magic on \"/run/user/3031/netns/rootless-netns-ea85f07403a18951a2fc\": 1021994"
time="2023-05-02T06:23:22-05:00" level=error msg="Unable to clean up network for container e63334413cd64415273903ca641139ca450ad9af59589ea9c0f284d4a2e432e6: \"unmounting network namespace for container e63334413cd64415273903ca641139ca450ad9af59589ea9c0f284d4a2e432e6: failed to unmount NS: at /run/user/3031/netns/netns-30567630-8c68-c167-e0f4-a8c650707a87: invalid argument\""

@Luap99
Copy link
Member

Luap99 commented May 2, 2023

If these are only rootless then it could be the same problem (killed pause process). Basically the exec ends up in a different mountns so it cannot see the overlay container mount thus the mountpoint is empty and the exec fails.

But I don't really see the connection the the other errors in the system test as these things shouldn't happen there.

@edsantiago
Copy link
Member Author

The overall issue is not rootless; here's the list of incidents as of right now. Please, huge grain of salt, because some of the logs below are different manifestations that I believe are the same issue, but I can easily be wrong.

@edsantiago
Copy link
Member Author

I'm seeing a strong correlation between this bug and the unlinkat-ebusy one (#17216).

@edsantiago
Copy link
Member Author

Another one, debian root remote, and now this one is followed by unmounting/EINVAL (#18831) instead of unlinkat/EBUSY.

@edsantiago
Copy link
Member Author

Here's a completely baffling one:

[+0323s] not ok 123 podman logs - --until --follow k8s-file
...
<+039ms> # # podman run --log-driver=k8s-file --name iW0TN8JkHe -d quay.io/libpod/testimage:20221018 sh -c n=1;while :; do echo Z9FGyLq2HO--$n; n=$((n+1));sleep 0.1; done
<+270ms> # 4616840454fd99ee2f98a8ad0038a81a778c30e23982dc3e743766b20a5e4ce6
<+009ms> # # podman logs --until 3s -f iW0TN8JkHe
<+3.05s> # Z9FGyLq2HO--1
         # time="2023-07-12T14:22:51-05:00" level=warning msg="cannot set data threshold: error resolving symbol \"sd_journal_set_data_threshold\": libsystemd-journal.so.0: cannot open shared object file: No such file or directory"
         # Z9FGyLq2HO--2
         # Z9FGyLq2HO--3
         # Z9FGyLq2HO--4

I can't tell if the spurious message is coming from podman logs or from podman run. I also can't tell if this is related to the missing-logs flake, #18501, which I've spent most of this bug week unsuccessfully trying to reproduce.

@odockal
Copy link

odockal commented Aug 16, 2023

I have encounter something similar when trying a docker-compose with a podman 4.6.1 installed via podman desktop. Check out this issue-3560 in podman desktop repo pls.

@edsantiago
Copy link
Member Author

Another one (remote f39beta root) where "crun can't find touch", then everything gets hosed:

[+1232s] not ok 591 podman kube play read-only
...
<+068ms> # # podman-remote --url unix:///tmp/podman_tmp_dVRg kube play --replace /tmp/podman_bats.60PZ2z/test.yml
<+1.23s> # Pods stopped:
         # 20132e558e28f8aed193c9faf8ba5aa58e2fb799556b784c4f85b31790cd4442
         # Pods removed:
         # 20132e558e28f8aed193c9faf8ba5aa58e2fb799556b784c4f85b31790cd4442
         # Secrets removed:
         # Volumes removed:
         # Pod:
         # fd3dfd6dd9b85b0ae9fce22cf368783f39155fcc275d100971a0068fe1625f23
         # Containers:
         # 64710c679e18906ed3a4f79a4d14558ebff4d1355020af76d563f8a6f8df970f
         # 53ef8353cb11071772dcff166d98f6c65b1db801bce252175ddddc53fa4e6394
         # 03c6affb6ff54077ffe1c10fe09586c70e87c87a90ca34260c602451cf509ca9
         #
         # starting container 64710c679e18906ed3a4f79a4d14558ebff4d1355020af76d563f8a6f8df970f: crun: executable file `touch` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
         #
         # Error: failed to start 1 containers
<+009ms> # [ rc=125 (** EXPECTED 0 **) ]

Then, after that, all tests fail with:

<+014ms> # # podman-remote --url unix:///tmp/podman_tmp_dVRg rm -f 64710c679e18
<+045ms> # Error: removing storage for container "64710c679e18": removing mount point "/var/lib/containers/storage/overlay/7023308de3a259c3392a50eccd9b855ae7d6a9a7a4d8590f6da1975e79b7805c/merged": device or resource busy

This looks a LOT like #18831 and all of its everything-is-hosed variations, but here the strings "umount" or "unmount" do not appear in the error logs, so, keeping separate. Anyhow, @giuseppe, maybe there's a connection? Here are the recent ones:

Seen in: sys podman/remote fedora-37/fedora-39? root/rootless host boltdb/sqlite

giuseppe added a commit to giuseppe/storage that referenced this issue Oct 4, 2023
Move the execution of RecordWrite() before the graphDriver Cleanup().
This addresses a longstanding issue that occurs when the Podman
cleanup process is forcely terminated and on some occasions the
termination happens after the Cleanup() but before the change is
recorded.  This causes that the next user is not notified about the
change and will mount the container without the home directory
below (the infamous /var/lib/containers/storage/overlay mount).
Then when the next time the graphDriver is initialized, the home
directory is mounted on top of the existing mounts causing some
containers to fail with ENOENT since all files are hidden and some
others cannot be cleaned up since their mount directory is covered by
the home directory mount.

Closes: containers/podman#18831
Closes: containers/podman#17216
Closes: containers/podman#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
@edsantiago
Copy link
Member Author

Not fixed. Flaked in my PR with c/storage vendored in. remote f39 root:

[It] podman ensure always init containers always run
...
# podman-remote [options] create --pod foobar -t quay.io/libpod/alpine:latest top
18b89aa07de8ead06331d7a38806eca61b6661ee9783b6d4d998a7218251fa37
# podman-remote [options] pod start foobar
d8d27e61637fd7db3f79af43dac1b881c3e19ff02be584ca484d534f98150bed
# podman-remote [options] exec 18b89aa07de8ead06331d7a38806eca61b6661ee9783b6d4d998a7218251fa37 cat /dev/shm/TXLQjCRunbCg
22:25:16.540067400
# podman-remote [options] pod stop foobar
d8d27e61637fd7db3f79af43dac1b881c3e19ff02be584ca484d534f98150bed
# podman-remote [options] pod start foobar
time="2023-10-05T17:25:18-05:00" level=error msg="Storage for container 18b89aa07de8ead06331d7a38806eca61b6661ee9783b6d4d998a7218251fa37 has been removed"
Error: starting container 18b89aa07de8ead06331d7a38806eca61b6661ee9783b6d4d998a7218251fa37: crun: executable file `top` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

@giuseppe
Copy link
Member

giuseppe commented Oct 7, 2023

Not fixed. Flaked in my PR with c/storage vendored in. remote f39 root:

yeah, this is a different flake.

I think it is fixed with #20299

Without the PR, I could reproduce the flake with the following script (it would reproduce in around 5 minutes on a CI machine):

#!/bin/sh

set -xeuo pipefail

PODMAN=bin/podman-remote

filename="/dev/shm/foo123"
$PODMAN create --init-ctr always --net host --pod new:foobar fedora /bin/sh -c "date +%T.%N >  $filename"
$PODMAN create --name verify  --net host --pod foobar -t alpine top
$PODMAN pod start foobar
$PODMAN pod stop foobar
$PODMAN pod start foobar
$PODMAN rm -fa
$PODMAN pod rm -fa

and run it in a loop.

It is already running for longer than an hour with the patch

giuseppe added a commit to giuseppe/libpod that referenced this issue Oct 9, 2023
when running as a service, the c.state.Mounted flag could get out of
sync if the container is cleaned up through the cleanup process.

To avoid this, always check if the mountpoint is really present before
skipping the mount.

[NO NEW TESTS NEEDED]

Closes: containers#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jan 8, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless
Projects
None yet
7 participants