race: new rm -fa with dependencies is not actually waiting for container removal #18874

Closed
edsantiago opened this issue Jun 13, 2023 · 15 comments · Fixed by #19863

@edsantiago
Member

[It] podman rm -fa with dependencies
...
# podman [options] rm -fa
fdb8e9ac4cdf66db5548e8dbf8b0fc73f9c8cbb5fee01efa480787a1706928c4
d85817e78f2be4c058f93bc912f960d513a405260ba213757577b44b21973399
# podman [options] ps -aq
d85817e78f2b

[FAILED] expected 1 to equal 0 (as in, expected no output from "ps -aq")

This is a big one, failing often; it's only a matter of time before it hits us even in main.

  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 06-13 09:53 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root container sqlite
    • 06-13 09:50 in Podman rm podman rm -fa with dependencies
    • 06-13 08:27 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root host sqlite
    • 06-13 09:54 in Podman rm podman rm -fa with dependencies
@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Jun 13, 2023
@Luap99
Member

Luap99 commented Jun 13, 2023

@mheon PTAL

@mheon
Member

mheon commented Jun 13, 2023

I'll take a look this afternoon

@mheon
Member

mheon commented Jun 13, 2023

@edsantiago I don't suppose you have any idea what makes this reproduce? I'm not seeing any obvious reason why podman rm -a would exit early.

Is there a possibility it's not removing the second container at all? Or is it definitely being removed, just after the command exits?
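
One way to tell those two cases apart (a rough sketch, not part of the test suite; the poll count and interval are arbitrary) is to keep polling ps -aq for a few seconds after rm -fa returns. If the ID disappears during the poll, removal was merely late; if it never disappears, the container was not removed at all:

#!/bin/bash
# Sketch: distinguish "removed after the command exits" from "never removed".
# Assumes the two test containers have just been created.
podman rm -fa

left=$(podman ps -aq)
if [[ -z "$left" ]]; then
    echo "already gone when rm -fa returned (no race)"
    exit 0
fi

for i in $(seq 1 50); do    # poll for roughly 5 seconds
    left=$(podman ps -aq)
    if [[ -z "$left" ]]; then
        echo "all containers gone after $i polls (late removal, i.e. a race)"
        exit 0
    fi
    sleep 0.1
done

echo "still present after timeout (never removed): $left"
exit 1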

@edsantiago
Member Author

Sorry, I haven't looked into it beyond reporting. I will do so now - will start by looking at the test for obvious races, and will then see if I can reproduce it.

@Luap99
Member

Luap99 commented Jun 13, 2023

@mheon If you look at the logs, you can see that in AfterEach() both podman stop --all and podman rm -fa output this cid, so yes, I think this is not a race; it was not removed at all for some reason.

@edsantiago
Member Author

Gosh, I thought it'd be so easy to reproduce. It's not. Giving up for now; here's my attempt at a reproducer:

#!/bin/bash

td=/tmp/mypodmantmpdir
mkdir -p $td/root $td/runroot

PM="bin/podman --cgroup-manager cgroupfs --storage-driver vfs --events-backend file --db-backend sqlite --network-backend netavark --tmpdir $td --root $td/root --runroot $td/runroot"
IMAGE=quay.io/libpod/alpine:latest

$PM rm -fa

set -e

while :;do
    $PM run -d --name mytop $IMAGE top
    $PM run -d --network container:mytop $IMAGE top

    $PM rm -fa
    left=$($PM ps -aq)
    echo $left
    test -z "$left"
done

@edsantiago
Member Author

Could not reproduce on my f38 laptop, over many hours. Tried 1mt. Reproduced on the second iteration, podman main @ 7c76907. Given how quickly it reproduces -- less than a minute, usually less than five seconds -- I confirmed that none of the options matter. Here is the barest-bones reproducer:

#!/bin/bash

PM=bin/podman
IMAGE=quay.io/libpod/alpine:latest

$PM rm -fa

set -e

while :;do
    $PM run -d --name mytop $IMAGE top
    $PM run -d --network container:mytop $IMAGE top

    $PM rm -fa
    left=$($PM ps -aq)
    echo $left
    test -z "$left"
done

@mheon in answer to your question, the container is not being removed. Here's my ps a few minutes after script failure:

# bin/podman ps
CONTAINER ID  IMAGE                         COMMAND     CREATED        STATUS        PORTS       NAMES
659d48b527d8  quay.io/libpod/alpine:latest  top         4 minutes ago  Up 4 minutes              mytop

(that's "ps", not "ps -a").

@edsantiago
Member Author

Incidentally, my observations (Cirrus logs as well as dozens of iterations on 1mt) show that it is always the first container that survives. This may be just chance, but it seems unlikely over this many failures.

@edsantiago
Member Author

More (possibly useless) data. This reproduces so quickly on 1mt that I wrote a test to confirm my assertion above, that the survivor is always cid1. It is (so far, over hundreds of iterations) but I'm seeing, in my loop, rather a lot of these warnings:

error opening file `/run/crun/SHA/status`: No such file or directory

I don't know where they're coming from, and I don't know what that SHA refers to. It's not a fatal error, because the script runs under set -e and keeps going.
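
If it helps, here is a speculative way to see which SHAs crun currently has state for and whether each one matches a container podman still knows about. It assumes the default rootful crun state directory /run/crun (rootless state would be under $XDG_RUNTIME_DIR/crun) and that the directory names are the container IDs handed to crun:

#!/bin/bash
# Sketch: correlate crun state directories with podman container IDs.
# Path and naming assumptions as noted above.
for dir in /run/crun/*/; do
    [ -d "$dir" ] || continue
    sha=$(basename "$dir")
    if podman ps -aq --notruncate | grep -qx "$sha"; then
        echo "$sha: known to podman"
    else
        echo "$sha: no matching podman container (stale crun state?)"
    fi
done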

Oh, here's more fun. In the process of playing, I ended up with the umount/EINVAL bug (#18831)!

# ./reproducer
Error: creating container storage: the container name "mytop" is already in use by 335f21f6bee6407c79bf12181427c7db01b431fa2533112e44c9487002a14bd2. You have to remove that container to be able to reuse that name: that name is already in use

# bin/podman ps -aq   <<<--- no output
# bin/podman ps -a --external
CONTAINER ID  IMAGE                         COMMAND     CREATED        STATUS      PORTS       NAMES
335f21f6bee6  quay.io/libpod/alpine:latest  storage     6 minutes ago  Storage                 mytop

# bin/podman rm -af    <<<-- no output, no effect

#  bin/podman rm -f 335f
WARN[0000] Unmounting container "335f" while attempting to delete storage: unmounting "/var/lib/containers/storage/overlay/e3149559a6e1ae65a7b0b2e140ceb2bb7e7bdc06b9654bd5d46ba5d3adf7ad46/merged": invalid argument 
Error: removing storage for container "335f": unmounting "/var/lib/containers/storage/overlay/e3149559a6e1ae65a7b0b2e140ceb2bb7e7bdc06b9654bd5d46ba5d3adf7ad46/merged": invalid argument

From this point on, even podman system reset -f is of no help. Podman is completely hosed, nothing else works. bin/podman umount -a fails, EINVAL. mount (system mount, not podman mount) shows nothing unusual. I guess this means I am done with this 1mt.

In case it helps, here is my final confirm-the-survivor reproducer:

#!/bin/bash

PM=bin/podman
IMAGE=quay.io/libpod/alpine:latest

$PM rm -fa
echo

set -e

while :;do
    cid1=$($PM run -d --name mytop $IMAGE top)
    cid2=$($PM run -d --network container:mytop $IMAGE top)

    rm=$($PM rm -fa)
# Order is not predictable. Usually it's cid2 cid1, but sometimes reverse
#    if [[ "$(echo $rm)" != "$cid2 $cid1" ]]; then
#        echo
#        echo "rm: got  $rm"
#       echo "expected $cid2 $cid1"
#       exit 1
#    fi

    left=$($PM ps -aq --notruncate)
    if [[ -n "$left" ]]; then
        if [[ "$left" != "$cid1" ]]; then
            echo
            echo "WHOA! Leftover container is not cid1!"
            echo " cid1 : $cid1"
            echo " cid2 : $cid2"
            echo " left : $left"
            exit 1
        fi
        echo "Triggered, usual case."
        $PM rm -fa >/dev/null
    fi
done

@Luap99
Member

Luap99 commented Jun 22, 2023

@mheon Can you take a look? I think this should be addressed before the next release.

@edsantiago
Member Author

I don't think this helps in any way, given the super-easy reproducer, but just in case:

  • debian-12 : int podman debian-12 root host sqlite
    • 06-20 20:05 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 root container sqlite
    • 06-20 20:10 in Podman rm podman rm -fa with dependencies
    • 06-15 08:56 in Podman rm podman rm -fa with dependencies
    • 06-14 23:32 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 root host sqlite
    • 06-21 15:11 in Podman rm podman rm -fa with dependencies
    • 06-15 18:12 in Podman rm podman rm -fa with dependencies
    • 06-15 15:23 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 06-15 18:06 in Podman rm podman rm -fa with dependencies
    • 06-13 14:11 in Podman rm podman rm -fa with dependencies
    • 06-13 11:20 in Podman rm podman rm -fa with dependencies
    • 06-13 09:53 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root container sqlite
    • 06-15 18:08 in Podman rm podman rm -fa with dependencies
    • 06-15 15:17 in Podman rm podman rm -fa with dependencies
    • 06-14 20:46 in Podman rm podman rm -fa with dependencies
    • 06-13 09:50 in Podman rm podman rm -fa with dependencies
    • 06-13 08:27 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root host sqlite
    • 06-13 09:54 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 06-15 16:46 in Podman rm podman rm -fa with dependencies
    • 06-14 20:47 in Podman rm podman rm -fa with dependencies
    • 06-14 14:17 in Podman rm podman rm -fa with dependencies

Basically, all distros, root & rootless. I have not seen it happen in podman-remote nor on aarch64 (yet).

@edsantiago
Member Author

edsantiago commented Jul 7, 2023

New variation:

# podman [options] run --http-proxy=false --name ctr1 -d quay.io/libpod/alpine:latest top
2e485c2f06869bf64450cc183c807ff4946cf6ab58a8166a36da427bd6089a19
# podman [options] run -d --network container:ctr1 quay.io/libpod/alpine:latest top
1e73e0e60800b74061afc1f6dcb332a63f5ec6236ec7a86567bd8acdceec6997
# podman [options] rm -fa
send signal to pidfd: No such process
1e73e0e60800b74061afc1f6dcb332a63f5ec6236ec7a86567bd8acdceec6997
2e485c2f06869bf64450cc183c807ff4946cf6ab58a8166a36da427bd6089a19

[FAILED] rm -fa error logged
Expected
    <string>: send signal to pidfd: No such process
to be empty

That is: both containers were stopped and rm'ed (at least according to the log), but there was an error/warning logged.

UPDATE: never mind, this is probably #18452 (the "open pidfd" flake)

@nifoc

nifoc commented Jul 27, 2023

Given that a reproducer already exists, I have nothing specific to contribute.

Other than to say that after upgrading to 4.6.0, this has started to affect us as well.

@mheon
Member

mheon commented Jul 27, 2023

I'll take a further look tomorrow

@edsantiago
Member Author

Welcome back, @mheon! I hope your PTO was restful. Quick reminder that this is still a huge problem, flaking very often (but passing on ginkgo retry):

  • debian-12 : int podman debian-12 root host sqlite
    • 07-31 16:41 in Podman rm podman rm -fa with dependencies
  • debian-13 : int podman debian-13 root host sqlite
    • 08-09 20:44 in Podman rm podman rm -fa with dependencies
  • debian-13 : int podman debian-13 rootless host sqlite
    • 08-03 22:41 in TOP-LEVEL [AfterEach] Podman pod restart podman pod restart multiple pods
  • fedora-37 : int podman fedora-37 root container sqlite
    • 08-13 10:25 in Podman rm podman rm -fa with dependencies
    • 08-10 23:07 in Podman rm podman rm -fa with dependencies
    • 08-10 08:48 in Podman rm podman rm -fa with dependencies
    • 08-03 23:59 in Podman rm podman rm -fa with dependencies
    • 08-03 19:01 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-11 22:33 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root container sqlite
    • 08-12 09:19 in Podman rm podman rm -fa with dependencies
    • 08-09 20:44 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-03 23:55 in Podman rm podman rm -fa with dependencies
    • 08-03 13:25 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-11 15:09 in Podman rm podman rm -fa with dependencies
    • 08-09 23:27 in Podman rm podman rm -fa with dependencies
  • rawhide : int podman rawhide root host sqlite
    • 08-09 23:25 in Podman rm podman rm -fa with dependencies
    • 08-08 17:49 in Podman rm podman rm -fa with dependencies
    • 08-03 13:24 in Podman rm podman rm -fa with dependencies

mheon added a commit to mheon/libpod that referenced this issue Sep 5, 2023
When removing a container's dependency, getting an error that the
container has already been removed (ErrNoSuchCtr and
ErrCtrRemoved) should not be fatal. We wanted the container gone,
it's gone, no need to error out.

[NO NEW TESTS NEEDED] This is a race and thus hard to test for.

Fixes containers#18874

Signed-off-by: Matt Heon <[email protected]>
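
The same idea in shell form, a loose analogue of the fix rather than the actual Go change: when force-removing a dependency, an error saying the container no longer exists is the outcome we wanted, so it can be treated as success. The "no such container" string match below is an assumption about podman's error text.

# Sketch: treat "container already gone" as success when removing a dependency.
remove_ctr() {
    local out
    if out=$(podman rm -f "$1" 2>&1); then
        return 0
    fi
    if grep -qi "no such container" <<<"$out"; then
        return 0    # already removed: exactly what we wanted
    fi
    echo "$out" >&2
    return 1
}
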
@github-actions github-actions bot added the locked - please file new issue/PR label Dec 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2023