Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race: completion: Top layer ... not found ... storage may be corrupted #23331

Closed
edsantiago opened this issue Jul 18, 2024 · 24 comments · Fixed by containers/common#2125, #23807 or containers/buildah#5710
Assignees
Labels
flakes Flakes from Continuous Integration jira kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

Another parallel-system-test flake in command completion, seen in f39 rootless:

<+048ms> # $ podman __completeNoDesc  image tag
<+097ms> # [Debug] 2 arg(s), received 1
         # time="2024-07-18T12:56:33-05:00" level=warning msg="Top layer f6adb48d9895082456c5aa8074e44f3d32852702f03d54110bb34f01bb8821c0 of image 5fd625cf9c7cb8392d7acb93aac1276dd4fcb0c06ebb891485d01384f99932dc not found in layer tree. The storage may be corrupted, consider running `podman system reset`."
         # localhost/i_t109-kniryp3x:58ibj
         # quay.io/libpod/testimage:20240123
         # localhost/podman-pause:5.2.0-dev-1721323759
         # quay.io/libpod/systemd-image:20240124
         # :4
         # Completion ended with directive: ShellCompDirectiveNoFileComp
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I haven't tried to look for a reproducer yet.

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Jul 18, 2024
@Luap99
Copy link
Member

Luap99 commented Jul 19, 2024

To be clear I don't think any of the races are related to completion at all. It just happens that the shell completion test has to do a ton of listing (podman ps, podman images,...) in a long loop so it is super likely to catch races there.

Seems to related to db37d66
and you allowed the warning in ff07abe so seems like we just have been ignoring it.

@Luap99
Copy link
Member

Luap99 commented Jul 19, 2024

And I must say the suggestion of consider running podman system reset is horrible if this actually can happen on a normal image listing without having a corrupted store...

@Luap99 Luap99 added the kind/bug Categorizes issue or PR as related to a bug. label Jul 19, 2024
@Luap99
Copy link
Member

Luap99 commented Jul 19, 2024

diff --git a/test/system/010-images.bats b/test/system/010-images.bats
index 1548615a7..3653d242d 100644
--- a/test/system/010-images.bats
+++ b/test/system/010-images.bats
@@ -394,7 +394,7 @@ EOF
 
 @test "podman images with concurrent removal" {
     skip_if_remote "following test is not supported for remote clients"
-    local count=5
+    local count=100
 
     # First build $count images
     for i in $(seq --format '%02g' 1 $count); do
@@ -411,21 +411,9 @@ EOF
     for i in $(seq --format '%02g' 1 $count); do
         timeout --foreground -v --kill=10 60 \
                 $PODMAN rmi i$i &
+        run_podman images --format "{{.ID}} {{.Names}}"
     done
 
-    tries=100
-    while [[ ${#lines[*]} -gt 1 ]] && [[ $tries -gt 0 ]]; do
-        # Prior to #18980, 'podman images' during rmi could fail with 'image not known'
-        # '0+w' because we sometimes get warnings.
-        run_podman 0+w images --format "{{.ID}} {{.Names}}"
-        allow_warnings "Top layer .* of image .* not found in layer tree"
-        tries=$((tries - 1))
-    done
-
-    if [[ $tries -eq 0 ]]; then
-        die "Timed out waiting for images to be removed"
-    fi
-
     wait
 }

then run with hack/bats --rootless 010:concurrent reproduces almost every time for me locally.

@Luap99
Copy link
Member

Luap99 commented Jul 19, 2024

Ok so the libimage code is racy by design. First the code gets a list of all images[1], then it tries to build a layer tree[2] and for that must get a list of all layers[3].
Then the code checks each image top layer against the layers it got. However when a image (including its layers) has been removed between both calls then this cannot work and throw this warning.

[1] https://github.com/containers/common/blob/fdbae3a180cb1d81cb4316e0077b3906672c75a8/libimage/runtime.go#L587

[2] https://github.com/containers/common/blob/fdbae3a180cb1d81cb4316e0077b3906672c75a8/libimage/runtime.go#L611
[3] https://github.com/containers/common/blob/fdbae3a180cb1d81cb4316e0077b3906672c75a8/libimage/layer_tree.go#L95

I don't see any way to fix this. We could just stop throwing this bogus warning but I assume it was added for a reason in cases where the store is actually corrupt and we have a image without layer... (is that even possible with the c/storage locking mechanism?)

I guess ideally we would get all images and layers in one locked c/storage call, i.e. something like store.ImagesAndLayers()?

@mtrmac @nalind @giuseppe opinions on this?

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 19, 2024

Ok so the libimage code is racy by design.

Yes.

I don't see any way to fix this. We could just stop throwing this bogus warning but I assume it was added for a reason in cases where the store is actually corrupt and we have a image without layer...

git blame finds adfcb74 .

(is that even possible with the c/storage locking mechanism?)

It shouldn’t be possible: deleting a layer verifies that there are no references to it (in particular no TopLayer references), and, nowadays, if a layer is incomplete after a restart, it is deleted when opening the store, making it impossible to reference it when creating an image (the deletion of incomplete layers didn’t always exist in the past).

But bugs happen, and I think if the storage were corrupt, the user should be told something to allow diagnosing inexplicably missing images.

I guess ideally we would get all images and layers in one locked c/storage call, i.e. something like store.ImagesAndLayers()?

Yes, I think that would make sense. (Also note the layerTree.ociCache; that would also have to be populated atomically.) It might even be a rather nice performance improvement, for cases where all images are actually consumed, we would only take the lock once instead of O(images) times; OTOH perhaps a bit costly for users who don’t actually want the data of all images.


And I must say the suggestion of consider running podman system reset is horrible if this actually can happen on a normal image listing without having a corrupted store...

I think qualifying this with “if there are no concurrent container operations happening, …” is the least we should do. I don’t have a strong opinion on silencing the warning completely — my weak preference would be to keep it for user-initiated operations, but to be silent for command completion.

Maybe command completion should silence warnings universally; I’m not at all sure.

@Luap99
Copy link
Member

Luap99 commented Jul 19, 2024

But bugs happen, and I think if the storage were corrupt, the user should be told something to allow diagnosing inexplicably missing images.

We recently merged podman system check so I guess this is a much more useful recommendation now?

I guess ideally we would get all images and layers in one locked c/storage call, i.e. something like store.ImagesAndLayers()?

Yes, I think that would make sense. (Also note the layerTree.ociCache; that would also have to be populated atomically.) It might even be a rather nice performance improvement, for cases where all images are actually consumed, we would only take the lock once instead of O(images) times; OTOH perhaps a bit costly for users who don’t actually want the data of all images.

I would like this because we can then clearly assume the store is inconsistent if we hit this condition.
But yes there are more cases in libimage that need fixing, the layer tree seems to be called from many palaces there.


Maybe command completion should silence warnings universally; I’m not at all sure.

Note this is not related to shell completion at all, the test reverse engineers how shell completion works via the cobra library by manually calling the hidden completion command. All stderr lines of the command are ignored and never seen when actually uses shell completion interactively so this is already the case.

In general the shell completion test has found a lot of issues because it essentially does the listing over and over for all our commands and other tests do other stuff in parallel now.

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 19, 2024

We recently merged podman system check so I guess this is a much more useful recommendation now?

Nice, yes.


All stderr lines of the command are ignored and never seen when actually uses shell completion interactively so this is already the case.

Thanks, I didn’t know that.

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 19, 2024

Also note the layerTree.ociCache

Oops, this is the hard part. ociCache does not contain just storage.Image (we can easily read all of them atomically), but the image config (a separate file on disk for each image; we could still read all of them atomically, I guess), and it does that through c/image, which does c/storage accesses to read the config behind an abstraction. Crossing the layers to provide a pre-read config seems pretty ugly to me. Luckily, the callers of layerTree.toOCI seem to all, already, silently ignore missing images.

So we would probably not touch ociCache, and just obtain consistent []*storage.Layer and []*storage.Image in Runtime.layerTree (or provide it by a caller). That simplifies things.

@rhatdan
Copy link
Member

rhatdan commented Jul 20, 2024

@vrothberg FYI

@Luap99 Luap99 added the jira label Jul 22, 2024
Luap99 added a commit to Luap99/common that referenced this issue Jul 23, 2024
podman system reset nukes everything which makes not a great suggestion.
Now that we have the podman system check recommend that instead as it
actually checks the storage for corruption and might be able to repair
it.

This is important because as it turns out this warning can pop up even
in totally normal circumstances as there is a race between image and
layer listing[1]. And even if that get fixed the check command will be
the better suggestion.

[1] containers/podman#23331

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Copy link
Member Author

Slightly different symptom seen rawhide rootless:

<+026ms> # $ podman system df --format {{"\n"}}
<+159ms> # Error: Image 4494503e9cb6 exists in local storage but may be corrupted (remove the image to resolve the issue): reading image "4494503e9cb696a2cfbe03f491c55e8f784d81384493d312ab2937df2b0211c4": locating image with ID "4494503e9cb696a2cfbe03f491c55e8f784d81384493d312ab2937df2b0211c4": image not known

I'm treating it as a different manifestation of this same bug.

@Luap99
Copy link
Member

Luap99 commented Jul 23, 2024

Slightly different symptom seen rawhide rootless:

<+026ms> # $ podman system df --format {{"\n"}}
<+159ms> # Error: Image 4494503e9cb6 exists in local storage but may be corrupted (remove the image to resolve the issue): reading image "4494503e9cb696a2cfbe03f491c55e8f784d81384493d312ab2937df2b0211c4": locating image with ID "4494503e9cb696a2cfbe03f491c55e8f784d81384493d312ab2937df2b0211c4": image not known

I'm treating it as a different manifestation of this same bug.

Yes I think that is fundamentally the same race issue between listing two times, however this one might be even harder fix...
containers/common@ade9985

@edsantiago
Copy link
Member Author

Another new symptom that I'm assigning here, seen in debian root:

<+023ms> # # podman system df --format {{"\n"}}
<+116ms> # Error: failed to get root file system size of container b04134c7e12da38c71331be6a023fc780eee605b9d86f1a25eb0563aab41552a: lstat /var/lib/containers/storage/overlay-containers/b04134c7e12da38c71331be6a023fc780eee605b9d86f1a25eb0563aab41552a/userdata/attach: no such file or directory
<+011ms> # [ rc=125 ]

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 24, 2024

I guess ideally we would get all images and layers in one locked c/storage call, i.e. something like store.ImagesAndLayers()?

Implementation note for c/storage: add a new method to the Store interface and implementation, combining the existing Images and Layers, but holding both kinds of locks simultaneously. The # Locking hierarchy: documentation says in which order to lock; see store.ImageSize for an ~example.

@Luap99
Copy link
Member

Luap99 commented Jul 24, 2024

I guess ideally we would get all images and layers in one locked c/storage call, i.e. something like store.ImagesAndLayers()?

Implementation note for c/storage: add a new method to the Store interface and implementation, combining the existing Images and Layers, but holding both kinds of locks simultaneously. The # Locking hierarchy: documentation says in which order to lock; see store.ImageSize for an ~example.

Would it sense to add a more generic function List() that accepts what kind of objects to list as argument, that way we can extend it without adding more API function calls all the time. Because today we might only care about images and layers but there might be other callers that like containers too.

i.e.

func List(ListOptions) (ListResult, error)

type ListOptions struct {
	Images     bool
	Layers     bool
	Containers bool
}

type ListResult struct {
	Images     []Image
	Layers     []Layer
	Containers []Container
}

and then we only populate the result based on what was requested in the options?
Also not sure if there are other c/storage objects that one might need to list...

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 24, 2024

I’m very weakly opposed, but I wouldn’t reject a PR of that kind. (I’m more dubious about the concept of wanting to add more options than an API with exactly these three fields.)

Conceptually, c/storage locks are (not measured, intuitively) quite expensive and it might be an Interesting Project to try to decrease the number of times we take them. Preferably we would want to hold them for short periods, i.e. quickly getting a snapshot of data under a lock, and then computing LayerTree / history can be done without the lock. OTOH the ultimate consumer might only care about one or two fields, so getting a full []Images copy would cause unnecessary CPU & memory usage to copy all that data. That’s a fairly difficult trade-off — having a universally-available API now would make it easier to avoid taking locks, OTOH adding special-purpose API would be better if we did take the effort.

Pragmatically, I think the main obstacle is actually going to be the abstraction barriers in the callers anyway — IIRC there are various places in libimage where the code already has an Image, if only it could hold the value for all the c/common/Podman processing, instead of looking the data up again. (But then again, in some places it seems better to ask again than to worry about stale data…)

[And then there’s the dream of a single c/storage+Podman database, maybe Sqlite, with a single set of locks? I don’t know how we could get there.]


Imagining the implementation, the code structure doing the right kind of locking for these three bools would be a little complex but certainly very manageable. But to the extent this anticipates adding more kinds of objects to query, ones that might require taking specific combinations of locks (“the user asked for StorageChar”, so we need to take the containers+layers lock”), I think it the universal List code could get fairly messy, and more importantly the code would be structured as “which locks to take for data; take locks; read data”, which risks the lock specification and the reading getting out of sync. I can imagine mechanisms that abstract this… but for simple reads, c/storage already has helpers like readAllLayerStores, so it seems to me that adding an entirely new top-level API would be very comparable in complexity to adding a new ListOptions field.

Either way, the callers are blocked on adding a c/storage PR and rebasing c/storage, and to me that’s the larger hurdle than what are the exact details of the PR.

But then again, there might very well be at least a single caller which currently lists images+layers+containers in sequence and expects the data to be consistent; if we had that single caller call the proposed new ListOptions API, and even never added any new field to ListOptions, that would still be a nice improvement.

@Luap99
Copy link
Member

Luap99 commented Jul 24, 2024

so it seems to me that adding an entirely new top-level API would be very comparable in complexity to adding a new ListOptions field.

Maybe, if we assume we only want images + layers then yes one function for that is simple. But if we have other callers that maybe want containers, images and layers then we have to add another again, then another callers wants containers and images and so on. Thus my suggestion of something more generic. If we know this is unlikely to happen (I don't know that) then sure let's opt for the simple fix for the issue at hand.

To me the Store interface is overly bloated with functions so I didn't want to add much more but if you don't mind that fine.

Regardless the storage part is seems to me like the easy part, fixing c/common callers is harder as one must audit all code paths for such races.

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 24, 2024

To me the Store interface is overly bloated with functions so I didn't want to add much more but if you don't mind that fine.

TL;DR: Very generally speaking, I’m not worried about the number of functions.

I find it easier to evolve APIs that have many operations, each for a very specific purpose (“what is the size of the image on disk, counting shared data and image-specific data separately”?) than APIs that have a small surface and give the user everything (“inspect an image, I will then read the fields I care about”), primarily because for the specific APIs, we know exactly what the caller wants, we can find the callers (maybe there aren’t any any more), and we have much more freedom to change the underlying implementation. The general APIs are simpler to write, and maybe compose better for a bit, but the “everything” concept means that users end up hard-coding assumptions about the designs and relationships that can never be changed later. Of course there is a balance to be struck, users do need some way to inspect actual system state down to possible implementation details — but on balance, I am biased towards many specific APIs (whether that’s functions/objects/fields) with limited purpose and limited promises; and I’m willing to pay the price of more symbols and longer names.

E.g. the whole layerTree structure is an example, where c/common has a deeply hard-coded concept of layers that have parents, forming a DAG. That might become irrelevant if we rebuilt storage around requiring overlay-like filesystems. With the old graph drivers, especially now-removed device-mapper, we must store two separate copies of layers with images composed of layers AB and BA; with overlay, we could store layers A and B on disk only once, and only cheaply mount that together. Right now, c/storage still uses the DAG structure and stores the AB/BA layers completely separately, but for overlay, that is not always necessary. (And yes, that also illustrates the “balance” — to a large extent, users don’t care and c/common should not care either, but users do, in the end, need to understand where the disk space is consumed.)

@Honny1
Copy link
Member

Honny1 commented Aug 29, 2024

Changes need to be propagated to Podman.

@Honny1 Honny1 reopened this Aug 29, 2024
@Luap99
Copy link
Member

Luap99 commented Aug 29, 2024

@Honny1 You can open a PR here to update to c/common@main so we can get this race fixed for @edsantiago's parallel test work.

@Honny1
Copy link
Member

Honny1 commented Aug 29, 2024

@Luap99 I'm going for it.

@Honny1
Copy link
Member

Honny1 commented Aug 29, 2024

@Luap99 Should I vendor it in other projects? If so, which ones?

@mtrmac
Copy link
Collaborator

mtrmac commented Aug 29, 2024

libimage is, IIRC, used in Buildah and Podman; doing this for Buildah as well seems valuable to me.

@mtrmac
Copy link
Collaborator

mtrmac commented Aug 29, 2024

@Honny1 More importantly, the c/common work changes c/common API — so please prepare PRs to update all users, to avoid a later unexpected surprise.

@Honny1
Copy link
Member

Honny1 commented Aug 30, 2024

Here is PR vendoring for Buildah: containers/buildah#5710

Honny1 added a commit to Honny1/podman that referenced this issue Aug 30, 2024
This commit vendor pre-release version of `c/common:8483ef6022b4`.
It also adapts the code to the new `c/common/libimage` API, which
fixes an image listing race that was listing false warnings.

fixes: containers#23331

Signed-off-by: Jan Rodák <[email protected]>
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 29, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration jira kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
5 participants