
Epic: ginkgo: remove -flakeAttempts 3 #17967

Closed
edsantiago opened this issue Mar 28, 2023 · 9 comments · Fixed by #23662
Labels
locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@edsantiago
Member

The global `-flakeAttempts 3` setting is hurting us. We need to make a concerted effort to remove it.

Best I can tell, there are three kinds of e2e flakes:

  1. network or registry (or other resource) flakes
  2. real podman bugs
  3. bugs in the tests themselves

I believe the vast majority of the test failures I'm seeing are type 3, test bugs. Ginkgo runs tests in parallel, and I don't think we're doing enough locking around (e.g.) podman system reset, or around the network and gpg tests. There are also simple race conditions. I suspect, but cannot be certain, that these are test bugs: #17940, #17946, #17957, #17958, #17966.
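To make the locking point concrete, here is a minimal sketch, assuming ginkgo v2's Serial decorator; the podmanTest helper calls follow the e2e suite's style from memory and are illustrative, not verbatim:

```go
package integration

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	. "github.com/onsi/gomega/gexec"
)

// Sketch: specs that mutate global state (storage, networks, gpg keyrings)
// can be decorated Serial so ginkgo never runs them concurrently with any
// other spec. podmanTest and its Podman helper are assumed from the
// surrounding e2e suite.
var _ = Describe("podman system reset", Serial, func() {
	It("resets storage without racing parallel specs", func() {
		session := podmanTest.Podman([]string{"system", "reset", "--force"})
		session.WaitWithDefaultTimeout()
		Expect(session).Should(Exit(0))
	})
})
```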

Unfortunately, by hiding flakes we also hide number 2, real podman bugs. You all know the nightmares we're having right now with sqlite: those failures were in the test logs all along, and we could have caught them earlier if we hadn't been retrying flakes. I hope that's a strong enough argument for my position, because we truly have no idea how many other such bugs are hiding in plain sight.

And, sigh, network/registry flakes. This is a valid use case for retrying; I just don't know how best to do it. It might be necessary to go test-by-test, looking for search or pull, and wrap those operations in a retry mechanism. Thoughts welcome.
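One possible shape for that mechanism, as a hedged sketch: retryNetworkOp is a hypothetical helper, not an existing one, and this reuses the imports and podmanTest helper assumed in the sketch above, plus "fmt" and "time".

```go
// Hypothetical helper: retry only operations that hit external
// registries (pull, search), leaving every other failure fatal.
func retryNetworkOp(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return err
}

// Usage inside a spec: only the network-bound pull is retried.
var _ = It("pulls from a flaky external registry", func() {
	err := retryNetworkOp(3, 2*time.Second, func() error {
		session := podmanTest.Podman([]string{"pull", "quay.io/libpod/alpine"})
		session.WaitWithDefaultTimeout()
		if session.ExitCode() != 0 {
			return fmt.Errorf("pull exited %d", session.ExitCode())
		}
		return nil
	})
	Expect(err).ToNot(HaveOccurred())
})
```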

Reminder: my flake logger only catches triple failures, the ones that force you to press Rerun. When e2e tests single-fail, I have no way of knowing about it. So I'm pretty sure these single-fails are happening every day, on every run, and we're just not seeing them.

And, don't panic: we can't remove -flakeAttempts any time soon. We have way, way too many test-bug flakes, so many that it takes 4-10 retries, and many hours, just to get CI passing on a single PR. We first need to fix those test-bug flakes, which means we need to identify them, which means I need to find a better way of tracking them. Although Evil-Twin Ed thinks that disabling retries right now would make everyone highly motivated to fix the test-bug flakes...

@vrothberg
Member

+1 from my side

We must collectively pay more attention to flakes. I also agree that we first need to tackle the known flakes before opening Pandora's box.

@vrothberg
Member

@mheon something to consider for bug week(s).

@Luap99
Member

Luap99 commented Mar 29, 2023

In general I agree with the assessment. However, as we discussed some weeks ago, it is painful having to rerun the full test suite just because a single test flaked; it would be much better if only that single test were rerun.

> Reminder: my flake logger only catches triple failures, the ones that force you to press Rerun. When e2e tests single-fail, I have no way of knowing about it. So I'm pretty sure these single-fails are happening every day, on every run, and we're just not seeing them.

Is there any reason not to just fix that? Take https://api.cirrus-ci.com/v1/artifact/task/5464387922690048/html/int-podman-debian-12-root-host-boltdb.log.html for example: the log clearly shows that there was a single flake, so why do you have no way of knowing?

@edsantiago
Member Author

The logfile shows a flake happening... but I never see that logfile. What I do is ask Cirrus: "for build X, please give me the status of all tasks that ran" ("X" being the long Cirrus build number associated with a PR). I then run through that list and fetch logs only for tasks that failed. I do not fetch logs for tasks that passed, hence never see tasks with retry-and-pass flakes.

I may need to reevaluate that design decision. It will not be easy.
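For concreteness, the "status of all tasks" lookup is roughly the following sketch against Cirrus's GraphQL endpoint; the query's field names are recalled from the Cirrus API and should be treated as assumptions. Catching retry-and-pass flakes would mean also fetching logs for tasks whose status is successful, not only the failed ones.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// BUILD_ID stands in for the long Cirrus build number tied to a PR.
	query := []byte(`{"query":"query { build(id: \"BUILD_ID\") { tasks { name status } } }"}`)
	resp, err := http.Post("https://api.cirrus-ci.com/graphql",
		"application/json", bytes.NewReader(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON listing each task's name and status
}
```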

@edsantiago
Member Author

For future reference: ginkgo v2 includes a FlakeAttempts decorator which could be especially useful in podman search tests.
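In ginkgo v2 the decorator attaches to an individual spec, roughly like this sketch (same assumed imports and helpers as the earlier sketches):

```go
// Sketch: per-spec retries for a registry-bound test only. With this in
// place, the global --flake-attempts=3 can eventually be dropped without
// re-flaking tests that genuinely depend on an external registry.
var _ = It("podman search", FlakeAttempts(3), func() {
	session := podmanTest.Podman([]string{"search", "alpine"})
	session.WaitWithDefaultTimeout()
	Expect(session).Should(Exit(0))
})
```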

@Luap99
Member

Luap99 commented Apr 14, 2023

> For future reference: ginkgo v2 includes a FlakeAttempts decorator which could be especially useful in podman search tests.

Yes, exactly. I was going to suggest this as the better solution. FlakeAttempts should then only be used for flakes we know are unfixable (pull/search against external registries); everything else must be fixed instead.

edsantiago changed the title ginkgo: remove -flakeAttempts 3 → Epic: ginkgo: remove -flakeAttempts 3 Apr 20, 2023
Luap99 mentioned this issue May 11, 2023
@github-actions

A friendly reminder that this issue had no activity for 30 days.

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 7, 2023
- trust_test: adding 'Ordered' seems to resolve a very common
  flake. I've tested this for dozens of CI runs, and haven't
  seen the flake recur (normally it fails every few runs).

- exec and search tests: add FlakeAttempts(3). This is a NOP
  under our current CI setup, in which we run ginkgo with
  a global --flake-attempts=3. I am submitting this as an
  optimistic step toward a no-flake-attempts world (containers#17967)

Fixes: containers#18358

Signed-off-by: Ed Santiago <[email protected]>
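For reference, the 'Ordered' decorator mentioned in the commit makes a container's specs run sequentially, in the order written, on a single parallel process. A rough sketch of the trust_test shape, with spec bodies elided:

```go
// Sketch: an Ordered container serializes its own specs, which removes
// races between trust tests that share on-disk policy.json state.
var _ = Describe("podman image trust", Ordered, func() {
	BeforeAll(func() {
		// one-time setup shared by all specs in this ordered container
	})
	It("set", func() { /* ... */ })
	It("show", func() { /* ... */ })
})
```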
@edsantiago
Member Author

With the merge of #18816 we are much, much closer to accomplishing this.

stale-locking-app bot added the locked - please file new issue/PR label Nov 18, 2024
stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 18, 2024