
Epic: ginkgo: remove -flakeAttempts 3 #17967

Closed
edsantiago opened this issue Mar 28, 2023 · 9 comments · Fixed by #23662
Labels
locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@edsantiago
Member

The global `-flakeAttempts 3` setting is hurting us. We need to make a concerted effort to remove it.

Best I can tell, there are three kinds of e2e flakes:

  1. network or registry (or other resource) flakes
  2. real podman bugs
  3. bugs in the tests themselves

I believe the vast majority of the test failures I'm seeing are type 3, test bugs. Ginkgo runs tests in parallel, and I don't think we're doing enough locking around (e.g.) podman system reset, or around the network and gpg tests. There are also simple race conditions. I suspect, but cannot be certain, that these are test bugs: #17940, #17946, #17957, #17958, #17966.
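To make the locking point concrete, here is a minimal sketch, assuming ginkgo v2's Serial decorator; the podmanTest helper calls follow the e2e suite's style from memory and are illustrative, not verbatim:

```go
package integration

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	. "github.com/onsi/gomega/gexec"
)

// Sketch: specs that mutate global state (storage, networks, gpg keyrings)
// can be decorated Serial so ginkgo never runs them concurrently with any
// other spec. podmanTest and its Podman helper are assumed from the
// surrounding e2e suite.
var _ = Describe("podman system reset", Serial, func() {
	It("resets storage without racing parallel specs", func() {
		session := podmanTest.Podman([]string{"system", "reset", "--force"})
		session.WaitWithDefaultTimeout()
		Expect(session).Should(Exit(0))
	})
})
```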

Unfortunately, by hiding flakes we also hide number 2, real podman bugs. You all know the nightmares we're having right now with sqlite: those failures were in the test logs all along, and we could have caught them earlier if we hadn't been retrying flakes. I hope that's a strong enough argument for my position, because we truly have no idea how many other such bugs are hiding in plain sight.

And, sigh, network/registry flakes. This is a valid use case for retrying; I just don't know how best to do it. It might be necessary to go test-by-test, looking for search or pull, and wrap those operations in a retry mechanism. Thoughts welcome.
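One possible shape for that mechanism, as a hedged sketch: retryNetworkOp is a hypothetical helper, not an existing one, and this reuses the imports and podmanTest helper assumed in the sketch above, plus "fmt" and "time".

```go
// Hypothetical helper: retry only operations that hit external
// registries (pull, search), leaving every other failure fatal.
func retryNetworkOp(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return err
}

// Usage inside a spec: only the network-bound pull is retried.
var _ = It("pulls from a flaky external registry", func() {
	err := retryNetworkOp(3, 2*time.Second, func() error {
		session := podmanTest.Podman([]string{"pull", "quay.io/libpod/alpine"})
		session.WaitWithDefaultTimeout()
		if session.ExitCode() != 0 {
			return fmt.Errorf("pull exited %d", session.ExitCode())
		}
		return nil
	})
	Expect(err).ToNot(HaveOccurred())
})
```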

Reminder: my flake logger only catches triple failures, the ones that force you to press Rerun. When e2e tests single-fail, I have no way of knowing about it. So I'm pretty sure these single-fails are happening every day, on every run, and we're just not seeing them.

And, don't panic: we can't remove -flakeAttempts any time soon. We have way, way too many test-bug flakes, so many that it takes 4-10 retries, and many hours, just to get CI passing on a single PR. We first need to fix those test-bug flakes, which means we need to identify them, which means I need to find a better way of tracking them. Although Evil-Twin Ed thinks that disabling retries right now would make everyone highly motivated to fix the test-bug flakes...

@vrothberg
Member

+1 from my side

We must collectively pay more attention to flakes. I also agree that we first need to tackle the known flakes before opening Pandora's box.

@vrothberg
Member

@mheon something to consider for bug week(s).

@Luap99
Member

Luap99 commented Mar 29, 2023

In general I agree with the assessment. However, as we discussed some weeks ago, it is painful having to rerun the full test suite just because a single test flaked; it would be much better if only that single test were rerun.

> Reminder: my flake logger only catches triple failures, the ones that force you to press Rerun. When e2e tests single-fail, I have no way of knowing about it. So I'm pretty sure these single-fails are happening every day, on every run, and we're just not seeing them.

Is there any reason not to just fix that? Take https://api.cirrus-ci.com/v1/artifact/task/5464387922690048/html/int-podman-debian-12-root-host-boltdb.log.html for example: the log clearly shows that there was a single flake, so why do you have no way of knowing?

@edsantiago
Member Author

The logfile shows a flake happening... but I never see that logfile. What I do is ask Cirrus: "for build X, please give me the status of all tasks that ran" ("X" being the long Cirrus build number associated with a PR). I then run through that list and fetch logs only for tasks that failed. I do not fetch logs for tasks that passed, hence never see tasks with retry-and-pass flakes.

I may need to reevaluate that design decision. It will not be easy.
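For concreteness, the "status of all tasks" lookup is roughly the following sketch against Cirrus's GraphQL endpoint; the query's field names are recalled from the Cirrus API and should be treated as assumptions. Catching retry-and-pass flakes would mean also fetching logs for tasks whose status is successful, not only the failed ones.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// BUILD_ID stands in for the long Cirrus build number tied to a PR.
	query := []byte(`{"query":"query { build(id: \"BUILD_ID\") { tasks { name status } } }"}`)
	resp, err := http.Post("https://api.cirrus-ci.com/graphql",
		"application/json", bytes.NewReader(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON listing each task's name and status
}
```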

@edsantiago
Member Author

For future reference: ginkgo v2 includes a FlakeAttempts decorator which could be especially useful in podman search tests.
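In ginkgo v2 the decorator attaches to an individual spec, roughly like this sketch (same assumed imports and helpers as the earlier sketches):

```go
// Sketch: per-spec retries for a registry-bound test only. With this in
// place, the global --flake-attempts=3 can eventually be dropped without
// re-flaking tests that genuinely depend on an external registry.
var _ = It("podman search", FlakeAttempts(3), func() {
	session := podmanTest.Podman([]string{"search", "alpine"})
	session.WaitWithDefaultTimeout()
	Expect(session).Should(Exit(0))
})
```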

@Luap99
Member

Luap99 commented Apr 14, 2023

> For future reference: ginkgo v2 includes a FlakeAttempts decorator which could be especially useful in podman search tests.

Yes, exactly. I was going to suggest this as the better solution. FlakeAttempts should then only be used for flakes we know are unfixable (pull/search against external registries); everything else must be fixed instead.

edsantiago changed the title ginkgo: remove -flakeAttempts 3 → Epic: ginkgo: remove -flakeAttempts 3 Apr 20, 2023
Luap99 mentioned this issue May 11, 2023
@github-actions

A friendly reminder that this issue had no activity for 30 days.

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 7, 2023
- trust_test: adding 'Ordered' seems to resolve a very common
  flake. I've tested this for dozens of CI runs, and haven't
  seen the flake recur (normally it fails every few runs).

- exec and search tests: add FlakeAttempts(3). This is a NOP
  under our current CI setup, in which we run ginkgo with
  a global --flake-attempts=3. I am submitting this as an
  optimistic step toward a no-flake-attempts world (containers#17967)

Fixes: containers#18358

Signed-off-by: Ed Santiago <[email protected]>
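For reference, the 'Ordered' decorator mentioned in the commit makes a container's specs run sequentially, in the order written, on a single parallel process. A rough sketch of the trust_test shape, with spec bodies elided:

```go
// Sketch: an Ordered container serializes its own specs, which removes
// races between trust tests that share on-disk policy.json state.
var _ = Describe("podman image trust", Ordered, func() {
	BeforeAll(func() {
		// one-time setup shared by all specs in this ordered container
	})
	It("set", func() { /* ... */ })
	It("show", func() { /* ... */ })
})
```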
@edsantiago
Member Author

With the merge of #18816 we are much, much closer to accomplishing this.

stale-locking-app bot added the locked - please file new issue/PR label Nov 18, 2024
stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 18, 2024