-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Epic: ginkgo: remove -flakeAttempts 3 #17967
Comments
+1 from my side We must collectively pay more attention to flakes. I also agree that we first need to tackle the known flakes before opening Pandora's box. |
@mheon something to consider for bug week(s). |
In general I agree with the assessment however as we talked some weeks ago it is painful having to rerun the full test suite just because a single test flaked, it would be much better if just this single flake was rerun.
Is there any reason to not just fix that? Let's take https://api.cirrus-ci.com/v1/artifact/task/5464387922690048/html/int-podman-debian-12-root-host-boltdb.log.html for example, the log exactly shows that there was a single flake so why do you have no way of knowing? |
The logfile shows a flake happening... but I do not see logfiles. What I do is ask Cirrus: "for build X, please give me the status of all tasks that ran". ("X" is a long Cirrus number associated with a PR). I then run through that list and fetch logs only for tasks that failed. I do not fetch logs for tasks that passed, hence do not see tasks with retry-and-pass flakes. I may need to reevaluate that design decision. It will not be easy. |
For future reference: ginkgo v2 includes a FlakeAttempts decorator which could be especially useful in |
Yes exactly , I was going to suggest this as better solution for this. |
A friendly reminder that this issue had no activity for 30 days. |
- trust_test: adding 'Ordered' seems to resolve a very common flake. I've tested this for dozens of CI runs, and haven't seen the flake recur (normally it fails every few runs). - exec and search tests: add FlakeAttempts(3). This is a NOP under our current CI setup, in which we run ginkgo with a global --flake-attempts=3. I am submitting this as an optimistic step toward a no-flake-attempts world (containers#17967) Fixes: containers#18358 Signed-off-by: Ed Santiago <[email protected]>
With the merge of #18816 we are much, much closer to accomplishing this. |
It's hurting us. We need to make a concerted effort to remove it.
Best I can tell, there are three kinds of e2e flakes:
I believe the vast majority of the test failures I'm seeing are 3, test bugs. Ginkgo runs tests in parallel, and I don't think we're doing enough locking around (e.g.)
podman system reset
ornetwork
orgpg
tests. There are also simple race conditions. I suspect, but cannot be certain, that these are test bugs: #17940, #17946, #17957, #17958, #17966.Unfortunately, by hiding flakes we also hide number 2, real podman bugs. You all know the nightmares we're having right now with sqlite: those were in the test logs all along, we could've caught them earlier if we hadn't been retrying flakes. I hope that's a strong enough argument for my position, because we truly have no idea how many other such bugs are hiding in plain sight.
And, sigh, network/registry flakes. This is a valid use case for retrying; I just don't know how to do it. It might be necessary to go test-by-test, looking for
search
orpull
, and wrap those in a retry mechanism. Thoughts welcome.Reminder: my flake logger only catches triple failures, the ones that force you to press Rerun. When e2e tests single-fail, I have no way of knowing about it. So I'm pretty sure these single-fails are happening every day, on every run, and we're just not seeing them.
And, don't panic: we can't remove
-flakeAttempts
any time soon. We have way way way way way too many test-bug flakes, so many that it takes 4-10 retries, and many hours, just to get CI passing on a single PR. We first need to fix the bug-test flakes, which means we need to identify them, which means I need to find a better way of tracking them. Although Evil-Twin Ed thinks that disabling retries right now would make everyone highly motivated to fix the test-bug flakes...The text was updated successfully, but these errors were encountered: