Make nightly failures more visible to developers #127

Open
vyasr opened this issue Dec 12, 2024 · 4 comments

Comments

@vyasr
Contributor

vyasr commented Dec 12, 2024

Currently, when RAPIDS nightly runs fail, we rely on developers to actively monitor either the GHA tab or the Slack channels where we post the results. As a result, some projects have their nightly CI broken for long periods of time, often because of real bugs that go unfixed until release (or, in the worst case, never). To improve this situation, I propose adding an extra check to our PR CI that measures how long it has been since the last successful nightly run of a CI job. If it has been too long (by some metric), the check fails and blocks PR merging. This will force more developers to be aware of nightly failures and deal with them proactively.

@bdice
Contributor

bdice commented Dec 12, 2024

Additional points from offline discussion:

  • We think a week with no nightly CI successes is a good starting point for when to block PRs
  • We will start with opt-in behavior and switch to opt-out during 25.02 burndown. This gives a bit of time before we force CI to fail
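
For illustration, the decision rule discussed above (fail the PR check once a week has passed with no nightly success) could be sketched as follows. This is a minimal sketch, not the actual RAPIDS implementation: `NightlyRun`, `should_block_pr`, and the handling of the "no success in the available history" case are all assumptions made for the example, and fetching run history from GitHub Actions is deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, NamedTuple


class NightlyRun(NamedTuple):
    """Hypothetical record of one nightly run (not a real API type)."""
    completed_at: datetime
    conclusion: str  # assumed to be "success" or "failure"


def should_block_pr(
    runs: Iterable[NightlyRun],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # "a week" from the discussion
) -> bool:
    """Return True if the PR check should fail because nightlies are stale."""
    runs = list(runs)
    successes = [r.completed_at for r in runs if r.conclusion == "success"]
    if successes:
        # Block only if the most recent success is older than the window.
        return now - max(successes) > max_age
    if not runs:
        # Assumption: with no history at all, do not block.
        return False
    # Assumption: with only failures on record, block once the oldest
    # known failure is older than the window.
    return now - min(r.completed_at for r in runs) > max_age


now = datetime(2024, 12, 20, tzinfo=timezone.utc)
history = [
    NightlyRun(datetime(2024, 12, 10, tzinfo=timezone.utc), "success"),
    NightlyRun(datetime(2024, 12, 15, tzinfo=timezone.utc), "failure"),
    NightlyRun(datetime(2024, 12, 19, tzinfo=timezone.utc), "failure"),
]
print(should_block_pr(history, now))  # True: last success was 10 days ago
```

Keeping the rule as a pure function of the run history and the current time makes the threshold easy to adjust and test; the opt-in phase could be modeled by having each repository decide whether to call the check at all.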

@jameslamb
Member

I'm generally supportive of making nightly test failures harder to ignore, and I think blocking PR CI is an effective tool for that. I support this!

When this rolls out, let's be vigilant in packaging-codeowners against "fixes" for the tests that over-tighten runtime pins.

@pentschev
Member

What's the strategy in case CI has been failing for 1+ week and we need an urgent fix? For example, with the holidays coming up, if some upstream package breaks CI just as everyone goes out the door, it will be hard to get a fix merged when we come back.

@vyasr
Contributor Author

vyasr commented Dec 20, 2024

The rest of CI will run even if the nightly job fails, so we can request admin merges if we see that a PR is otherwise passing CI.
