Unit tests seem to fail silently and intermittently (Error: The operation was canceled) #689
Comments
From the summary page: https://github.com/nod-ai/shark-ai/actions/runs/12303597939?pr=663
If the job was cancelled by another job starting or by a manual trigger, then that would have been reflected in the UI. If this happens again, then resource exhaustion is a possibility. Otherwise, this looks like an issue on GitHub's side.
Similar failure on another run, using a self-hosted runner: https://github.com/nod-ai/shark-ai/actions/runs/12303430880
https://www.githubstatus.com/ reports no current incidents...
I think the job should not have been canceled, so some sort of resource exhaustion is probably the most reasonable assumption. I will try with 3 jobs instead of 4.
Oh, wait. This is on a pull request that adds a new test: #663? Make sure that unit test jobs are only running unit tests, not larger integration tests. Standard GitHub-hosted runners currently have 16 GB of RAM, 14 GB of SSD, and 4 processors: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories Unit tests should also run in seconds (or low minutes).
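One way to check a test against the "seconds, not minutes" guideline above is to time it and record its peak Python heap usage. This is a minimal sketch using only the standard library. `measure` and `sample_workload` are hypothetical names, not part of shark-ai, and `tracemalloc` only tracks allocations made through Python's allocator, so memory allocated by native libraries may not show up.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def sample_workload(n):
    # Hypothetical stand-in for a test body: builds a list of n ints and sums it.
    return sum(list(range(n)))


if __name__ == "__main__":
    result, elapsed, peak = measure(sample_workload, 100_000)
    print(f"result={result} elapsed={elapsed:.3f}s peak={peak / 2**20:.1f} MiB")
```

A wrapper like this could be applied to a suspect test locally to see whether it is closer to a unit test or an integration test in cost.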
@ScottTodd you were right about the resource exhaustion. It turned out that some tests were exporting MLIR into memory with large constants.
I fixed the offending tests in PR #663. They are still using larger tensors than they should. It will require a bit more work to create toy-sized model tests. We will address this at some point.
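To put rough numbers on why large constants can exhaust a runner while toy-sized ones do not, here is a back-of-the-envelope sketch. The shapes, tensor counts, and helper names are made up for illustration and are not taken from the tests in #663. It also assumes every parallel test worker holds its own copy of the tensors.

```python
import math

# Bytes per element for a few common dtypes.
DTYPE_BYTES = {"float16": 2, "float32": 4, "float64": 8}


def tensor_bytes(shape, dtype="float32"):
    """Size in bytes of a dense tensor with the given shape and dtype."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def total_mib(shapes, dtype="float32", workers=1):
    """Rough footprint in MiB if each of `workers` processes holds all tensors."""
    per_worker = sum(tensor_bytes(s, dtype) for s in shapes)
    return workers * per_worker / 2**20


if __name__ == "__main__":
    large = [(4096, 4096)] * 32  # hypothetical "real model" sized weights
    toy = [(8, 8)] * 32          # hypothetical toy-sized weights
    print(total_mib(large, workers=4))  # 8192.0
    print(total_mib(toy, workers=4))    # 0.03125
```

Under these assumptions, four workers with the larger shapes would need about 8 GiB for the raw tensor data alone, before any copies made during export, which is a large share of a 16 GB runner.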
https://github.com/nod-ai/shark-ai/actions/runs/12303597939/job/34339067931?pr=663
I tried the same with only one test worker and it passed.
I am not sure what is going on. Maybe some native lib is crashing.