
Unit tests seem to fail silently and intermittently (Error: The operation was canceled) #689

Closed
sogartar opened this issue Dec 12, 2024 · 6 comments

Comments

@sogartar
Contributor

Run pytest -n 4 sharktank/ --durations=10
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-8.0.0, pluggy-1.5.0
rootdir: /home/runner/work/shark-ai/shark-ai/sharktank
configfile: pyproject.toml
plugins: anyio-4.7.0, xdist-3.5.0, html-4.1.1, metadata-3.1.1
created: 4/4 workers
4 workers [287 items]

ssssssssssssssss.............sss.......................sssss....ss...... [ 25%]
..........................ssssssssss......................ss.....s...... [ 50%]
.s...................................................................... [ 75%]
.......................................
Error: The operation was canceled.

https://github.com/nod-ai/shark-ai/actions/runs/12303597939/job/34339067931?pr=663

I tried the same with only one test worker and it passed.

I am not sure what is going on. Maybe some native lib is crashing.
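One way to get more signal if a native library really is crashing a worker is to make sure fault handlers are active; a minimal sketch (assuming pytest's built-in faulthandler support is not already giving enough detail), not something from this repo:

```python
# conftest.py (hypothetical diagnostic addition):
# dump Python tracebacks of all threads when the process receives a fatal
# signal, so a crashing native extension does not die completely silently.
import faulthandler

faulthandler.enable(all_threads=True)
```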

@ScottTodd
Member

From the summary page: https://github.com/nod-ai/shark-ai/actions/runs/12303597939?pr=663

Unit Tests and Type Checking (3.11, ubuntu-24.04)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

If the job had been cancelled by another job starting or by a manual trigger, that would have been reflected in the UI. If this happens again, resource exhaustion is a possibility. Otherwise, this looks like an issue on GitHub's side.

@ScottTodd
Member

Similar failure on another run, using a self-hosted runner: https://github.com/nod-ai/shark-ai/actions/runs/12303430880

https://www.githubstatus.com/ reports no current incidents...

@sogartar
Contributor Author

sogartar commented Dec 12, 2024

I think the job should not have been canceled, so some sort of resource exhaustion is probably the most reasonable assumption. I will try with 3 test workers instead of 4.

@ScottTodd
Member

Oh, wait. This is on a pull request that adds a new test: #663?

Make sure that unit test jobs are only running unit tests, not larger integration tests.

Standard GitHub-hosted runners currently have 16GB of RAM, 14GB of SSD, and 4 processors: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

Unit tests should also run in seconds (or low minutes).
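For example, a minimal sketch of keeping heavier tests out of the unit-test job, assuming an `integration` marker is registered in the pytest config (the marker name and tests below are illustrative, not from the repo):

```python
import pytest


@pytest.mark.integration  # hypothetical marker; register it under [tool.pytest.ini_options] markers
def test_full_model_export():
    # Heavy test: exports a real model, allocates large tensors, etc.
    ...


def test_config_parsing():
    # Lightweight unit test: runs in milliseconds with tiny inputs.
    ...
```

The unit-test CI job could then run `pytest -n 4 -m "not integration" sharktank/`, so only the lightweight tests compete for the roughly 4 GB of RAM available per worker (16 GB split across 4 xdist workers).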

@sogartar
Contributor Author

@ScottTodd you were right about the resource exhaustion. It turned out that some tests were exporting MLIR into memory with large constants.
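To illustrate why inlined constants add up quickly, here is a rough back-of-the-envelope sketch with made-up sizes (not the actual tensors from the tests):

```python
# Hypothetical sizes, for illustration only.
shape = (8192, 8192)          # one fp32 constant baked into the exported MLIR
bytes_per_element = 4
constants_per_test = 4
workers = 4                   # pytest -n 4

per_constant = shape[0] * shape[1] * bytes_per_element  # 256 MiB
per_worker = per_constant * constants_per_test          # ~1 GiB
peak = per_worker * workers                             # ~4 GiB of constants alone
print(f"{per_constant / 2**20:.0f} MiB per constant, ~{peak / 2**30:.0f} GiB peak across workers")
```

On top of the in-memory MLIR representation of those constants, numbers like these can easily push a 16 GB runner into memory exhaustion, which then surfaces as the runner shutting down mid-job.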

@sogartar
Contributor Author

I fixed the offending tests in PR #663. They are still using larger tensors than they should; it will require a bit more work to create toy-sized model tests. We will address this at some point.
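A minimal sketch of what the toy-sized direction could look like (illustrative names and dimensions, not the actual sharktank test code):

```python
import pytest
import torch


# Hypothetical toy configuration: small enough that export stays in the
# tens-of-megabytes range even with constants inlined.
TOY_SHAPES = [(8, 16), (16, 32)]


@pytest.mark.parametrize("shape", TOY_SHAPES)
def test_linear_smoke(shape):
    layer = torch.nn.Linear(*shape)
    x = torch.randn(2, shape[0])
    y = layer(x)
    assert y.shape == (2, shape[1])
```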
