
Unit tests seem to fail silently and intermittently (Error: The operation was canceled) #689

Closed
sogartar opened this issue Dec 12, 2024 · 6 comments

Comments

@sogartar
Contributor

Run pytest -n 4 sharktank/ --durations=10
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-8.0.0, pluggy-1.5.0
rootdir: /home/runner/work/shark-ai/shark-ai/sharktank
configfile: pyproject.toml
plugins: anyio-4.7.0, xdist-3.5.0, html-4.1.1, metadata-3.1.1
created: 4/4 workers
4 workers [287 items]

ssssssssssssssss.............sss.......................sssss....ss...... [ 25%]
..........................ssssssssss......................ss.....s...... [ 50%]
.s...................................................................... [ 75%]
.......................................
Error: The operation was canceled.

https://github.com/nod-ai/shark-ai/actions/runs/12303597939/job/34339067931?pr=663

I tried the same with only one test worker and it passed.

I am not sure what is going on. Maybe some native lib is crashing.
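One way to get more signal if a native library really is crashing a worker is to make sure fault handlers are active; a minimal sketch (assuming pytest's built-in faulthandler support is not already giving enough detail), not something from this repo:

```python
# conftest.py (hypothetical diagnostic addition):
# dump Python tracebacks of all threads when the process receives a fatal
# signal, so a crashing native extension does not die completely silently.
import faulthandler

faulthandler.enable(all_threads=True)
```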

@ScottTodd
Member

From the summary page: https://github.com/nod-ai/shark-ai/actions/runs/12303597939?pr=663

Unit Tests and Type Checking (3.11, ubuntu-24.04)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

If the job had been cancelled by another job starting or by a manual trigger, that would have been reflected in the UI. If this happens again, resource exhaustion is a possibility. Otherwise, this looks like an issue on GitHub's side.

@ScottTodd
Member

Similar failure on another run, using a self-hosted runner: https://github.com/nod-ai/shark-ai/actions/runs/12303430880

https://www.githubstatus.com/ reports no current incidents...

@sogartar
Contributor Author

sogartar commented Dec 12, 2024

I think the job should not have been canceled, so some sort of resource exhaustion is probably the most reasonable assumption. I will try with 3 test workers instead of 4.

@ScottTodd
Member

Oh, wait. This is on a pull request that adds a new test: #663?

Make sure that unit test jobs are only running unit tests, not larger integration tests.

Standard GitHub-hosted runners currently have 16GB of RAM, 14GB of SSD, and 4 processors: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

Unit tests should also run in seconds (or low minutes).
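For example, a minimal sketch of keeping heavier tests out of the unit-test job, assuming an `integration` marker is registered in the pytest config (the marker name and tests below are illustrative, not from the repo):

```python
import pytest


@pytest.mark.integration  # hypothetical marker; register it under [tool.pytest.ini_options] markers
def test_full_model_export():
    # Heavy test: exports a real model, allocates large tensors, etc.
    ...


def test_config_parsing():
    # Lightweight unit test: runs in milliseconds with tiny inputs.
    ...
```

The unit-test CI job could then run `pytest -n 4 -m "not integration" sharktank/`, so only the lightweight tests compete for the roughly 4 GB of RAM available per worker (16 GB split across 4 xdist workers).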

@sogartar
Contributor Author

@ScottTodd you were right about the resource exhaustion. It turned out that some tests were exporting MLIR into memory with large constants.
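To illustrate why inlined constants add up quickly, here is a rough back-of-the-envelope sketch with made-up sizes (not the actual tensors from the tests):

```python
# Hypothetical sizes, for illustration only.
shape = (8192, 8192)          # one fp32 constant baked into the exported MLIR
bytes_per_element = 4
constants_per_test = 4
workers = 4                   # pytest -n 4

per_constant = shape[0] * shape[1] * bytes_per_element  # 256 MiB
per_worker = per_constant * constants_per_test          # ~1 GiB
peak = per_worker * workers                             # ~4 GiB of constants alone
print(f"{per_constant / 2**20:.0f} MiB per constant, ~{peak / 2**30:.0f} GiB peak across workers")
```

On top of the in-memory MLIR representation of those constants, numbers like these can easily push a 16 GB runner into memory exhaustion, which then surfaces as the runner shutting down mid-job.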

@sogartar
Contributor Author

I fixed the offending tests in PR #663. They are still using larger tensors than they should; it will require a bit more work to create toy-sized model tests. We will address this at some point.
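A minimal sketch of what the toy-sized direction could look like (illustrative names and dimensions, not the actual sharktank test code):

```python
import pytest
import torch


# Hypothetical toy configuration: small enough that export stays in the
# tens-of-megabytes range even with constants inlined.
TOY_SHAPES = [(8, 16), (16, 32)]


@pytest.mark.parametrize("shape", TOY_SHAPES)
def test_linear_smoke(shape):
    layer = torch.nn.Linear(*shape)
    x = torch.randn(2, shape[0])
    y = layer(x)
    assert y.shape == (2, shape[1])
```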
