[sharktank] Update shark-ai CIs with latest install #609
base: main
We want to pull in the IREE pre-release rather than stable so that we catch regressions earlier and can fix them.
.github/workflows/ci_eval.yaml
Outdated
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
pip install shark-ai[apps]
python -m pip install sharktank -f https://github.com/nod-ai/SHARK-Platform/releases/expanded_assets/dev-wheels
You probably want --upgrade --pre here, to use the latest nightly. This run did not install the latest nightly; it kept a cached version of the dev packages: https://github.com/nod-ai/shark-ai/actions/runs/12024091575/job/33519042561?pr=609#step:5:88
Looking in links: https://github.com/nod-ai/SHARK-Platform/releases/expanded_assets/dev-wheels
Requirement already satisfied: sharktank in /data/actions-runner-llama/_work/_tool/Python/3.11.10/x64/lib/python3.11/site-packages (3.1.0.dev0)
The latest package at https://github.com/nod-ai/shark-ai/releases/tag/dev-wheels is sharktank-3.1.0rc20241126-py3-none-any.whl
We should update the docs (https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md) to cover the upgrade case as well as the fresh install case. These runners using preexisting (or cached) virtual environments makes that especially important.
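To illustrate the upgrade case, here is a minimal sketch of the command with the two suggested flags added. The find-links URL is the one from the log above; the helper name nightly_install_cmd is hypothetical, and the sketch only prints the command so it has no side effects on a runner.

```shell
#!/usr/bin/env bash
# Sketch only: print (not run) the pip command for upgrading to the latest nightly.
# nightly_install_cmd is a hypothetical helper, not part of shark-ai.
DEV_WHEELS_URL="https://github.com/nod-ai/SHARK-Platform/releases/expanded_assets/dev-wheels"

nightly_install_cmd() {
  # --upgrade replaces an already-installed version instead of stopping at
  #   "Requirement already satisfied".
  # --pre allows pip to select pre-release versions such as release candidates.
  echo "python -m pip install --upgrade --pre sharktank -f ${DEV_WHEELS_URL}"
}

nightly_install_cmd
```

In a workflow step the printed command would be run directly. With both flags, an environment that already holds 3.1.0.dev0 should move to the newer rc build, since release candidates sort after dev releases of the same version.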
Most jobs running on |
Makes sense. But I think we can at least use stable instead of pinned version (which needs to be updated every time). So pre-submits will have stable to unblock PRs & track |
.github/workflows/ci-shark-ai.yml
Outdated
pip install -f https://iree.dev/pip-release-links.html --upgrade --pre \
iree-base-compiler \
iree-base-runtime
pip install shark-ai[apps]
I would prefer that this test keep building from source, as opposed to using nightlies for sharktank/shortfin, and the pinned version of IREE for those packages.
It's a PR-triggered test, and the intent is to catch regressions before they're merged in both sharktank and shortfin. IIUC by using nightly dev wheels we would end up catching these failures later (overnight), instead of before they get committed in the first place.
It may be useful though to create a shark-ai-nightly CI that essentially repeats the same tests with the nightly dev-wheels, shark-ai[apps], and pinned IREE, because that does add coverage for the code that we are actually releasing. It just feels like it would be a mistake to trade away catching potential regressions at the source.
It makes a ton of sense to do this for the sglang_benchmark_test though, which is performance regression focused, as opposed to functional regression focused, like this one is.
After reading the above comments, I do see how it would be helpful to pin to an IREE version and not gate merging PRs.
We should keep coverage for the latest packages while not blocking development entirely.
We could add a matrix to each job that tests both configurations, then only mark the workflows that use pinned packages as "required checks".
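As a rough sketch of the two-configuration idea, a job step could pick its install command from a variable that a matrix entry sets. The variable name PACKAGE_SOURCE, its values, and the file requirements-iree-pinned.txt are all hypothetical here; the nightly command is the one from the diff above. The sketch prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: choose the IREE install command from a configuration name.
# PACKAGE_SOURCE and requirements-iree-pinned.txt are hypothetical names.
iree_install_cmd() {
  case "$1" in
    pinned)
      # Pinned packages: the configuration that would be a required check.
      echo "pip install -r requirements-iree-pinned.txt"
      ;;
    nightly)
      # Latest pre-release packages: extra coverage, not gating merges.
      echo "pip install -f https://iree.dev/pip-release-links.html --upgrade --pre iree-base-compiler iree-base-runtime"
      ;;
    *)
      echo "unknown package source: $1" >&2
      return 1
      ;;
  esac
}

# Default to the pinned configuration when the variable is unset.
iree_install_cmd "${PACKAGE_SOURCE:-pinned}"
```

Keeping the choice in one place means both matrix entries run the same test steps and differ only in which packages get installed.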
Ok, I see we want to follow the developer guide instructions for pre-submit and the nightly releases install for nightly CIs. Will revert pre-submit changes.
# wheels saves multiple minutes and a lot of bandwidth on runner setup.
pip install --no-compile -r pytorch-cpu-requirements.txt
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
pip install shark-ai[apps]
Similar comments for this test as well. It runs periodically and "pushes" the LLM server more than the CPU integration test does. It also specifically tests for GPU.
The reason for running it periodically was that if we were to find regressions in sharktank, shortfin, or IREE, we would be able to triage them to a small subset of commits and hopefully troubleshoot faster than with a day's worth of commits. This should theoretically work well, absent infrastructure failures, which made us lose our "window" in the most recent regressions.
Thanks for adding the --pre flag. I realized that was missing in #602, which is blocked until the compilation failures are fixed. Having it missing did cause some confusion when I was troubleshooting it.
But, yeah I think this test has more value, in terms of catching regressions, building from source than from nightly packages.
.github/workflows/ci-sharktank.yml
Outdated
pip install -f https://iree.dev/pip-release-links.html --upgrade --pre \
iree-base-compiler iree-base-runtime --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
pip install shark-ai[apps]
Similar comments for this test as well. It's PR triggered and has unit tests. Seems like it would push us back in catching sharktank regressions by using nightly releases, instead of building directly from source
…-ai/shark-ai into update-perplexity-ci-install
@ScottTodd As you can see, the pre-submit CIs seem to be using different iree-turbine/compiler versions although the workflow YAML has the same command, possibly because of a cached environment.
Can you link specific logs? This PR now is changing much more than originally stated. I'd like to highlight two things:
Agreed, the PR is growing bigger; the intention was to align all pre-submits.
This reverts commit 3f1fe11.
…-ai/shark-ai into update-perplexity-ci-install
There are parts of this that are a step in the right direction, but I'm hesitant to approve because there are several outstanding issues and I have much deeper refactoring in progress for the package setup steps in these workflows that will get at the root causes of those issues:
- https://github.com/nod-ai/shark-ai/actions/runs/12130121459/job/33819823958?pr=609#step:5:175 attempts to install iree-turbine-3.0.0 but because the runners are persistent and the workflows don't either clean up their working directories or use virtual environments, there are packages already installed that conflict:
  Downloading iree_turbine-3.0.0-py3-none-any.whl (274 kB)
  Installing collected packages: iree-turbine
  Attempting uninstall: iree-turbine
  Found existing installation: iree-turbine 3.1.0
  Uninstalling iree-turbine-3.1.0:
  Successfully uninstalled iree-turbine-3.1.0
  ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
  shark-ai 3.0.0 requires iree-base-runtime==3.0.*, but you have iree-base-runtime 3.1.0rc20241202 which is incompatible.
  shark-ai 3.0.0 requires shortfin==3.0.0, but you have shortfin 3.1.0.dev0 which is incompatible.
  Successfully installed iree-turbine-3.0.0
- The iree-turbine source install should be replaced with nightly packages (Start publishing nightly Python packages iree-org/iree-turbine#305).
- The workflows that install sharktank/shortfin from source builds to run integration tests should use prebuilt packages, either dev/nightly/stable (Rework GitHub Actions workflows to build packages --> test packages #584)
My first bullet point there should be addressed with #640. We could rebase this on top once that lands. I don't have a very clear timeline yet for the two other items.
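For context on the first bullet, one common way to keep a persistent runner from reusing stale packages is to recreate a virtual environment at the start of each job. The sketch below is an assumption about what such a step could look like, not a description of what #640 does; the helper name setup_fresh_venv and the .venv path in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: recreate a virtual environment so nothing installed by an
# earlier job on a persistent runner can conflict with this job's packages.
# setup_fresh_venv is a hypothetical helper name.
setup_fresh_venv() {
  # Refuse to run without an explicit directory, so rm -rf never sees an empty path.
  local venv_dir="${1:?usage: setup_fresh_venv <dir>}"
  # Remove any environment left behind by a previous run.
  rm -rf "${venv_dir}"
  # Create a new, empty environment and activate it in the current shell.
  python -m venv "${venv_dir}"
  . "${venv_dir}/bin/activate"
}
```

A job would call setup_fresh_venv .venv before any pip install step, so that pip starts from an empty site-packages rather than from whatever 3.1.0 packages a previous job left installed.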
Great, I believe the only dependency of this PR to work as expected is #640 which resolves the unstable behavior. So when we install |
I'm more than aware - re-architecting the workflows so this class of issues is removed has been my main priority these past few weeks.
- iree-turbine version for pre-submit CIs
- sharktank, shortfin and iree-turbine for nightly CIs