Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sharktank] Update shark-ai CIs with latest install #609

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

archana-ramalingam
Copy link
Collaborator

@archana-ramalingam archana-ramalingam commented Nov 26, 2024

  • Pin iree-turbine version for pre-submit CIs
  • Pull nightly releases of sharktank, shortfin and iree-turbine for nightly CIs

@archana-ramalingam
Copy link
Collaborator Author

We want to pull in iree pre-release rather than stable to catch regressions earlier and fix it.

@archana-ramalingam archana-ramalingam deleted the update-perplexity-ci-install branch November 26, 2024 17:31
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
pip install shark-ai[apps]
python -m pip install sharktank -f https://github.com/nod-ai/SHARK-Platform/releases/expanded_assets/dev-wheels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want --upgrade --pre here, to use the latest nightly. This did not install the latest nightly, it kept some cached version of the dev packages: https://github.com/nod-ai/shark-ai/actions/runs/12024091575/job/33519042561?pr=609#step:5:88

Looking in links: https://github.com/nod-ai/SHARK-Platform/releases/expanded_assets/dev-wheels
Requirement already satisfied: sharktank in /data/actions-runner-llama/_work/_tool/Python/3.11.10/x64/lib/python3.11/site-packages (3.1.0.dev0)

https://github.com/nod-ai/shark-ai/releases/tag/dev-wheels latest is sharktank-3.1.0rc20241126-py3-none-any.whl

We should update the docs (https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md) to cover the upgrade case as well as the fresh install case. These runners using preexisting (or cached) virtual environments makes that especially important.

@ScottTodd
Copy link
Member

We want to pull in iree pre-release rather than stable to catch regressions earlier and fix it.

Most jobs running on pull_request and push triggers should use pinned versions, so they are predictable. Scheduled jobs can use unpinned / latest versions for early signal.

@archana-ramalingam archana-ramalingam restored the update-perplexity-ci-install branch November 26, 2024 17:43
@archana-ramalingam
Copy link
Collaborator Author

Makes sense. But I think we can at least use stable instead of pinned version (which needs to be updated every time). So pre-submits will have stable to unblock PRs & track shark-ai regressions and nightly will have pre-release tested. Reopening this.

pip install -f https://iree.dev/pip-release-links.html --upgrade --pre \
iree-base-compiler \
iree-base-runtime
pip install shark-ai[apps]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that this test stay as building from source, as opposed to using nightlies for sharktank/shortfin, and the pinned version of IREE for those packages.

It's a PR triggered test, and the intent is to catch regressions before they're merged in both sharktank and shortfin. IIUC by using nightly dev wheels we would end up catching these failures later (overnight), instead of before they get committed in the first place.

It may be useful though to create a shark-ai-nightly ci that essentially repeats the same tests with the nightly dev-wheels,shark-ai[apps], and pinned IREE, because that does add coverage for the code that we are actually releasing. Just feels like it would be a mistake to trade catching potential regressions at the source.

It makes a ton of sense to do this for the sglang_benchmark_test though, which is performance regression focused, as opposed to functional regression focused, like this one is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading above comments, I do see how it would be helpful to pin to an IREE version and not gate merging PRs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep coverage for the latest packages while not blocking development entirely.

Could add a matrix to each job that tests both configurations, then only mark the workflows that use pinned packages "required checks".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see we want to follow the developer guide instructions for pre-submit and the nightly releases install for nightly CIs. Will revert pre-submit changes.

# wheels saves multiple minutes and a lot of bandwidth on runner setup.
pip install --no-compile -r pytorch-cpu-requirements.txt
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
pip install shark-ai[apps]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments for this test as well. It runs periodically and "pushes" the LLM server more than the CPU integration test does. It also specifically tests for GPU.

The reason for setting it periodically was that if we were to find regressions in sharktank, shortfin, or IREE, we would be able to triage it to a small subset of commits and hopefully trouble shoot faster than with a days worth of commits. This should theoretically work well, absent of infrastructure failures, which made us lose our "window" in most recent regressions.

Thanks for adding the --pre flag. I realized that was missing in #602, which is blocked until the compilation failures are fixed. Having it missing did cause some confusion when I was troubleshooting it.

But, yeah I think this test has more value, in terms of catching regressions, building from source than from nightly packages.

pip install -f https://iree.dev/pip-release-links.html --upgrade --pre \
iree-base-compiler iree-base-runtime --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
pip install shark-ai[apps]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments for this test as well. It's PR triggered and has unit tests. Seems like it would push us back in catching sharktank regressions by using nightly releases, instead of building directly from source

@archana-ramalingam
Copy link
Collaborator Author

@ScottTodd As you can see the pre-submit CIs seem to be using different iree-turbine/compiler versions although workflow yaml has same command, possible caching env.
What's a reliable way to pull the latest stable IREE release? Or do we just pin to latest known good version? If latter, we need a schedule to update good IREE version across all pre-submits.

@ScottTodd
Copy link
Member

@ScottTodd As you can see the pre-submit CIs seem to be using different iree-turbine/compiler versions although workflow yaml has same command, possible caching env. What's a reliable way to pull the latest stable IREE release? Or do we just pin to latest known good version? If latter, we need a schedule to update good IREE version across all pre-submits.

Can you link specific logs? This PR now is changing much more than originally stated.

I'd like to highlight two things:

@archana-ramalingam
Copy link
Collaborator Author

archana-ramalingam commented Nov 27, 2024

@ScottTodd As you can see the pre-submit CIs seem to be using different iree-turbine/compiler versions although workflow yaml has same command, possible caching env. What's a reliable way to pull the latest stable IREE release? Or do we just pin to latest known good version? If latter, we need a schedule to update good IREE version across all pre-submits.

Can you link specific logs? This PR now is changing much more than originally stated.

I'd like to highlight two things:

Agree the PR is growing bigger, intention was to align all pre-submits.
For context though, here are the logs for inconsistent installation (probably linked to PR 19305). Perplexity CI has pre-release versions but consistent installs but benchmark has inconsistent installs, probably cached installs.

@archana-ramalingam archana-ramalingam changed the title [sharktank] Update perplexity CI install [sharktank] Update shark-ai CIs with latest install Nov 28, 2024
Copy link
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are parts of this that are a step in the right direction, but I'm hesitant to approve because there are several outstanding issues and I have much deeper refactoring in progress for the package setup steps in these workflows that will get at the root causes of those issues:

  • https://github.com/nod-ai/shark-ai/actions/runs/12130121459/job/33819823958?pr=609#step:5:175 attempts to install iree-turbine-3.0.0 but because the runners are persistent and the workflows don't either clean up their working directories or use virtual environments, there are packages already installed that conflict:
    Downloading iree_turbine-3.0.0-py3-none-any.whl (274 kB)
    Installing collected packages: iree-turbine
      Attempting uninstall: iree-turbine
        Found existing installation: iree-turbine 3.1.0
        Uninstalling iree-turbine-3.1.0:
          Successfully uninstalled iree-turbine-3.1.0
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    shark-ai 3.0.0 requires iree-base-runtime==3.0.*, but you have iree-base-runtime 3.1.0rc20241202 which is incompatible.
    shark-ai 3.0.0 requires shortfin==3.0.0, but you have shortfin 3.1.0.dev0 which is incompatible.
    Successfully installed iree-turbine-3.0.0
    
  • The iree-turbine source install should be replaced with nightly packages (Start publishing nightly Python packages iree-org/iree-turbine#305).
  • The workflows that install sharktank/shortfin from source builds to run integration tests should use prebuilt packages, either dev/nightly/stable (Rework GitHub Actions workflows to build packages --> test packages #584)

@ScottTodd
Copy link
Member

My first bullet point there should be addressed with #640. We could rebase this on top once that lands. I don't have a very clear timeline yet for the two other items.

@archana-ramalingam
Copy link
Collaborator Author

My first bullet point there should be addressed with #640. We could rebase this on top once that lands. I don't have a very clear timeline yet for the two other items.

Great, I believe the only dependency of this PR to work as expected is #640 which resolves the unstable behavior. So when we install iree-turbine==3.0.0, it will get us iree-base-runtime==3.0.0 and iree-base-compiler==3.0.0. Points 2 and 3 can be addressed in a separate PR.
Reason for urgency is every time an IREE regression happens, unpinned IREE crashes pre-submits and blocks PR merges, like today's regression. Pre-submits must only catch sharktank regressions and nightly should catch IREE/sharktank+IREE ones.

@ScottTodd
Copy link
Member

Reason for urgency is every time an IREE regression happens, unpinned IREE crashes pre-submits and blocks PR merges, like today's regression. Pre-submits must only catch sharktank regressions and nightly should catch IREE/sharktank+IREE ones.

I'm more than aware - re-architecting the workflows so this class of issues is removed has been my main priority these past few weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants