-
Notifications
You must be signed in to change notification settings - Fork 625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate both llvm-project@2083e97e (+1 ↩️, +1 🍒) and torch-mlir@bce800a3 #17330
Conversation
Could you bump torch-mlir to one more commit forward. This will unblocked a lot SharkTestsuite ONNX model tests, which has a deadline this week.
|
Why not bump iree llvmproject into dabdec1001dc368373dd581cf72f37a440873ce? same as the torch-mlir bce800a3 bring in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The opt-125M
failures on https://github.com/iree-org/iree/actions/runs/9020826623/job/24787542994?pr=17330#step:9:39 are odd:
___________ IREE compile and run: opt-125M::gpu_vulkan_real_weights ____________
Error invoking iree-compile
Error code: 1
Stderr diagnostics:
opt-125M.mlirbc:0:0: error: attempting to parse a byte at the end of the bytecode
opt-125M.mlirbc:0:0: note: in bytecode version 6 produced by: MLIR19.0.0git
Invoked with:
cd /home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/opt-125M && iree-compile opt-125M.mlirbc --iree-hal-target-backends=vulkan-spirv -o opt-125M_gpu_vulkan_real_weights.vmfb
I don't think that file has been modified and the test should be passing before these changes. Did the bytecode format or its parsing change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still happening. @MaheshRavishankar @ScottTodd what do we do, can we check if a MLIR bytecode version bump did just occur and can we then regenerate or temporarily disable this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any recent changes in https://github.com/llvm/llvm-project/tree/main/mlir/include/mlir/Bytecode or https://github.com/llvm/llvm-project/tree/main/mlir/lib/Bytecode. Not sure why just that one test would be failing, so would like to investigate/debug.
To keep the integrate train rolling, we can disable by marking as an expected compile failure in the _models_
files at https://github.com/iree-org/iree/tree/main/build_tools/pkgci/external_test_suite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought vulkan was always failing for sdxl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the resnet test is also failing. Those (opt-125M and resnet50) are two real (new) failures in parsing the bytecode. The SDXL Vulkan failures are already marked XFAIL, ignore those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets wait for the CI to run again. This might be related to the vector.shuffle/vector.interleave
issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no this is bad! It is failing on rocm now, and maybe the error is misleading, but is it failing to read the input file?
OK because it's just one more commit (we need this process to converge now).
Because IREE requires additional fixes. |
Strange one, the constants shouldnt change this way. ¯\_(ツ)_/¯
7527742
to
3e1d6e5
Compare
There are some other tests also failing with the same bytecode parsing error. |
https://github.com/iree-org/iree/actions/runs/9023942190/job/24797359659?pr=17330#step:7:720
Okay phew (or eek), not just files that I uploaded myself. That rules out some things. Let's try to find whatever upstream change(s) affected the parsing. |
Might have a CUDA test newly timing out: https://github.com/iree-org/iree/actions/runs/9023942176/job/24797058825?pr=17330 had to cancel that job after 6 hours. |
* Edited the source .py files to remove benchmarks that failed on CI with `error: attempting to parse a byte at the end of the bytecode` * Ran `bash ./build_tools/scripts/generate_cmake_files.sh` * Fixed path separators (Windows vs Linux)
Actually this (llvm/torch-mlir#3013) turned out to cause additional problems here: #17345. This integrate has been reverted to its original torch-mlir target. Discord conversation: |
This reverts commit 3faa5b7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM when CI goes green (assuming also that benchmarks don't have any glaring regressions)
Abbreviated Benchmark Summary@ commit eab01522fae6b391ca7cb61469647c3862c11e0e (vs. base c81496c512d01cda26233932aba1358b8abe8164) Data-Tiling Comparison TableClick to show
Regressed Latencies 🚩
[Top 3 out of 7 results showed] Improved Latencies 🎉
[Top 3 out of 15 results showed] No improved or regressed compilation metrics 🏖️ For more information: |
This allows dropping our existing local-revert of llvm/llvm-project#89131 and cherry-pick of llvm/llvm-project#91654 which we had introduced in the earlier integrate #17330. This locally reverts llvm/llvm-project#90802 because it causes numerical errors, reported at llvm/llvm-project#90802 (comment).
…800a3 (iree-org#17330) * torch-mlir integrated at bce800a. * llvm-project integrated at 2083e97e plus local changes: * Reverted llvm/llvm-project#89131 locally: while this change is good in its own right, the `vector.interleave` that it generates (instead of `vector.shuffle`) are not handled by some GPU codegen lowerings. * Filed iree-org#17346. * Cherry-picked Bazel build fix: llvm/llvm-project#91654 * Several e2e tests have been temporarily disabled, follow-up work is needed to reenable them: iree-org#17344 --------- Co-authored-by: MaheshRavishankar <[email protected]> Co-authored-by: Scott Todd <[email protected]>
This allows dropping our existing local-revert of llvm/llvm-project#89131 and cherry-pick of llvm/llvm-project#91654 which we had introduced in the earlier integrate iree-org#17330. This locally reverts llvm/llvm-project#90802 because it causes numerical errors, reported at llvm/llvm-project#90802 (comment).
…800a3 (iree-org#17330) * torch-mlir integrated at bce800a. * llvm-project integrated at 2083e97e plus local changes: * Reverted llvm/llvm-project#89131 locally: while this change is good in its own right, the `vector.interleave` that it generates (instead of `vector.shuffle`) are not handled by some GPU codegen lowerings. * Filed iree-org#17346. * Cherry-picked Bazel build fix: llvm/llvm-project#91654 * Several e2e tests have been temporarily disabled, follow-up work is needed to reenable them: iree-org#17344 --------- Co-authored-by: MaheshRavishankar <[email protected]> Co-authored-by: Scott Todd <[email protected]> Signed-off-by: Lubo Litchev <[email protected]>
This allows dropping our existing local-revert of llvm/llvm-project#89131 and cherry-pick of llvm/llvm-project#91654 which we had introduced in the earlier integrate iree-org#17330. This locally reverts llvm/llvm-project#90802 because it causes numerical errors, reported at llvm/llvm-project#90802 (comment). Signed-off-by: Lubo Litchev <[email protected]>
vector.interleave
that it generates (instead ofvector.shuffle
) are not handled by some GPU codegen lowerings.vector.interleave
tovector.shuffle
#17346..mlirbc
files for tests and benchmarks after LLVM integrate #17330 #17344