Add a lowering of vector.interleave to vector.shuffle #17346

Open
bjacob opened this issue May 10, 2024 · 17 comments

@bjacob
Contributor

bjacob commented May 10, 2024

In LLVM integrate #17330 we have to locally revert llvm/llvm-project#89131 because it causes vector.interleave to be created instead of vector.shuffle, and some GPU codegen backends expect vector.shuffle and do not handle vector.interleave.

llvm/llvm-project#89131 is good in itself though, as vector.interleave is more constrained than general vector.shuffle. We just need a lowering pattern from vector.interleave to vector.shuffle to be inserted into codegen pipelines. Then we will be able to drop the local revert of llvm/llvm-project#89131.

FYI @KoolJBlack @qedawkins @kuhar

@kuhar
Member

kuhar commented May 10, 2024

Is this an MLIR issue or an IREE issue? To me it sounds like this should go in vector-to-spirv and vector-to-llvm?

@bjacob
Contributor Author

bjacob commented May 10, 2024

I was thinking vector-to-vector, rewriting vector.interleave to vector.shuffle. This way only a single, backend-agnostic pattern is needed, and by construction we know current backends are happy with vector.shuffle, since that is what they currently get. It can go in this existing file:

https://github.com/llvm/llvm-project/blob/3bde7983986d8ce637f6bb506860859249787751/mlir/lib/Dialect/Vector/Transforms/LowerVectorInterleave.cpp#L4

Then we also need an IREE-side change to insert that new pattern into codegen pipelines; and, also IREE-side, we need to remember to drop the revert of llvm/llvm-project#89131 at the following LLVM integrate.
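
To sketch the idea (untested, illustrative code only, not necessarily what will land; the accessor names and the ShuffleOp builder signature are my reading of the current vector dialect API): a 1-D `vector.interleave %a, %b` on two `vector<4xf32>` is exactly `vector.shuffle %a, %b [0, 4, 1, 5, 2, 6, 3, 7]`, so the pattern just has to build that alternating mask.

// Hedged sketch: rewrite a 1-D, non-scalable vector.interleave into the
// equivalent vector.shuffle with an alternating index mask.
struct InterleaveToShuffle final : OpRewritePattern<vector::InterleaveOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(vector::InterleaveOp op,
                                PatternRewriter &rewriter) const override {
    VectorType sourceType = op.getSourceVectorType();
    // vector.shuffle has no scalable form, and the mask below is only
    // meaningful for rank-1 sources.
    if (sourceType.getRank() != 1 || sourceType.isScalable())
      return failure();
    int64_t n = sourceType.getNumElements();
    // Alternating mask [0, n, 1, n+1, ..., n-1, 2n-1]: indices < n pick from
    // the LHS vector, indices >= n pick from the RHS vector.
    SmallVector<int64_t> mask;
    mask.reserve(2 * n);
    for (int64_t i = 0; i < n; ++i) {
      mask.push_back(i);
      mask.push_back(n + i);
    }
    rewriter.replaceOpWithNewOp<vector::ShuffleOp>(op, op.getLhs(),
                                                   op.getRhs(), mask);
    return success();
  }
};

Being a plain vector-to-vector rewrite, something like this could live next to the existing UnrollInterleaveOp pattern in LowerVectorInterleave.cpp and be exposed through a populate function.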

@kuhar
Member

kuhar commented May 10, 2024

Oh OK, so this pattern is already there, we just need to add it to the IREE pipelines. Makes sense.

@bjacob
Contributor Author

bjacob commented May 10, 2024

No no, the file is there, but it only contains an unrelated UnrollInterleaveOp pattern. The pattern that we need here does not exist yet; it needs to be created.

@kuhar
Member

kuhar commented May 10, 2024

On the SPIR-V side, I don't think there's any better lowering we could use anyway; SPIR-V has its own shuffle ops.

ScottTodd added a commit that referenced this issue May 10, 2024
…800a3 (#17330)

* torch-mlir integrated at bce800a.
* llvm-project integrated at 2083e97e plus local changes:
  * Reverted llvm/llvm-project#89131 locally: while this change is good in
    its own right, the `vector.interleave` ops that it generates (instead
    of `vector.shuffle`) are not handled by some GPU codegen lowerings.
    * Filed #17346.
  * Cherry-picked Bazel build fix: llvm/llvm-project#91654
* Several e2e tests have been temporarily disabled, follow-up work is
  needed to reenable them: #17344

---------

Co-authored-by: MaheshRavishankar <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
@bjacob
Contributor Author

bjacob commented May 10, 2024

@kuhar @qedawkins, does this look like what we discussed? llvm/llvm-project#91800
Then I'm looking on the IREE side for where to put this in the SPIR-V pipeline. Maybe around:

void runOnOperation() override {
  MLIRContext *context = &getContext();
  auto funcOp = getOperation();
  bool emitIntegerDotProdOps = supportsIntegerDotProductOps(funcOp);

  // First apply vectorization to generate vectors of the original tensor
  // shape for tensor.pad ops.
  {
    RewritePatternSet patterns(context);
    // Pull in additional vectorization patterns in IREE.
    populateVectorizePadPatterns(patterns);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after vectorizing tensor.pad");

  // Special peephole optimizations to clean up IR before further processing.
  {
    RewritePatternSet patterns(context);
    // Pull in patterns to shuffle broadcast/transpose ops around in order to
    // cancel them or embed into contract ops. Embedding in the flexible
    // contract ops will help to sustain the structure through various
    // transformations.
    vector::populateVectorReductionToContractPatterns(patterns);
    // Pull in patterns to canonicalize transfer ops.
    vector::populateVectorTransferPermutationMapLoweringPatterns(patterns);
    // Fold consumer add ops into the contraction op itself.
    vector::ContractionOp::getCanonicalizationPatterns(patterns, context);
    // Fold transpose ops if possible as we cannot unroll them later.
    vector::TransposeOp::getCanonicalizationPatterns(patterns, context);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after peephole optimization");

  // High-dimension contraction can appear after vectorizing ops like 1-D
  // convolution. Those 1-D convolution ops typically have a leading unit
  // batch dimension. Try to drop that to map to matmul dimensions better.
  SmallVector<vector::ContractionOp> contractOps;
  funcOp.walk([&](vector::ContractionOp op) {
    if (op.getIteratorTypes().size() > 3)
      contractOps.push_back(op);
  });
  for (vector::ContractionOp op : contractOps) {
    OpBuilder builder(op);
    IRRewriter rewriter(builder);
    auto result = vector::castAwayContractionLeadingOneDim(
        op, /*maskingOp=*/nullptr, rewriter);
    if (succeeded(result)) {
      rewriter.replaceOp(op, *result);
    }
  }
  debugPrint(funcOp, "after trimming contract leading unit dims");

  // Fold tensor.extract_slice/insert_slice ops into transfer ops. This helps
  // to remove those tensor slice ops so that we can enable further vector op
  // transformations.
  {
    RewritePatternSet patterns(context);
    vector::TransferReadOp::getCanonicalizationPatterns(patterns, context);
    vector::TransferWriteOp::getCanonicalizationPatterns(patterns, context);
    populateVectorTransferTensorSliceTransforms(patterns);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after folding tensor extract/insert slice ops");

  // Lower vector.multi_reduction early if any operand is a transpose op.
  // The lowering itself generates transpose ops. This helps to cancel
  // transpose ops. vector.multi_reduction is arguably a higher-level op and
  // the lowering also unrolls the multi_reduction op, so it makes sense to
  // happen before normal unrolling.
  {
    SmallVector<Operation *> reductionOps;
    funcOp.walk([&](vector::MultiDimReductionOp reductionOp) {
      if (llvm::any_of(reductionOp->getOperands(), [](Value operand) {
            return operand.getDefiningOp<vector::TransposeOp>();
          }))
        reductionOps.push_back(reductionOp);
      return WalkResult::advance();
    });
    RewritePatternSet patterns(context);
    vector::populateVectorMultiReductionLoweringPatterns(
        patterns, vector::VectorMultiReductionLowering::InnerParallel);
    if (failed(applyOpPatternsAndFold(reductionOps, std::move(patterns)))) {
      funcOp.emitOpError("vector lowering failed");
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after lowering multi reduction ops");

  // Prepare for SPIR-V integer dot product lowering.
  if (emitIntegerDotProdOps) {
    RewritePatternSet patterns(context);
    vector::populateVectorContractCanonicalizeMatmulToMMT(
        patterns, detectI8ToI32Matmul);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
    debugPrint(funcOp, "after preparing for SPIR-V dot product lowering");
  }

  // Then unroll vectors to native vector size. We try to use 128-bit
  // vectors for memory access and 4/2/1 vector sizes for computation.
  {
    RewritePatternSet patterns(context);
    populateVectorUnrollPatterns(patterns, emitIntegerDotProdOps);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after unrolling vector ops");

  // Lower reduction-unrolled vector contract ops. Such contract ops have all
  // their reduction dimensions equal to one, so we can convert them into
  // elementwise ops.
  {
    RewritePatternSet patterns(context);
    auto options =
        vector::VectorTransformsOptions().setVectorTransformsOptions(
            vector::VectorContractLowering::ParallelArith);
    vector::populateVectorContractLoweringPatterns(patterns, options);
    // The pattern can generate transpose ops. Try to fold them if possible
    // to avoid lowering them into extract/insert later.
    vector::TransposeOp::getCanonicalizationPatterns(patterns, context);
    // It also generates broadcast/extract ops. Clean them up too.
    vector::BroadcastOp::getCanonicalizationPatterns(patterns, context);
    vector::ExtractOp::getCanonicalizationPatterns(patterns, context);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after lowering size-1 reduction contract ops");

  // Now lower vector transpose, given we have handled vector patterns that
  // may generate transpose ops in previous steps. This converts transpose
  // ops into extract and insert pairs, in preparation for further
  // transformations to canonicalize/cancel.
  {
    RewritePatternSet patterns(context);
    auto options =
        vector::VectorTransformsOptions().setVectorTransposeLowering(
            vector::VectorTransposeLowering::EltWise);
    vector::populateVectorTransposeLoweringPatterns(patterns, options);
    vector::populateVectorShapeCastLoweringPatterns(patterns);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after lowering transpose ops");

  // Next run canonicalization to cast away leading size-1 dimensions. They
  // can be generated from vector unrolling and generally cause issues when
  // cancelling corresponding read/write or insert/extract op pairs. This
  // also needs to happen before hoisting, where we would make certain
  // vectors loop-carried. Once that's done, it's hard to handle the leading
  // size-1 dimensions across regions.
  {
    RewritePatternSet patterns(context);
    // We need to pull in patterns for casting away leading one dims to allow
    // cancelling some read/write ops.
    vector::populateCastAwayVectorLeadingOneDimPatterns(patterns);
    // We may have vector.insert_strided_slice inserting 1-D native vectors
    // into n-D larger vectors with the above. Break that down too. This is a
    // companion transformation of unrolling.
    vector::populateVectorInsertExtractStridedSliceDecompositionPatterns(
        patterns);
    vector::ExtractOp::getCanonicalizationPatterns(patterns, context);
    // Trimming leading unit dims may generate broadcast/shape_cast ops.
    // Clean them up too.
    vector::BroadcastOp::getCanonicalizationPatterns(patterns, context);
    vector::ShapeCastOp::getCanonicalizationPatterns(patterns, context);
    vector::TransferReadOp::getCanonicalizationPatterns(patterns, context);
    vector::TransferWriteOp::getCanonicalizationPatterns(patterns, context);
    populateVectorTransferTensorSliceTransforms(patterns);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after trimming leading unit dims");

  // Lower vector reduction to SPIR-V integer dot product.
  if (emitIntegerDotProdOps) {
    RewritePatternSet patterns(context);
    populateVectorReductionToSPIRVDotProductPatterns(patterns);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
    debugPrint(funcOp, "after lowering to SPIR-V dot product");
  }
}
};
Where exactly in that big function should it go?

@kuhar
Member

kuhar commented May 10, 2024

Otherwise here also seems fine:

vector::populateVectorInsertExtractStridedSliceDecompositionPatterns(

@qedawkins
Contributor

Nice, that looks great to me. In terms of the IREE side, I think some combination of adding the vector.interleave n-d -> 1-d here:

vector::TransposeOp::getCanonicalizationPatterns(patterns, context);

And then the interleave to shuffle can either go in the same place, or somewhere near here:

patterns, vector::VectorMultiReductionLowering::InnerParallel);

Basically after decomposing to 1d and before unrolling to 1/2/3/4 vector elements.
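
Concretely, that could be one more step in the pass above, before the unrolling block. A sketch, assuming the upstream change ends up exposing a populate entry point (the name `populateVectorInterleaveToShufflePatterns` is my guess at what llvm/llvm-project#91800 will provide, so treat it as tentative):

  // Rewrite 1-D vector.interleave into vector.shuffle while vectors are
  // still wide, before unrolling them down to <= 4 elements.
  {
    RewritePatternSet patterns(context);
    vector::populateVectorInterleaveToShufflePatterns(patterns);
    if (failed(applyPatternsAndFoldGreedily(funcOp, std::move(patterns)))) {
      return signalPassFailure();
    }
  }
  debugPrint(funcOp, "after lowering interleave ops");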

@qedawkins
Contributor

Otherwise here also seems fine:

vector::populateVectorInsertExtractStridedSliceDecompositionPatterns(

Correct me if I'm wrong, but I was thinking it had to happen before unrolling to <= 4 elements? Unless interleave implements the unrolling interface already.

@kuhar
Member

kuhar commented May 10, 2024

My hope would be that unrolling could already break it down to source vectors in some cases, but I haven't checked if it supports the unrolling interface.

@kuhar
Member

kuhar commented May 10, 2024

Basically after decomposing to 1d and before unrolling to 1/2/3/4 vector elements.

This is also a good option if it doesn't support unrolling.

@bjacob
Contributor Author

bjacob commented May 10, 2024

Can we put it upstream in https://github.com/llvm/llvm-project/blob/e9f53e4095d8a8600b5c5d445c73e2d5a6f45abb/mlir/lib/Conversion/VectorToSPIRV/VectorToSPIRV.cpp#L812?

I have updated the PR with that, thanks for the tip!

Nice, that looks great to me. In terms of the IREE side, I think some combination of adding the vector.interleave n-d -> 1-d here:

@kuhar @qedawkins I am just trying to fix the immediate issue that is forcing us to carry a local LLVM revert here. This issue only seems to involve 1-D vectors, as far as I have seen.

@qedawkins
Contributor

Makes sense, then I would say any point before here is likely to work:

populateVectorUnrollPatterns(patterns, emitIntegerDotProdOps);

@bjacob
Contributor Author

bjacob commented May 10, 2024

So should the upstream PR do it in VectorToSPIRV.cpp#L812 or not?

If it should do that upstream, then nothing needs to be done on the IREE side, right?

@qedawkins
Contributor

qedawkins commented May 10, 2024

I'm guessing we'll still need to do something on the IREE side because, in addition to the requirement that all vectors be 1-D, on SPIR-V they must also be <= 4 elements wide, and the unrolling down to <= 4 elements happens in IREE right now (an interleave of two `vector<4xf32>` produces a `vector<8xf32>`, which has no SPIR-V equivalent). So I'm thinking VectorToSPIRV will be too late.

@bjacob
Contributor Author

bjacob commented May 10, 2024

But that sounds like solving a more general problem than my immediate concern of flattening our LLVM integrate. The upstream change that we have to carry a local revert of isn't changing the number of elements in a vector, or the rank of a vector. It's just changing vector.shuffle-on-1-D-vectors to vector.interleave-on-1-D-vectors: e.g. `vector.shuffle %a, %b [0, 2, 1, 3]` on two `vector<2xf32>` becomes `vector.interleave %a, %b`.

bjacob added a commit to llvm/llvm-project that referenced this issue May 13, 2024
…le in VectorToSPIRV (#91800)

Context: iree-org/iree#17346.

Test IREE integrate showing it's fixing the problem it's intended to
fix, i.e. it allows IREE to drop its local revert of
#89131:

iree-org/iree#17359

This is added to VectorToSPIRV because SPIRV doesn't currently handle
`vector.interleave` (see motivating context above).

This is limited to 1D, non-scalable vectors.
bjacob added a commit to llvm/llvm-project that referenced this issue May 13, 2024
…le in VectorToSPIRV (#92012)

This is the second attempt at merging #91800, which bounced due to a
linker error apparently caused by an undeclared dependency.
`MLIRVectorToSPIRV` needed to depend on `MLIRVectorTransforms`. In fact
that was a preexisting issue already flagged by the tool in
https://discourse.llvm.org/t/ninja-can-now-check-for-missing-cmake-dependencies-on-generated-files/74344.

Context: iree-org/iree#17346.

Test IREE integrate showing it's fixing the problem it's intended to
fix, i.e. it allows IREE to drop its local revert of
#89131:

iree-org/iree#17359

This is added to VectorToSPIRV because SPIRV doesn't currently handle
`vector.interleave` (see motivating context above).

This is limited to 1D, non-scalable vectors.
bangtianliu pushed a commit to bangtianliu/iree that referenced this issue Jun 5, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024