Pluggable Packing Representation in IREE #12075

nicolasvasilache · 2023-02-06T17:20:52Z

nicolasvasilache
Feb 6, 2023

Hi everyone,

I would like to start a discussion on how to represent and manipulate data-centric transformations such as packing, in a fashion that is both cognizant of IREE requirements and general enough. By general, I want to capture at least the fact that we shouldn't want to manually write dozens of special cases in our compiler and instead have one intuitive way to cover a large number of cases.

In addition to the conciseness aspect, there are also normalization, retargetability and extensibility aspects that I unpack below (pun intended) and that I would like to socialize better to reach a common understanding. I am hoping that the vision I describe that connects the elements below forms the basis a codegen vision that you'd find compelling and future-proof.

This is related to #11821 and other posts about the special case of data tiling that IREE implements today (i.e. transforming a linalg.matmul into a linalg.mmt4d to then call a library (link needed)).

General Background

Data layout transformations are a well-known class of transformations aimed at obtaining high-performance by reorganizing data to match hardware data transfer characteristics.
They are important because a computer is essentially a machine that moves bits form a large storage (e.g. datacenter) to the place where it can process the data (often registers at the heart of the CPU), computes and then communicate results back.

Each interface between 2 storage media is an opportunity for crippling inefficiencies. Like in physics, such inefficiencies compound in a multiplicative fashion. The rules that govern data costs are driven by storage capacity and temporal + spatial locality (i.e. data reuse and compulsory misses are key metrics to optimize for).

To avoid order of magnitude slowdowns, proper data layout and alignment needs to be considered at all levels of the hardware hierarchy, e.g.

Movement between register banks.
L1 / fast memory <-> registers.
Ln <-> Ln+1.
Ln+1 <-> main memory.
Main memory <-> PCI-E bus <-> neighboring main memory.
Memory <-> network interface <-> network topology <-> memory on another machine.

Specific harwardare features may allow to cut through some layers of abstraction (e.g. hardware prefetchers, write-back / write-through caches, DMA engines, CPU sockets, CUDA cp.async, NVLink, ...) or add additional layer of complexity (OS memory management / TLB, ...).

When talking about data packing, in IREE codegen, for now, I think we mostly care about handling:

The memory -> L2 -> L1 -> register case.
In the one particular case of linalg.matmul to linalg.mmt4d for which we want to reuse the Ruy implementation.

I know from discussions with @benvanik that he thinks a lot about all the other levels of the stack but for the sake of simplicity I'll only cast a net around memory -> Ln -> registers, which is a separable part of the problem. Additionally, packing solutions to the memory -> Ln -> registers problem compose with solutions for the higher-level part of the stack.

IREE Codegen

This is my attempt at characterizing ongoing efforts in IREE and making sure I, and others who have not been involved in the specifics of the IREE decision process, can get up to speed (if something is inaccurate please let me know so I can update my priors):

We pattern-match named linalg.matmul operations that we want to convert to linalg.mmt4d.
We perform a rewrite by adding a new tensor type encoding (i.e. an attribute) to the type of each operand. Importantly, that rewrite is kept virtual in an attribute, the application of the transformation is delayed.
A propagation pass rewrites the tensor type encoding through use-def chains by performing greedy graph rewrites.
Dispatch regions are formed by "casting a net" around operations using a fixed heuristic that must not know about any hardware features. In particular, the introduction of packing/unpacking operations is delayed. The packed tensor operands may be further recombined with other operations during the propagation pass.
Later, at the HAL level, hardware information becomes available and is used to materialize the packings and gives them as an input to codegen.
A late codegen runs that must reconcile the packing annotations with the implementation, across different dispatch regions, once HW-specific sizes are known.
Threading new attributes through the system is a manual process which seems to involve changes to at least SetEncoding.cpp, MaterializeEncoding.cpp and MaterializeEncoding.cpp.
At the moment we already have 10 attributes for the linalg.matmul case.

Step 3. and 4. (delayed packing materialization and dispatch region creation) are a cornerstone of IREE.
They aim to provide a good enough fixed graph partitioning for all target hardware (CPU, GPU, mobile, all current and future accelerators).

As a consequence, steps 3. and 4. should never know anything about the underlying hardware.

Properties Important for Packing

Stepping back from the specific use case that we have in IREE so far, I would like to characterize the general factors influencing packing decisions, I may be missing some:

The semantics of one "key" (i.e. expensive) operation consuming or producing the tensor. The op imposes one particular layout it prefers in its neighborhood (the neighborhood which may vary with graph transformations). The semantics part is important because the preferred packing is not the same for a linalg.matmul or a linalg.transposed_lhs_matmul (assuming such a named op is introduced).
The operand position: the lhs, rhs and res of a linalg.matmul have different layouts.
The problem size: a tall and skinny linalg.matmul requires a different packing than a square one. For example, if a dimension of a matmul is small (e.g. small batch size of 4), packing to say 16 immediately limits us to 25% of peak. Sometimes, this is sill what we may want on very specific accelerators for which no other implementation is competitive.
Hardware storage characteristics in the neighborhood of the storage we care about optimizing. In the memory -> Ln -> register case this comprises: register size, number of registers (i.e. volume of register storage), alignment, cache line size, cache line capacity and associativity (i.e. volume of cache storage).
Hardware compute characteristics of the unit on which the op is expected to execute. In a world of accelerators, this can be very varied (ARM sdot, Intel VNNI, CUDA tensorcores, Google TPU etc etc). The existence of such operations can heavily skew the decision: missing the mapping to the right ISA is a "game over" type of situation.
The element type which adds an extra dimension of complexity to the cartesian solution space.
Quantization.
Compression, a traditionally higher-level consideration that is coming down to the inner workings of the hardware these days (e.g. CUDA sp.mma).
The proper neutral element for the padding part of the packed operation. This neutral element may have to change as we propagate pad/packing through the graph (unless packing sizes divide, but this should only be seen as an optimization).

The case of matmul has been heavily studied for decades and the landscape is "relatively" clear and unambiguous.
For instance, matmul_f32_f32_f32 on AVX-512 run well with an m,n,k of size 16x16x1, iterating on as large a k dimension that fits in L1 (see figure 16, which includes the cost of pack / unpack on the fly).
Extensions to other data types and other CPU ISAs should be relatively straightforward, provided the right abstractions are added in MLIR (see the discussion on HW-specific and retargetable vector dialects in the vector dialect rationale).
Sometimes, the LLVM compiler itself may get in the way and adding inline asm may be necessary for unlocking performance (see table 3 p25).

Matmul is a crucial kernel to get right in the system and it is great we are pursuing the mmt4d path.

I am wondering how we should proceed to go beyond matmul and:

Devise a generic algorithm that does not require introducing a new named op for every new case.
Automate IREE-specific integrations, without having to manually write the C++ logic each time.

Generalizing Packing

Recently started investigations in mapping convolutions triggered the desire for an algorithm that can more generally detect a contraction in any generic or named op (see this commit).

Here is what this looks like:

func.func @matmul_mk_kn_mn(%A : !A_mk, %B : !B_kn, %C : !C_mn) -> !C_mn {
  //      CHECK: linalg.generic
  // CHECK-SAME: indexing_maps = [#[[$mk_kkmm]], #[[$kn_kknn]], #[[$mn_mmnn]]]
  // CHECK-SAME:   ["reduction", "parallel", "parallel", "reduction", "parallel", "parallel"]} 
  // CHECK-SAME:   ins(%{{.*}} : tensor<128x8x32x8xf32>, tensor<8x8x32x16xf32>)
  // CHECK-SAME:  outs(%{{.*}} : tensor<128x8x8x16xf32>)
  %0 = linalg.matmul ins(%A, %B : !A_mk, !B_kn) outs(%C : !C_mn) -> !C_mn
  return %0 : !C_mn
}
func.func @matmul_mk_nk_nm(%A : !A_mk, %B : !B_nk, %C : !C_nm) -> !C_nm {
  //      CHECK: linalg.generic
  // CHECK-SAME: indexing_maps = [#[[$mk_kkmm]], #[[$kn_kknn]], #[[$mn_mmnn]]]
  // CHECK-SAME:   ["reduction", "parallel", "parallel", "reduction", "parallel", "parallel"]} 
  // CHECK-SAME:   ins(%{{.*}} : tensor<128x8x32x8xf32>, tensor<8x8x32x16xf32>)
  // CHECK-SAME:  outs(%{{.*}} : tensor<8x128x8x16xf32>)
  %0 = linalg.generic #mkn_trait ins(%A, %B : !A_mk, !B_nk) outs(%C : !C_nm) {...} -> !C_nm
  return %0 : !C_nm
}
func.func @matmul_mk_nk_mn_transposed(%A : !A_mk, %B : !B_nk, %C : !C_nm) -> !C_nm {
  //      CHECK: linalg.generic
  // CHECK-SAME: indexing_maps = [#[[$mk_kkmm]], #[[$kn_kknn]], #[[$mn_mmnn]]]
  // CHECK-SAME:   ["reduction", "parallel", "parallel", "reduction", "parallel", "parallel"]} 
  // CHECK-SAME:   ins(%{{.*}} : tensor<128x8x32x8xf32>, tensor<8x8x32x16xf32>)
  // CHECK-SAME:  outs(%{{.*}} : tensor<8x128x8x16xf32>)
  %0 = linalg.generic #mkn_transposed_trait ins(%A, %B : !A_mk, !B_nk) outs(%C : !C_nm) {...} -> !C_nm
  return %0 : !C_nm
}

transform.sequence failures(propagate) {
^bb1(%module_op: !pdl.operation):
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %module_op 
    : (!pdl.operation) -> !transform.op<"linalg.matmul">
  transform.structured.pack_greedily %matmul 
      gemm_packed_sizes = [8, 16, 32] gemm_inner_dims_order = [1, 2, 0]
    : (!transform.op<"linalg.matmul">) -> !transform.op<"linalg.generic">
}

The TL;DR is that a single transformation can find a gemm within a linalg.generic and pack it according to the parameters gemm_packed_sizes and gemm_inner_dims_order.

One interesting aspect here is that this acts as a normalization step: in the 3 examples above, different input linalg ops are packed to the same 8x16x32 form.

As a side note, this also works out of the box with other ops (e.g. batch contracted op or conv_2d).

I believe this provides a level of simplification that will help us greatly with generalization.

An additional interesting observation is that this transformation is significantly more general than just the memory -> Ln -> registers level and I expect it will help us with distribution at various levels of the hierarchy: packed parallel (resp. reduction) iterators generalize to hierarchical parallel (resp. reduction) iterators. Sliding windows are a bit trickier and require either "ghost regions" or modular arithmetic to pack and distribute but should not be fundamentally out of character.

I expect simple evolutions of this pack_greedily transformation will include:

Extracting other "things that run well" from any generic i.e. any identified subcomputation that pack enough arithmetic intensity to saturate some HW. For instance, Figure 16 in our tech report identifies conv_1d as such an operation that packs a serious punch on CPUs. This is interesting because conv_1d has significantly more reuse than matmul and saturates compute units even faster than gemm. This does not require a memory blowup transformation such as im2col, whose effect is to reduce memory intensity to allow mapping to gemm.
When the op is memory bound and 1. fails to find a gemm, conv_1d etc, the transform can still normalize the memory accesses of any linalg.generic by setting it up with guaranteed contiguous and aligned memory accesses.

Strawman Usage in an E2E Codegen Flow

Here is a strawman of how we can use this to get target-independent reliable performance backstop that bottoms out on innermost tile-level of known performance, no spilling, no unnecessary loop peeling or unrolling, etc:

pack and normalize to the most "arithmetic intensity punch"-packing subop (e.g. bgemm, brgemm, conv1d, gemm, gemv, dot, reduce). We know by normalization how many of those last dims are involved (resp. 4, 3, 3, 3, 2, 1, 1).
interchange parallel iterators to outermost positions (except the "last dims involved")
tile and fuse all parallel iterators by 1 (except the "last dims involved"), map and fold to whatever level of processor HW hierarchy is available
tile all remaining reduction/window dimensions by 1 (except the "last dims involved")
at this point the op is of shape 1x...x1x8x16x32 (strawman gemm case with 8x16x32 packing), vectorize it
hoist vector ops across loops
bufferize
lower vectors to the proper HW-specific abstraction ISA (i.e. the HW vector ops in https://mlir.llvm.org/docs/Dialects/Vector/#positioning-in-the-codegen-infrastructure). These may be either hw_specific.vector ops or inline_asm ops.

The above is a pure retargetable codegen backstop and will get to (conservatively) 5-10%, (optimistically) 10-30%of available perf reliably if step 8. is connected properly.
To get reliably higher, we will need to plug in a little larger tile sizes and plug classical transformations like interchange, promotion to fast local memory, better hoisting and pipelining.

The last k steps can also be replaced by "call a handwritten assembly kernel" when it makes sense (as I mentioned LLVM may get in the way).

Later: rewrite top-N parallel loops into a space filling curve for better cache locality and last mile optimizations.

A Few Extra Words On Generalized Packing

Additionally, the introduction of tensor.pack and tensor.unpack operations provides the supporting op to implement the 3rd type of linalg tiling that we have been missing until now: tiling an N-D op into a 2*N-D op without introducing loops. This can also unlock linalg nesting and generally improve composition.

Looking forward, this also connect with structured codegen beyond rectangular arrays when we can support more fancy data types.. I expect compression will require some of that but don't worry about it for now...

Request for Advice for Connecting Properly to IREE

I would like to be able to connect generalized packing-based codegen strategies for ops that have more ambiguity than "matmul in a size regime that is already good for mmt4d".

I may not know the name of an op ahead of time, the proper packing sizes may vary based on all the factors in the "Properties Important for Packing" section.
Even in the mm4d case, we know we will want multiple different packings for different problem size regimes (e.g. see slide 153 in this older presentation for sizes up to 256^3).

This is where I would be interested in advice and help to connect this work better to IREE and avoid manual intervention in the compiler for every case.

Pluggable Type Attribute or Interface

Can we think of an extensible mechanism to connect different packing strategies to backend compilation, without having to hardcode them in C++ ?
Could we evolve the current fixed load-bearing op name + operand packing name + hardcoded C++ mapping to magic numbers?

The information I would need to unambiguously delay the creation of packing resembles:

struct PackingMetadata {
  // In the current IREE implementation, this encoded as one of 10 hardcoded attributes (e.g. MATMUL_TRANSPOSE_LHS_I8_I8_I32).
  struct LinalgOpMetadata {
    SmallVector<AffineMap> indexingMaps;
    SmallVector<linalg::Utils::IteratorType> iteratorTypes;
    Attribute packingConstant/NeutralElement;
    int64_t operandNumber; // for operands only, but not for the op.
    // Some representation of the computation, 
  };
  
  // This is a delayed **target** subcomputation we want to extract from a `linalg.generic`.
  // In the current IREE implementation, this is encoded as hardcoded C++ magic numbers.
  optional<GemmPackingMetadata> {
    SmallVector<int64_t> gemmPackedSizes;
    SmallVector<int64_t> gemmInnerDimsOrder;
    SmallVector<int64_t> gemmOuterDimsOrder;
  };

  // Other delayed **target** subcomputations may make sense in the future.
  // This does not exist in IREE atm.
  optional<ConvPackingMetadata> { ... };
  optional<ElementwisePackingMetadata> { ... };
  optional<ReductionPackingMetadata> { ... };
  optional<BRGemmPackingMetadata> { ... };
};

Now I am not particularly thrilled about the complexity of this potential Attribute / Interface but it would do the job without requiring SSA values.

In the absence of such an attribute, the following works for me but are not ideal in an IREE context:

apply packing as a preprocessing step (e.g. mlir-opt or iree-opt are both fine).
lower tensor.pack/tensor.unpack to a mix of tensor.expand/collapse_shape, linalg.fill and linalg.transpose. This currently needs to happen because IREE does not legalize tensor.pack/tensor.unpack ops and fails when presented with such IR. There could be opportunities to do a graph leel rewrite (see next section).
(sometimes) locally disable some heuristics of dispatch region formation to avoid interfering with the lowered transpose op, there are 2 cases here:

i. linalg.generic ops that implement a linalg.transpose seem to always be fused on inputs. This undoes the transposition part of the packing and defeats the purpose of the transformation. This seems to reliably not occur when using a real linalg.transpose. In general, I think IREE should not be so dependent on hardcoded names but there seems to be an easy mitigation.

ii. whatever the form of linalg.transpose or linalg.generic implementing the transpose, it seems to always be fused into the output. This is an interesting tradeoff: (a) on one hand, this resolves transposition immediately without breaking the layout; (b) on the other hand, the iterators order is changed and the normalization property is lost (i.e. the most minor op iterator dimensions are not the ones that participate in the gemm anymore). However it may be possible to recover it.

To make things more concrete, here is a draft PR #12076 that illustrates the various point in this discussion.

One may appreciate the fact that this PR has 0 lines of C++ and is purely a composition of upstream transformations with IREE's compilation flow.

Benchmarking

A much more modest question: I have a specific use case for performance collection that I am not sure how IREE supports: I would like to run IREE on a few different compiled variants of the same dispatch region and look at their performance, sort and dig deeper into assembly. On CUDA GPUs this is easy thanks to nsight but on CPU I am unclear. I talked to @qcolombet who used tracy but it seems this is not ideal for this use case.

Atm, am I better off (a) stitching all my IRs in a single compilation unit with a dozen dispatches + dump intermediate files + manually figure it out from there or (b) is there a better suggested alternative?

Looking Beyond: Graph-Level Optimization

In the grander picture, the discussion above omits a key aspect: how to make good global packing decisions.
At the memory -> Ln -> register case, my understanding is that IREE is focusing on the "fuse padding with producers" approach which has been proven for matmul in Ruy. This is designed to fit within the IREE constraints of delayed materialization of packing and attributes.

Alternatively, the approach followed by Intel with their TPP work is to propagate tensor.pack / tensor.unpack operations aggressively through the graph. By making good packing decisions on various conv, brgemm and matmul flavors, it is my understanding that:

MLP models are fully folded away (into weights and the 1 input tensor) and aligned
ResNet (50?) has only 7 tensor.pack and 1 tensor.unpack left.

Maybe they could shed more general light on other high order bits (@chelini).

A third data point worth mentioning, if people are not familiar with the topic, is the TASO work from Stanford. This contribution features a "Graph Rewriter and Data Layout Joint Optimizer" (see Figure 1).

Lastly, Jim Demmel's communication-avoiding algorithms (i.e. distributed recomputations to avoid communications) also lurk in the shadows here, if the hardware has enough compute vs communication imbalance. It is my expectation Hopper will be close to that regime and will also be much programmable with higher-D operations that match the hardware hierarchy.

I don't want to deviate further from the topic at hand but I wanted to point out that tiling N-D to 2*N-D ops can also have large scale implications on the graph level. If these specific topics are of interest to IREE, let's start another discussion.

Thanks For Reading !

Questions, comments?

@stellaraccident @mattwalsh @benvanik @MaheshRavishankar @silvasean @jpienaar @ftynse @ThomasRaoux @qcolombet @chelini @rengolin

MaheshRavishankar · 2023-02-06T18:51:34Z

MaheshRavishankar
Feb 6, 2023
Collaborator

Looking through this, this is well in line with everything that is already being considered for implementing data-tiling in IREE (please see details below). I think maybe some of the implementation details today are being over-emphasized. I think everything in here is already pluggable in IREE and was always the plan.

This is related to #11821 and other posts about the special case of data tiling that IREE implements today (i.e. transforming a linalg.matmul into a linalg.mmt4d to then call a library (link needed)).

That is just what is used today cause the 2D data-tiling on linalg.matmul to linalg.mmt4d was already available and would let us hook into all existing work driven through linalg.mmt4d. This is not central to the current approach, and indeed we should be able to use the generalized data-tiling transformations you added to MLIR in into IREE already.

Strawman Usage in an E2E Codegen Flow

Here is a strawman of how we can use this to get target-independent reliable performance backstop that bottoms out on innermost tile-level of known performance, no spilling, no unnecessary loop peeling or unrolling, etc:

pack and normalize to the most "arithmetic intensity punch"-packing subop (e.g. bgemm, brgemm, conv1d, gemm, gemv, dot, reduce). We know by normalization how many of those last dims are involved (resp. 4, 3, 3, 3, 2, 1, 1).

interchange parallel iterators to outermost positions (except the "last dims involved")

tile and fuse all parallel iterators by 1 (except the "last dims involved"), map and fold to whatever level of processor HW hierarchy is available

tile all remaining reduction/window dimensions by 1 (except the "last dims involved")

at this point the op is of shape 1x...x1x8x16x32 (strawman gemm case with 8x16x32 packing), vectorize it

hoist vector ops across loops

bufferize

lower vectors to the proper HW-specific abstraction ISA (i.e. the HW vector ops in mlir.llvm.org/docs/Dialects/Vector/#positioning-in-the-codegen-infrastructure). These may be either hw_specific.vector ops or inline_asm ops.

The above is a pure retargetable codegen backstop and will get to (conservatively) 5-10%, (optimistically) 10-30%of available perf reliably if step 8. is connected properly. To get reliably higher, we will need to plug in a little larger tile sizes and plug classical transformations like interchange, promotion to fast local memory, better hoisting and pipelining.

The last k steps can also be replaced by "call a handwritten assembly kernel" when it makes sense (as I mentioned LLVM may get in the way).

Later: rewrite top-N parallel loops into a space filling curve for better cache locality and last mile optimizations.

This all seems like things that should happen in the Codegen backend, and is already what is being done with mmt4d, but it is great if we can generalize to any linalg.generic.

Pluggable Type Attribute or Interface

Can we think of an extensible mechanism to connect different packing strategies to backend compilation, without having to hardcode them in C++ ? Could we evolve the current fixed load-bearing op name + operand packing name + hardcoded C++ mapping to magic numbers?

The information I would need to unambiguously delay the creation of packing resembles:

struct PackingMetadata {
  // In the current IREE implementation, this encoded as one of 10 hardcoded attributes (e.g. MATMUL_TRANSPOSE_LHS_I8_I8_I32).
  struct LinalgOpMetadata {
    SmallVector<AffineMap> indexingMaps;
    SmallVector<linalg::Utils::IteratorType> iteratorTypes;
    Attribute packingConstant/NeutralElement;
    int64_t operandNumber; // for operands only, but not for the op.
    // Some representation of the computation, 
  };
  
  // This is a delayed **target** subcomputation we want to extract from a `linalg.generic`.
  // In the current IREE implementation, this is encoded as hardcoded C++ magic numbers.
  optional<GemmPackingMetadata> {
    SmallVector<int64_t> gemmPackedSizes;
    SmallVector<int64_t> gemmInnerDimsOrder;
    SmallVector<int64_t> gemmOuterDimsOrder;
  };

  // Other delayed **target** subcomputations may make sense in the future.
  // This does not exist in IREE atm.
  optional<ConvPackingMetadata> { ... };
  optional<ElementwisePackingMetadata> { ... };
  optional<ReductionPackingMetadata> { ... };
  optional<BRGemmPackingMetadata> { ... };
};

The current use the enum encoding attributes are a placeholder. Literally the simplest thing. In terms of complexity, having a single attribute that has a bunch of optional fields does not seem that different from having any different number of Enums. I think the only easily extensible state here is through use of AttributeInterfaces. I am not opposed to either anyway.

Now I am not particularly thrilled about the complexity of this potential Attribute / Interface but it would do the job without requiring SSA values.

In the absence of such an attribute, the following works for me but are not ideal in an IREE context:

apply packing as a preprocessing step (e.g. mlir-opt or iree-opt are both fine).

lower tensor.pack/tensor.unpack to a mix of tensor.expand/collapse_shape, linalg.fill and linalg.transpose. This currently needs to happen because IREE does not legalize tensor.pack/tensor.unpack ops and fails when presented with such IR. There could be opportunities to do a graph leel rewrite (see next section).

This is just the current implementation and will be fixed soon. IREE is still using the iree_linalg_ext.pack and iree_linalg_ext.unpack operations. They are being moved to use the upstream tensor.pack and tensor.unpack operations (if you generate the iree_linalg_ext version that should work e2e today cause that is the tested part in IREE). These are meant to be end-to-end handalable precisely for the sake of experimenting without having to use a defered mechanism from the get-go. I think a lot of what is described below is just extrapolating from this point which is meant to work, and is being looked at.

(sometimes) locally disable some heuristics of dispatch region formation to avoid interfering with the lowered transpose op, there are 2 cases here:
i. linalg.generic ops that implement a linalg.transpose seem to always be fused on inputs. This undoes the transposition part of the packing and defeats the purpose of the transformation. This seems to reliably not occur when using a real linalg.transpose. In general, I think IREE should not be so dependent on hardcoded names but there seems to be an easy mitigation.

I dont follow the details here. The linalg.generic is always fused on inputs. That is true. I see two issues here. You are lowering the tensor.pack and tensor.unpack to a sequence of ops cause it didnt work out of the box. That is the bug which we should fix. The default then tries to reduce the number of generic ops (including fusing with transposes). IREE is not looking for hard-coded names explicitly. So this sounds like separate issues that are being lobbed into dispatch region formation heuristics (please file bugs for specific ases you think fusion should/should not happen and we can discuss those separately). Another issue might be linalg.transpose. These are meant to be generalized early. So linalg.transpose related things have not been evaluated, so they might just go down an unintended path.

ii. whatever the form of linalg.transpose or linalg.generic implementing the transpose, it seems to always be fused into the output. This is an interesting tradeoff: (a) on one hand, this resolves transposition immediately without breaking the layout; (b) on the other hand, the iterators order is changed and the normalization property is lost (i.e. the most minor op iterator dimensions are not the ones that participate in the gemm anymore). However it may be possible to recover it.

I think we should ignore this part. The whole point of the pack and unpack operations is to have a single op that carries the entire context of all the operations it needs to perform without having to rely on "fusion heuristics" to auto-magically fuse things into a dispatch as you would expect. That is not what the default heuristics are meant for.

3 replies

nicolasvasilache Feb 7, 2023
Author

They are being moved to use the upstream tensor.pack and tensor.unpack operations (if you generate the iree_linalg_ext version that should work e2e today cause that is the tested part in IREE)... These are meant to be end-to-end handalable precisely for the sake of experimenting without having to use a defered mechanism from the get-go

Adding a trivial 1-1 pattern to #12076 indeed allows it to pass the first layers of iree-compile up to codegen.
So it seems there is a builtin way to allow non-attribute-based packing to connect e2e: this was the ask.
I understand this is not the way we will ship thing; still separation of concerns, reducing IR footprint one has to grok at and optionality help tremendously with modularity and velocity.

Thank you!

You are lowering the tensor.pack and tensor.unpack to a sequence of ops cause it didnt work out of the box. That is the bug which we should fix.

I think we should ignore this part. The whole point of the pack and unpack operations is to have a single op that carries the entire context of all the operations it needs to perform without having to rely on "fusion heuristics" to auto-magically fuse things into a dispatch as you would expect.

There are a 2 related things that fail, I'll ping you on the PR, otherwise things look promising!

That is not what the default heuristics are meant for.

Does this mean there is room for non-default heuristics? Is this already pluggable? Maybe a topic for another discussion, more closely related to the "Looking Beyond: Graph-Level Optimization" part.

Thanks again!

MaheshRavishankar Feb 7, 2023
Collaborator

They are being moved to use the upstream tensor.pack and tensor.unpack operations (if you generate the iree_linalg_ext version that should work e2e today cause that is the tested part in IREE)... These are meant to be end-to-end handalable precisely for the sake of experimenting without having to use a defered mechanism from the get-go

Adding a trivial 1-1 pattern to #12076 indeed allows it to pass the first layers of iree-compile up to codegen. So it seems there is a builtin way to allow non-attribute-based packing to connect e2e: this was the ask. I understand this is not the way we will ship thing; still separation of concerns, reducing IR footprint one has to grok at and optionality help tremendously with modularity and velocity.

Thank you!

You are lowering the tensor.pack and tensor.unpack to a sequence of ops cause it didnt work out of the box. That is the bug which we should fix.

I think we should ignore this part. The whole point of the pack and unpack operations is to have a single op that carries the entire context of all the operations it needs to perform without having to rely on "fusion heuristics" to auto-magically fuse things into a dispatch as you would expect.

There are a 2 related things that fail, I'll ping you on the PR, otherwise things look promising!

That is not what the default heuristics are meant for.

Does this mean there is room for non-default heuristics? Is this already pluggable? Maybe a topic for another discussion, more closely related to the "Looking Beyond: Graph-Level Optimization" part.

Sure. There is no one size fits all here. There is absolutely a place for custom fusions, etc. that cant (shouldnt) be captured by a default. But I would suggest to be very deliberate about doing that/using that path. Current default heuristics have been picked after some deep dives into several models, and bucketing things into "should be handled"/"cannot be handled this way", even if it means missing some fusions.

Thanks again!

hanhanW Feb 9, 2023
Collaborator

Thanks for writing this up! This is very helpful for both who are working on data-tiling approach and who are not. What you and Mahesh said are pretty in line with what I've been thinking!

Adding some details about tensor.pack/unpack support in IREE.

Now I am not particularly thrilled about the complexity of this potential Attribute / Interface but it would do the job without requiring SSA values.
In the absence of such an attribute, the following works for me but are not ideal in an IREE context:

apply packing as a preprocessing step (e.g. mlir-opt or iree-opt are both fine).

lower tensor.pack/tensor.unpack to a mix of tensor.expand/collapse_shape, linalg.fill and linalg.transpose. This currently needs to happen because IREE does not legalize tensor.pack/tensor.unpack ops and fails when presented with such IR. There could be opportunities to do a graph leel rewrite (see next section).

This is just the current implementation and will be fixed soon. IREE is still using the iree_linalg_ext.pack and iree_linalg_ext.unpack operations. They are being moved to use the upstream tensor.pack and tensor.unpack operations (if you generate the iree_linalg_ext version that should work e2e today cause that is the tested part in IREE). These are meant to be end-to-end handalable precisely for the sake of experimenting without having to use a defered mechanism from the get-go. I think a lot of what is described below is just extrapolating from this point which is meant to work, and is being looked at.

What Mahesh mentioned makes sense. If you generate iree_linalg_ext version, they should work e2e. I'm working on switching to use tensor.pack/unpack ops in IREE. The tensor.pack op will work e2e (for most of cases) after landing #11875. The tensor.unpack op is coming soon. After having them working e2e, we'll try to retire iree_linalg_ext version, and everything should work with upstream ops and transformations.

chelini · 2023-02-07T06:46:26Z

chelini
Feb 7, 2023

Thanks for writing this up. I am excited to see where this is going! What @nicolasvasilache described is what we are doing. First, we have a set of "good" packing decisions for different operations like matmul or conv2d, currently hard-coded. Then, after packing, we try to push tensor.pack and tensor.unpack at the boundaries. With what @hanhanW and I upstreamed in DataLayoutPropagation.cpp plus some yet-to-be upstream in-house patterns, I can get a single tensor.pack and a single tensor.unpack in ResNet50. I still have some pieces to connect, but I hope to have ResNet50 running e2e with LIBXSMM soonish.

2 replies

nicolasvasilache Feb 7, 2023
Author

One single tensor pack/unpack for the whole ResNet graph + libraries running at peak .. way to go :) !

hanhanW Feb 9, 2023
Collaborator

One single tensor.pack and tensor.unpack sounds really good! After switching to use tensor.pack/unpack ops in IREE, I'll try to prototype the propagation in IREE as well! :D

rengolin · 2023-02-09T12:01:41Z

rengolin
Feb 9, 2023

Hi Nicolas, Great write up! Some comments inline.

On Mon, 6 Feb 2023 at 17:21, Nicolas Vasilache ***@***.***> wrote: Step 3. and 4. (delayed packing materialization and dispatch region creation) are a cornerstone of IREE. They aim to provide a good enough fixed graph partitioning for all target hardware (CPU, GPU, mobile, all current and future accelerators). As a consequence, steps 3. and 4. should never know anything about the underlying hardware.

This may be the case in IREE, but I disagree this is a strong general property. While your high-level pass does annotate the ops, not knowing what is the underlying hardware limits the ability to do high-level transforms that would be beneficial to code generation down the line. High-level decisions, for example how to partition your graph (around propagation), will make following code-gen decisions from easy to impossible. This is less important when lowering to CPUs, because it's the same ISA and (mostly) same virtual memory, but when lowering to GPUs or accelerators, some compute still remain on CPUs, and knowing the trade-offs at partition time will make you find better boundaries and lead to better code generation of the partitions down the pipe. The alternative is to carry dozens of attributes in each instruction. This is not only fragile, but can lead to inconsistent messages that the back-end has no option but to drop it on the floor and move on.

Sometimes, the LLVM compiler itself may get in the way <https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237> and adding inline asm may be necessary for unlocking performance (see table 3 p25 <https://arxiv.org/pdf/2202.03293.pdf>).

I am wondering how we should proceed to go beyond matmul and:

1. Devise a generic algorithm that does not require introducing a new named op for every new case. 2. Automate IREE-specific integrations, without having to manually write the C++ logic each time. Inline assembly moves the library's job to the user and it's never a

"good" solution, especially if the users are ML/HPC researchers who only know Python or Octave. The best move is to the compiler side, knowing where to inject micro-kernels in the right places, like both our projects are aiming at. To me, this is the only acceptable stop-gap, until compilers can select the right instructions for every case. Alternatively, the approach followed by Intel with their TPP work is to

propagate tensor.pack / tensor.unpack operations aggressively through the graph. By making good packing decisions on various conv, brgemm and matmul flavors, it is my understanding that: 1. MLP models are fully folded away (into weights and the 1 input tensor) and aligned 2. ResNet (50?) has only 7 tensor.pack and 1 tensor.unpack left. Maybe they could shed more general light on other high order bits ( @chelini <https://github.com/chelini>).

The quick overview is that TPP sits in between previous Intel efforts to create catch-all libraries (TBB, MKL, etc) and intrinsics / inline assembly. We want this micro-kernel library to act as an ISA between optimizing compilers and varied complex hardware choices (CPUs, GPUs, accels). So far, it has worked (by hand in the paper and via compiler in PlaidML) for key Intel, AMD and Arm CPUs and we have an effort looking into GPUs and beyond. What previous work has done (and why we need to replicate), is to completely remove packing from the model except a single input packing / output unpacking pair. Weight shapes are static throughout the model, even if they get repacked during training, they're still the same shape for every backward pass. So what remains is reshaping the input as it come in and the output as it goes out, which most models already do beforehand, so even that could be removed. For inference, static pre-trained weights get packed at compile time and re-written to the model file (MLIR, protobuf). For training, it will depend on initialization. If there has been a pre-training, you need to pack only once you start training. If they've been randomly initialized, then you probably don't even need a re-layout. Resnet 50's remaining packs are due to missing patterns, as they should all go away too. PlaidML can do all that already, but the rest of its technology is stuck in the past. We're extracting the TPP value from it and putting it in MLIR upstream, where it can continue to have an impact for many years to come.

…

--renato

Message ID: ***@***.***>

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pluggable Packing Representation in IREE #12075

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 5 replies

{{title}}

Strawman Usage in an E2E Codegen Flow

Pluggable Type Attribute or Interface

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Pluggable Packing Representation in IREE #12075

nicolasvasilache Feb 6, 2023

General Background

IREE Codegen

Properties Important for Packing

Generalizing Packing

Strawman Usage in an E2E Codegen Flow

A Few Extra Words On Generalized Packing

Request for Advice for Connecting Properly to IREE

Pluggable Type Attribute or Interface

Benchmarking

Looking Beyond: Graph-Level Optimization

Thanks For Reading !

Replies: 3 comments · 5 replies

MaheshRavishankar Feb 6, 2023 Collaborator

Strawman Usage in an E2E Codegen Flow

Pluggable Type Attribute or Interface

nicolasvasilache Feb 7, 2023 Author

MaheshRavishankar Feb 7, 2023 Collaborator

hanhanW Feb 9, 2023 Collaborator

chelini Feb 7, 2023

nicolasvasilache Feb 7, 2023 Author

hanhanW Feb 9, 2023 Collaborator

rengolin Feb 9, 2023

nicolasvasilache
Feb 6, 2023

Replies: 3 comments 5 replies

MaheshRavishankar
Feb 6, 2023
Collaborator

nicolasvasilache Feb 7, 2023
Author

MaheshRavishankar Feb 7, 2023
Collaborator

hanhanW Feb 9, 2023
Collaborator

chelini
Feb 7, 2023

nicolasvasilache Feb 7, 2023
Author

hanhanW Feb 9, 2023
Collaborator

rengolin
Feb 9, 2023