Replies: 8 comments 16 replies
-
cc @bjacob
-
That sounds good to me! I understand that you mean an experimentation path (given the current regressions), similar to what has been done for ARM. We should also include RISC-V as part of this experimentation, as we have put a lot of effort into keeping the x86 and RISC-V backends aligned. That should also give us more certainty about the applicability of the data tiling approach to different targets and keep the entropy in the CPU backends under control. We haven't done much tuning for matmuls on RISC-V, so I wouldn't expect RISC-V to add too much work other than benchmarking. In this regard, I also planned to enable masking for matmuls in the foreseeable future. With that we can combine data tiling + masking, where data tiling is used for data layout transformations + alignment (but not for value padding). This will allow us to propagate the "aligned allocations" beyond the immediate producer of the matmul (see the discussion with @nicolasvasilache in internal chat). Happy to coordinate on this, but I think there shouldn't be too much interference initially.
-
Thanks for the awesome work @bjacob ! I landed all the needed PRs. We don't have any compilation or runtime issues now. I started a new benchmark in #11993. It looks like more models are improved now, but we still have regressions on some models. I'll try to grab some profile traces and share the findings here. (I just got back to work recently, so it will likely happen later this week or next week.)
-
I've been working on teaching x86 about data tiling and pack fusion. Here is a study on the MobileBert fp32 model. (The Google Doc version of the write-up is https://docs.google.com/document/d/1c6XzXY6m3yso92mJNc-gTmw2EoLYAYLyELnTEYMIgfI/edit?usp=sharing)

Performance Breakdown

Matmuls (NxMxK):
The configuration of data tiling is based on #12637
Gap Analysis

The existing inner tile sizes are not the best options. I ran a few configurations, found that {8, 2, 32} is noticeably better, and used it for benchmarking MobileBert fp32. The matmul kernels are noticeably better with data tiling. That does not mean the performance of the mmt4d kernels is better than pure matmul, because those mmt4d kernels do not have element-wise operations as consumers. It is a signal that non-tuned mmt4d kernels are not worse than the IREE default codegen pipeline. There are three categories of gap.
We can close the gap if we subtract the above overheads and Tracy overheads from the e2e latency. The Tracy overheads are fine; most of them come from tiny pack/unpack dispatches, and they are excluded in the regular benchmarking process.

Pack Fusion

I took the most critical fusion case as a case study. According to Tracy profiling, the actual execution time of the single generic op is 70 us. The single pack op case is 62 us. The generic + pack kernel case is 128.32 us. All of these numbers exclude IREE runtime overheads, so the packing overheads are not hidden by fusion. If we look into the asm dump, we'll notice that the registers are not fully utilized; only zmm0 to zmm11 are used. A simple "fix" is increasing the tiling sizes, which cuts the execution time to 67 us. Adjusting tiling sizes for pack ops can improve e2e latency by ~10 ms. However, it is not a good fix, because the total execution time is 11 ms while the single generic op only takes 1.47 ms; it is 7.4x slower. Here is the asm dump with the fix. The registers are still not fully utilized.

Single Pack and UnPack Ops

There are at least 9 ms of packing ops on constants. ConstEval can address the issue, but we haven't found a proper way to do it in IREE. The prototype and initial investigation can be found at #11360 (comment). ConstEval can kill 4 dispatches (7%) and reduce 361 kernel launches (15%) in total. Some pack ops are not fused with their producers because they are blocked by reshape ops. We cannot simply propagate the encoding across reshape ops. The advanced solution is early materialization of the encodings plus data layout propagation. The mid-term solution is teaching fusion to fuse them into a single dispatch and letting codegen handle the case. Eventually, we still need good kernels for pack and unpack ops. This has not been studied on x86, and we should teach codegen to generate better kernels for pack and unpack ops.
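To make the pack discussion concrete, here is a hedged reference of what a standalone LHS pack dispatch computes conceptually (plain C++, not the kernel IREE generates; `M0`/`K0` stand for the inner tile sizes, and the zero padding value is an assumption for illustration):

```cpp
// Conceptual reference for packing an MxK row-major LHS: pad up to multiples
// of (M0, K0) and rearrange into a [ceil(M/M0)][ceil(K/K0)][M0][K0] layout.
#include <cstddef>
#include <vector>

std::vector<float> pack_lhs(const std::vector<float>& src, std::size_t M,
                            std::size_t K, std::size_t M0, std::size_t K0,
                            float pad_value = 0.0f) {
  const std::size_t M1 = (M + M0 - 1) / M0, K1 = (K + K0 - 1) / K0;
  std::vector<float> dst(M1 * K1 * M0 * K0, pad_value);
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t k = 0; k < K; ++k)
      // Outer tile indices (m / M0, k / K0), inner offsets (m % M0, k % K0).
      dst[(((m / M0) * K1 + k / K0) * M0 + m % M0) * K0 + k % K0] =
          src[m * K + k];
  return dst;
}
```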
-
Related to this, I performed an experiment on x86 AVX512-based standalone f32 GEMM kernels (https://github.com/vmurali/matmul) with A transposed and B, C, D kept as is. Each size is run 10 times and averaged. The kernels don't use a threadpool (yet), and instead launch new threads (default number of threads = total number of hardware threads = 176 on the machine I ran on). Here are the results. The aligned cases run at around half the speed of the unaligned cases! This is because of cache set conflicts: the L1D has only 64 sets per way, each set holding a 64-byte line (= 16 f32s). A row size of 512 f32s (= 32x16) spans half the sets, so consecutive elements of a column of the transposed A land 32 sets apart and only two sets of the entire L1 cache are used until all the ways are exhausted (there are 12 ways, so after 24 fetches from the transposed A matrix the cache starts thrashing, leading to abysmal performance since the transposed A's columns don't stay in the cache while iterating over B's columns). Row sizes of 1024 and higher wrap the full set index space within a single row, so every column element maps to the same set and the 12 ways start thrashing after just 12 rows. This shows that there's a lot of performance left to be squeezed out, using both data layout changes and tighter codegen.
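A tiny sketch that makes the set-conflict arithmetic above concrete, assuming the 64-set, 12-way, 64-byte-line L1D parameters quoted in the comment (not verified against a specific microarchitecture): it counts how many distinct L1D sets a walk down one column of the transposed A touches for a few row sizes. A 512-f32 row stride aliases to only 2 sets; 1024 f32s and above alias to a single set.

```cpp
// Count distinct L1D sets touched by a strided column walk over transposed A.
#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <set>

int main() {
  const uint64_t kLineBytes = 64, kNumSets = 64;  // assumed L1D geometry
  for (uint64_t row_floats : {500, 512, 1024, 2048}) {
    const uint64_t stride_bytes = row_floats * sizeof(float);
    std::set<uint64_t> sets_touched;
    for (uint64_t row = 0; row < 1024; ++row)  // walk down one column
      sets_touched.insert((row * stride_bytes / kLineBytes) % kNumSets);
    std::printf("row size %4llu f32 -> %2zu distinct sets touched\n",
                (unsigned long long)row_floats, sets_touched.size());
  }
  return 0;
}
```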
-
Thanks both for the detailed analysis! I think we should go over this on Monday.
This is weird. Do you know why? I would expect a better memory access pattern to improve latency with the same tile sizes. An experiment that would be super useful is to pack without changing the data layout. That should help us understand the actual overhead of padding without transposing the data. Not sure if you have run that already or if it's feasible with the current implementation but it's something we need to know.
Do you have profile data for that? It would be good to also run a simpler experiment to understand the latency of an aligned load vs unaligned load without any cache noise. That should also give us very valuable information.
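One possible shape for that simpler experiment, as a hedged sketch (not from this thread; the buffer size, iteration count, and 4-byte misalignment are arbitrary choices, and a real harness would pin threads and guard more carefully against compiler hoisting): stream over an L1-resident buffer with aligned vs. unaligned AVX-512 loads and compare wall-clock time. Compile with something like -O2 -std=c++17 -mavx512f.

```cpp
#include <chrono>
#include <cstdio>
#include <immintrin.h>

alignas(64) static float buf[4096 + 16];  // ~16 KiB, stays resident in L1D

template <bool kAligned>
static double run(int iters) {
  // Offset by one float (4 bytes) to force misalignment in the unaligned case.
  const float* base = kAligned ? buf : buf + 1;
  __m512 acc = _mm512_setzero_ps();
  auto t0 = std::chrono::steady_clock::now();
  for (int it = 0; it < iters; ++it)
    for (int i = 0; i < 4096; i += 16) {
      if constexpr (kAligned)
        acc = _mm512_add_ps(acc, _mm512_load_ps(base + i));
      else
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(base + i));
    }
  auto t1 = std::chrono::steady_clock::now();
  alignas(64) float sink[16];
  _mm512_store_ps(sink, acc);  // keep the accumulator (and loads) alive
  std::printf("(checksum %g) ", sink[0]);
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const int iters = 200000;
  double aligned_s = run<true>(iters);
  double unaligned_s = run<false>(iters);
  std::printf("aligned loads:   %.3f s\n", aligned_s);
  std::printf("unaligned loads: %.3f s\n", unaligned_s);
  return 0;
}
```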
-
The default matmul configuration is using
MMT4d kernel study

The 384x128x512 matmul is critical in the MobileBert FP32 model. Here are asm dumps for the matmul and mmt4d kernels:

matmul.asm: https://gist.githubusercontent.com/hanhanW/36c038a7062dcd1c0488acdd26b29bbc/raw
mmt4d.asm: https://gist.githubusercontent.com/hanhanW/73a2891b9521bac4972f7f738ca9dbd6/raw

There is no instruction selection difference between the two; only the memory access patterns differ. The perf data shows that the mmt4d kernel is nicer to the memory system than matmul, and it remains a bit of a mystery why the observed latency was worse. The perf result of matmul:
The perf result of mmt4d:
After talking to Mahesh and Benoit, I think that maybe we should disable distribution and compare pure single-threaded kernels. So far the kernels are distributed but benchmarked single-threaded.

Side note: if the inputs become dynamic, the latency of the matmul kernels becomes 2x, while the mmt4d kernel has similar latency. It looks like LLVM codegen kicks in some optimizations when the input shapes are static.

Packing Kernel Study

There are two types of packing: one is packing on the LHS, and the other is packing on the RHS.

Packing on LHS

ASM dump: https://gist.githubusercontent.com/hanhanW/045bed2e1b74e7826bac43fc75ca62d0/raw
MLIR file: https://gist.githubusercontent.com/hanhanW/f1748b7bdb4a266a7958d09d50bece2d/raw

UnPack Kernels

Some unpack cases are good and some are bad; this needs more investigation.

ASM dump for 384x128x512 and 384x512x128 unpack kernels: https://gist.githubusercontent.com/hanhanW/18f881bf080d500aed7d7fd2c33fe362/raw
MLIR file: https://gist.githubusercontent.com/hanhanW/881ceeaaf5947177ff781d8e7eb58f61/raw
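For completeness, a hedged conceptual reference for unpack, mirroring the pack sketch earlier in the thread (again not the kernel IREE generates; `M0`/`N0` stand for the inner tile sizes of the packed result):

```cpp
// Conceptual reference for unpacking a [ceil(M/M0)][ceil(N/N0)][M0][N0] result
// back into an MxN row-major matrix, dropping the padded rows/columns.
#include <cstddef>
#include <vector>

std::vector<float> unpack_result(const std::vector<float>& packed,
                                 std::size_t M, std::size_t N,
                                 std::size_t M0, std::size_t N0) {
  const std::size_t N1 = (N + N0 - 1) / N0;
  std::vector<float> dst(M * N);
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t n = 0; n < N; ++n)
      dst[m * N + n] =
          packed[(((m / M0) * N1 + n / N0) * M0 + m % M0) * N0 + n % N0];
  return dst;
}
```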
-
Thanks, Hanhan! Super helpful! Let's talk about this tomorrow. I may have some answers to your questions and you may have some answers to mine :)
-
The ARM codegen (in an experimental path) has been switched to the data tiling approach. This discussion is about moving x86 to the data tiling approach. Currently, there are a lot of regressions compared with the default pipeline; see #11711 for the prototype. The x86 matmul codegen mostly uses DoubleTilingPadExpert, and with the hoist-padding features on we end up with an approach similar to data tiling. The goal is to triage and address the performance issues in enabling the data tiling approach for x86.
In the data tiling approach, IREE pads the input operands of the matmul to be aligned to 16 and sets encoding attributes on the tensor types in SetEncodingPass. This introduces set/unset encoding ops and a matmul on tensor types with encodings. The LLVMCPU codegen materializes the set/unset encoding ops into pack/unpack ops, and materializes the matmul on encoded tensor types into an mmt4d op, in LLVMCPUMaterializeEncodingPass. Then the pipeline is set up in the KernelConfig.cpp file.

Most of the models regress, including MobileBert FP32. I think we can start with MobileBert FP32 because we've been spending time on the model and there are only three kinds of matmul.
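For context, here is a hedged scalar sketch of the computation the pack + mmt4d form boils down to (a plain reference under assumed M0/N0/K0 tile sizes, not IREE's generated code; the indexing follows my reading of linalg.mmt4d, where the RHS inner tile is transposed):

```cpp
// Scalar reference for the packed matmul ("mmt4d"). Shapes assume M, N, K are
// already padded to multiples of M0, N0, K0.
//   lhs: [M/M0][K/K0][M0][K0]   rhs: [N/N0][K/K0][N0][K0]   acc: [M/M0][N/N0][M0][N0]
#include <cstddef>
#include <vector>

void mmt4d_reference(const std::vector<float>& lhs, const std::vector<float>& rhs,
                     std::vector<float>& acc, std::size_t M, std::size_t N,
                     std::size_t K, std::size_t M0, std::size_t N0, std::size_t K0) {
  const std::size_t M1 = M / M0, N1 = N / N0, K1 = K / K0;
  for (std::size_t m1 = 0; m1 < M1; ++m1)
    for (std::size_t n1 = 0; n1 < N1; ++n1)
      for (std::size_t k1 = 0; k1 < K1; ++k1)
        for (std::size_t m0 = 0; m0 < M0; ++m0)
          for (std::size_t n0 = 0; n0 < N0; ++n0)
            for (std::size_t k0 = 0; k0 < K0; ++k0)
              acc[((m1 * N1 + n1) * M0 + m0) * N0 + n0] +=
                  lhs[((m1 * K1 + k1) * M0 + m0) * K0 + k0] *
                  rhs[((n1 * K1 + k1) * N0 + n0) * K0 + k0];
}
```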
cc @pzread @MaheshRavishankar
Abbreviated Linux Benchmark Summary
@ commit d19e93fde4708dcaa2709330b431ce16dc4b6ae0 (vs. base c792899fa19b4b1c8aecdab56797def22925be5f)
Regressed Latencies 🚩
[Top 3 out of 55 results showed]
Improved Latencies 🎉
[Top 3 out of 12 results showed]
Regressed Compilation Times 🚩
[Top 3 out of 62 results showed]
Improved Compilation Times 🎉
[Top 3 out of 13 results showed]
Regressed Total Dispatch Sizes 🚩
[Top 3 out of 57 results showed]
Improved Total Dispatch Sizes 🎉
[Top 3 out of 8 results showed]
Regressed Total Artifact Sizes 🚩
[Top 3 out of 40 results showed]
Improved Total Artifact Sizes 🎉
[Top 3 out of 8 results showed]
For more information: