Replies: 8 comments 16 replies
-
cc @bjacob
-
That sounds good to me! I understand that you mean an experimentation path (given the current regressions), similar to what has been done for ARM. We should also include RISC-V as part of this experimentation, as we have put a lot of effort into keeping the x86 and RISC-V backends aligned. That should also give us more certainty about the applicability of the data tiling approach to different targets and keep the entropy in the CPU backends under control. We haven't done much tuning for matmuls on RISC-V, so I wouldn't expect RISC-V to add too much work other than benchmarking. In this regard, I also planned to enable masking for matmuls in the foreseeable future. With that we can combine data tiling + masking, where data tiling is used for data layout transformations + alignment (but not for value padding). This will allow us to propagate the "aligned allocations" beyond the immediate producer of the matmul (see the discussion with @nicolasvasilache in internal chat). Happy to coordinate on this, but I think there shouldn't be too much interference initially.
-
Thanks for the awesome work @bjacob ! I landed all the needed PRs. We don't have any compilation or runtime issues now. I started a new benchmark in #11993. It looks like more models are improved now, but we still have regressions on some models. I'll try to grab some profile traces and share the findings here. (I just got back to work recently, so it will likely happen later this week or next week.)
-
I've been working on teaching x86 about data tiling and pack fusion. Here is a study on the MobileBert fp32 model. (The Google Doc version of the write-up is https://docs.google.com/document/d/1c6XzXY6m3yso92mJNc-gTmw2EoLYAYLyELnTEYMIgfI/edit?usp=sharing)

Performance Breakdown

Matmuls (NxMxK):
The configuration of data tiling is based on #12637
Gap Analysis

The existing inner tile sizes are not the best options. I ran a few configurations, found that {8, 2, 32} is noticeably better, and used it for benchmarking MobileBert fp32. The matmul kernels are noticeably better with data tiling. That does not mean the performance of the mmt4d kernels is better than pure matmul, because those mmt4d kernels do not have element-wise operations as consumers. It is a signal that non-tuned mmt4d kernels are not worse than the IREE default codegen pipeline. There are three categories of gap.
We can close the gap if we subtract the above overheads and Tracy overheads from the e2e latency. The Tracy overheads are fine; most of them come from tiny pack/unpack dispatches, and they are excluded in the regular benchmarking process.

Pack Fusion

I took the most critical fusion case as a case study. According to Tracy profiling, the actual execution time of the single generic op is 70 us. The single pack op case is 62 us. The generic + pack kernel case is 128.32 us. All of these numbers exclude IREE runtime overheads, so the packing overheads are not hidden by fusion. If we look into the asm dump, we'll notice that the registers are not fully utilized; only zmm0 to zmm11 are used. A simple "fix" is increasing the tiling sizes, which cuts the execution time to 67 us. Adjusting tiling sizes for pack ops can improve e2e latency by ~10 ms. However, it is not a good fix, because the total execution time is 11 ms while the single generic op only takes 1.47 ms; it is 7.4x slower. Here is the asm dump with the fix. The registers are still not fully utilized.

Single Pack and UnPack Ops

There are at least 9 ms of packing ops on constants. ConstEval can address the issue, but we haven't found a proper way to do it in IREE. The prototype and initial investigation can be found at #11360 (comment). ConstEval can kill 4 dispatches (7%) and reduce 361 kernel launches (15%) in total. Some pack ops are not fused with their producers because they are blocked by reshape ops. We cannot simply propagate the encoding across reshape ops. The advanced solution is early materialization of the encodings plus data layout propagation. The mid-term solution is teaching fusion to fuse them into a single dispatch and letting codegen handle the case. Eventually, we still need good kernels for pack and unpack ops. This has not been studied on x86, and we should teach codegen to generate better kernels for pack and unpack ops.
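To make the pack discussion concrete, here is a hedged reference of what a standalone LHS pack dispatch computes conceptually (plain C++, not the kernel IREE generates; `M0`/`K0` stand for the inner tile sizes, and the zero padding value is an assumption for illustration):

```cpp
// Conceptual reference for packing an MxK row-major LHS: pad up to multiples
// of (M0, K0) and rearrange into a [ceil(M/M0)][ceil(K/K0)][M0][K0] layout.
#include <cstddef>
#include <vector>

std::vector<float> pack_lhs(const std::vector<float>& src, std::size_t M,
                            std::size_t K, std::size_t M0, std::size_t K0,
                            float pad_value = 0.0f) {
  const std::size_t M1 = (M + M0 - 1) / M0, K1 = (K + K0 - 1) / K0;
  std::vector<float> dst(M1 * K1 * M0 * K0, pad_value);
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t k = 0; k < K; ++k)
      // Outer tile indices (m / M0, k / K0), inner offsets (m % M0, k % K0).
      dst[(((m / M0) * K1 + k / K0) * M0 + m % M0) * K0 + k % K0] =
          src[m * K + k];
  return dst;
}
```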
-
Related to this, I performed an experiment on x86 AVX512-based standalone f32 GEMM kernels (https://github.com/vmurali/matmul) with A transposed and B, C, D kept as is. Each size is run 10 times and averaged. The kernels don't use a threadpool (yet), and instead launch new threads (default number of threads = total number of hardware threads = 176 on the machine I ran on). Here are the results. The aligned cases run at around half the speed of the unaligned cases! This is because of cache set conflicts: the L1D has only 64 sets per way, each set holding a 64-byte line (= 16 f32s). A row size of 512 f32s (= 32x16) spans half the sets, so consecutive elements of a column of the transposed A land 32 sets apart and only two sets of the entire L1 cache are used until all the ways are exhausted (there are 12 ways, so after 24 fetches from the transposed A matrix the cache starts thrashing, leading to abysmal performance since the transposed A's columns don't stay in the cache while iterating over B's columns). Row sizes of 1024 and higher wrap the full set index space within a single row, so every column element maps to the same set and the 12 ways start thrashing after just 12 rows. This shows that there's a lot of performance left to be squeezed out, using both data layout changes and tighter codegen.
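A tiny sketch that makes the set-conflict arithmetic above concrete, assuming the 64-set, 12-way, 64-byte-line L1D parameters quoted in the comment (not verified against a specific microarchitecture): it counts how many distinct L1D sets a walk down one column of the transposed A touches for a few row sizes. A 512-f32 row stride aliases to only 2 sets; 1024 f32s and above alias to a single set.

```cpp
// Count distinct L1D sets touched by a strided column walk over transposed A.
#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <set>

int main() {
  const uint64_t kLineBytes = 64, kNumSets = 64;  // assumed L1D geometry
  for (uint64_t row_floats : {500, 512, 1024, 2048}) {
    const uint64_t stride_bytes = row_floats * sizeof(float);
    std::set<uint64_t> sets_touched;
    for (uint64_t row = 0; row < 1024; ++row)  // walk down one column
      sets_touched.insert((row * stride_bytes / kLineBytes) % kNumSets);
    std::printf("row size %4llu f32 -> %2zu distinct sets touched\n",
                (unsigned long long)row_floats, sets_touched.size());
  }
  return 0;
}
```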
-
Thanks both for the detailed analysis! I think we should go over this on Monday.
This is weird. Do you know why? I would expect a better memory access pattern to improve latency with the same tile sizes. An experiment that would be super useful is to pack without changing the data layout. That should help us understand the actual overhead of padding without transposing the data. Not sure if you have run that already or if it's feasible with the current implementation but it's something we need to know.
Do you have profile data for that? It would be good to also run a simpler experiment to understand the latency of an aligned load vs unaligned load without any cache noise. That should also give us very valuable information.
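One possible shape for that simpler experiment, as a hedged sketch (not from this thread; the buffer size, iteration count, and 4-byte misalignment are arbitrary choices, and a real harness would pin threads and guard more carefully against compiler hoisting): stream over an L1-resident buffer with aligned vs. unaligned AVX-512 loads and compare wall-clock time. Compile with something like -O2 -std=c++17 -mavx512f.

```cpp
#include <chrono>
#include <cstdio>
#include <immintrin.h>

alignas(64) static float buf[4096 + 16];  // ~16 KiB, stays resident in L1D

template <bool kAligned>
static double run(int iters) {
  // Offset by one float (4 bytes) to force misalignment in the unaligned case.
  const float* base = kAligned ? buf : buf + 1;
  __m512 acc = _mm512_setzero_ps();
  auto t0 = std::chrono::steady_clock::now();
  for (int it = 0; it < iters; ++it)
    for (int i = 0; i < 4096; i += 16) {
      if constexpr (kAligned)
        acc = _mm512_add_ps(acc, _mm512_load_ps(base + i));
      else
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(base + i));
    }
  auto t1 = std::chrono::steady_clock::now();
  alignas(64) float sink[16];
  _mm512_store_ps(sink, acc);  // keep the accumulator (and loads) alive
  std::printf("(checksum %g) ", sink[0]);
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const int iters = 200000;
  double aligned_s = run<true>(iters);
  double unaligned_s = run<false>(iters);
  std::printf("aligned loads:   %.3f s\n", aligned_s);
  std::printf("unaligned loads: %.3f s\n", unaligned_s);
  return 0;
}
```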
-
The default matmul configuration is using
MMT4d kernel study

The 384x128x512 matmul is critical in the MobileBert FP32 model. Here are asm dumps for the matmul and mmt4d kernels:

matmul.asm: https://gist.githubusercontent.com/hanhanW/36c038a7062dcd1c0488acdd26b29bbc/raw
mmt4d.asm: https://gist.githubusercontent.com/hanhanW/73a2891b9521bac4972f7f738ca9dbd6/raw

There is no instruction selection difference between the two; only the memory access patterns differ. The perf data shows that the mmt4d kernel is nicer to the memory system than matmul, and it remains a bit of a mystery why the observed latency was worse. The perf result of matmul:
The perf result of mmt4d:
After talking to Mahesh and Benoit, I think that maybe we should disable distribution and compare pure single-threaded kernels. So far the kernels are distributed but benchmarked single-threaded.

Side note: if the inputs become dynamic, the latency of the matmul kernels becomes 2x, while the mmt4d kernel has similar latency. It looks like LLVM codegen kicks in some optimizations when the input shapes are static.

Packing Kernel Study

There are two types of packing: one is packing on the LHS, and the other is packing on the RHS.

Packing on LHS

ASM dump: https://gist.githubusercontent.com/hanhanW/045bed2e1b74e7826bac43fc75ca62d0/raw
MLIR file: https://gist.githubusercontent.com/hanhanW/f1748b7bdb4a266a7958d09d50bece2d/raw

UnPack Kernels

Some unpack cases are good and some are bad; this needs more investigation.

ASM dump for 384x128x512 and 384x512x128 unpack kernels: https://gist.githubusercontent.com/hanhanW/18f881bf080d500aed7d7fd2c33fe362/raw
MLIR file: https://gist.githubusercontent.com/hanhanW/881ceeaaf5947177ff781d8e7eb58f61/raw
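For completeness, a hedged conceptual reference for unpack, mirroring the pack sketch earlier in the thread (again not the kernel IREE generates; `M0`/`N0` stand for the inner tile sizes of the packed result):

```cpp
// Conceptual reference for unpacking a [ceil(M/M0)][ceil(N/N0)][M0][N0] result
// back into an MxN row-major matrix, dropping the padded rows/columns.
#include <cstddef>
#include <vector>

std::vector<float> unpack_result(const std::vector<float>& packed,
                                 std::size_t M, std::size_t N,
                                 std::size_t M0, std::size_t N0) {
  const std::size_t N1 = (N + N0 - 1) / N0;
  std::vector<float> dst(M * N);
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t n = 0; n < N; ++n)
      dst[m * N + n] =
          packed[(((m / M0) * N1 + n / N0) * M0 + m % M0) * N0 + n % N0];
  return dst;
}
```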
-
Thanks, Hanhan! Super helpful! Let's talk about this tomorrow. I may have some answers to your questions and you may have some answers to mine :)
-
The ARM codegen (in an experimental path) has been switched to the data tiling approach. This discussion is about moving x86 to the data tiling approach. Currently, there are a lot of regressions compared with the default pipeline; see #11711 for the prototype. The x86 matmul codegen mostly uses DoubleTilingPadExpert, and with the hoist-padding features on we end up with an approach similar to data tiling. The goal is to triage and address the performance issues in enabling the data tiling approach for x86.
In the data tiling approach, IREE pads the input operands of the matmul to be aligned to 16 and sets encoding attributes on the tensor types in SetEncodingPass. This introduces set/unset encoding ops and a matmul on tensor types with encodings. The LLVMCPU codegen materializes the set/unset encoding ops into pack/unpack ops, and materializes the matmul on encoded tensor types into an mmt4d op, in LLVMCPUMaterializeEncodingPass. Then the pipeline is set up in the KernelConfig.cpp file.

Most of the models regress, including MobileBert FP32. I think we can start with MobileBert FP32 because we've been spending time on the model and there are only three kinds of matmul.
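For context, here is a hedged scalar sketch of the computation the pack + mmt4d form boils down to (a plain reference under assumed M0/N0/K0 tile sizes, not IREE's generated code; the indexing follows my reading of linalg.mmt4d, where the RHS inner tile is transposed):

```cpp
// Scalar reference for the packed matmul ("mmt4d"). Shapes assume M, N, K are
// already padded to multiples of M0, N0, K0.
//   lhs: [M/M0][K/K0][M0][K0]   rhs: [N/N0][K/K0][N0][K0]   acc: [M/M0][N/N0][M0][N0]
#include <cstddef>
#include <vector>

void mmt4d_reference(const std::vector<float>& lhs, const std::vector<float>& rhs,
                     std::vector<float>& acc, std::size_t M, std::size_t N,
                     std::size_t K, std::size_t M0, std::size_t N0, std::size_t K0) {
  const std::size_t M1 = M / M0, N1 = N / N0, K1 = K / K0;
  for (std::size_t m1 = 0; m1 < M1; ++m1)
    for (std::size_t n1 = 0; n1 < N1; ++n1)
      for (std::size_t k1 = 0; k1 < K1; ++k1)
        for (std::size_t m0 = 0; m0 < M0; ++m0)
          for (std::size_t n0 = 0; n0 < N0; ++n0)
            for (std::size_t k0 = 0; k0 < K0; ++k0)
              acc[((m1 * N1 + n1) * M0 + m0) * N0 + n0] +=
                  lhs[((m1 * K1 + k1) * M0 + m0) * K0 + k0] *
                  rhs[((n1 * K1 + k1) * N0 + n0) * K0 + k0];
}
```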
cc @pzread @MaheshRavishankar
Abbreviated Linux Benchmark Summary
@ commit d19e93fde4708dcaa2709330b431ce16dc4b6ae0 (vs. base c792899fa19b4b1c8aecdab56797def22925be5f)
Regressed Latencies 🚩
[Top 3 out of 55 results showed]
Improved Latencies 🎉
[Top 3 out of 12 results showed]
Regressed Compilation Times 🚩
[Top 3 out of 62 results showed]
Improved Compilation Times 🎉
[Top 3 out of 13 results showed]
Regressed Total Dispatch Sizes 🚩
[Top 3 out of 57 results showed]
Improved Total Dispatch Sizes 🎉
[Top 3 out of 8 results showed]
Regressed Total Artifact Sizes 🚩
[Top 3 out of 40 results showed]
Improved Total Artifact Sizes 🎉
[Top 3 out of 8 results showed]
For more information: