Replies: 2 comments
- I rebased the test branch for llama2 performance. I'll post it here along with the relevant compiler flags needed.
  Test branch: https://github.com/Max191/iree/tree/quantized-matmul-v2-testing
  Flags:
- Here are my own benchmark numbers for Llama2 in IREE compared against llama.cpp. There is a link to a spreadsheet with some more detail, but I will repeat the information here too.
  Benchmarks for Llama2 i4 Quantized:
---
A model can be imported from different frameworks (e.g., JAX, TF, PyTorch, etc.). The latencies will differ because the input dialect and the way each graph is represented vary by framework; the differences are probably around 10-20%.
We can use existing MLIR files imported from other frameworks for development if they are not yet ready in SHARK/Turbine.
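As a rough sketch of that development flow (the file names and the exact flag set are illustrative assumptions, not the configuration used in these experiments), compiling an already-imported MLIR file for CPU could look like:

```shell
# Compile an existing MLIR file (imported from JAX/TF/PyTorch) for the CPU backend.
# model.mlir and model.vmfb are placeholder paths.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  -o model.vmfb
```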
Performance
The performance numbers below were collected from IREE Benchmark Large and OpenXLA Benchmark. For more benchmark results (e.g., MobileNet, MobileBert, etc.), please take a look at the spreadsheet.
Device: GCP-c2-standard-60
Latency unit: ms
Note: For GPT2_117M, IREE only uses 15 threads.
Note: * means that there are correctness issues (#14601).
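As an aside, the worker-thread count can be capped explicitly when benchmarking. A minimal sketch, assuming a compiled gpt2.vmfb with a `forward` entry point and a placeholder input shape; the topology flag name can vary across IREE versions:

```shell
# Hypothetical invocation pinning the IREE task scheduler to 15 worker threads.
# gpt2.vmfb, the function name, and the input shape are placeholders.
iree-benchmark-module \
  --device=local-task \
  --module=gpt2.vmfb \
  --entry_function=forward \
  --input=1x1024xi64 \
  --task_topology_group_count=15
```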
There are several models in the x86 benchmark suites. More TF, Torch, and JAX models are available; we can add them to our benchmark suite. (Note that the TF models are being removed because it is becoming a maintenance burden to generate artifacts for them. Plus, everyone is moving to JAX.)
Potentially interesting models
There are some available models (not just LLMs) in MLIR format. We might be interested in some of them:
- ResNet
- A model with encoder and decoder blocks and cross-attention layers; it is used in stable diffusion.

There are even more models that generate MLIR listed here: https://github.com/openxla/openxla-benchmark?tab=readme-ov-file#supported-workloads
Bert Large as a proxy for LLMs
For LLMs, we can use the existing Bert Large model as a proxy. The PyTorch version was generated from the HuggingFace implementation, which should be very similar to https://huggingface.co/bert-large-uncased.
- Get MLIR files for the batch_size = [1, 24, 48] cases.
- Compile the model with your own flags, and benchmark them with `--entry_function=forward` and two inputs `--input=${batch_size}x384xi64`.
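For example, a single-batch run might look like the following sketch (bert_large.vmfb and the local-task device are placeholder assumptions; only `--entry_function` and the `--input` shapes come from the instructions above):

```shell
# Hypothetical benchmark for batch_size=1; bert_large.vmfb is a placeholder path.
# The two i64 inputs correspond to the ${batch_size}x384 shapes above.
iree-benchmark-module \
  --device=local-task \
  --module=bert_large.vmfb \
  --entry_function=forward \
  --input=1x384xi64 \
  --input=1x384xi64
```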
@MaheshRavishankar @bjacob @Max191 I reached out to @mariecwhite to learn about IREE CPU performance and to collect data from public benchmarks. Please take a look.