Replies: 2 comments
- I rebased the test branch for llama2 performance. I'll post it here along with the relevant compiler flags needed.
  Test branch: https://github.com/Max191/iree/tree/quantized-matmul-v2-testing
  Flags:
- Here are my own benchmark numbers for Llama2 in IREE compared against llama.cpp. There is a link to a spreadsheet with some more detail, but I will repeat the information here too.
  Benchmarks for Llama2 i4 Quantized:
---
A model can be imported from different frameworks (e.g., JAX, TF, PyTorch, etc.). The latencies will differ because the input dialect and the way each graph is represented vary by framework; the differences are probably around 10-20%.
We can use existing MLIR files imported from other frameworks for development if they are not yet ready in SHARK/Turbine.
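As a rough sketch of that development flow (the file names and the exact flag set are illustrative assumptions, not the configuration used in these experiments), compiling an already-imported MLIR file for CPU could look like:

```shell
# Compile an existing MLIR file (imported from JAX/TF/PyTorch) for the CPU backend.
# model.mlir and model.vmfb are placeholder paths.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  -o model.vmfb
```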
Performance
The performance numbers below were collected from IREE Benchmark Large and OpenXLA Benchmark. For more benchmark results (e.g., MobileNet, MobileBert, etc.), please take a look at the spreadsheet.
Device: GCP-c2-standard-60
Latency unit: ms
Note: For GPT2_117M, IREE only uses 15 threads.
Note: * means that there are correctness issues (#14601).
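As an aside, the worker-thread count can be capped explicitly when benchmarking. A minimal sketch, assuming a compiled gpt2.vmfb with a `forward` entry point and a placeholder input shape; the topology flag name can vary across IREE versions:

```shell
# Hypothetical invocation pinning the IREE task scheduler to 15 worker threads.
# gpt2.vmfb, the function name, and the input shape are placeholders.
iree-benchmark-module \
  --device=local-task \
  --module=gpt2.vmfb \
  --entry_function=forward \
  --input=1x1024xi64 \
  --task_topology_group_count=15
```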
There are several models in the x86 benchmark suites. More TF, Torch, and JAX models are available; we can add them to our benchmark suite. (Note that the TF models are being removed because it is becoming a maintenance burden to generate artifacts for them. Plus, everyone is moving to JAX.)
Potentially interesting models
There are some available models (not just LLMs) in MLIR format. We might be interested in some of them:
- ResNet
- A model with encoder and decoder blocks and cross-attention layers; it is used in stable diffusion.

There are even more models that generate MLIR listed here: https://github.com/openxla/openxla-benchmark?tab=readme-ov-file#supported-workloads
Bert Large as a proxy for LLMs
For LLMs, we can use the existing Bert Large model as a proxy. The PyTorch version was generated from the HuggingFace implementation, which should be very similar to https://huggingface.co/bert-large-uncased.
- Get MLIR files for the batch_size = [1, 24, 48] cases.
- Compile the model with your own flags, and benchmark them with `--entry_function=forward` and two inputs `--input=${batch_size}x384xi64`.
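For example, a single-batch run might look like the following sketch (bert_large.vmfb and the local-task device are placeholder assumptions; only `--entry_function` and the `--input` shapes come from the instructions above):

```shell
# Hypothetical benchmark for batch_size=1; bert_large.vmfb is a placeholder path.
# The two i64 inputs correspond to the ${batch_size}x384 shapes above.
iree-benchmark-module \
  --device=local-task \
  --module=bert_large.vmfb \
  --entry_function=forward \
  --input=1x384xi64 \
  --input=1x384xi64
```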
@MaheshRavishankar @bjacob @Max191 I reached out to @mariecwhite to learn about IREE CPU performance and to collect data from public benchmarks. Please take a look.