IREE Compiler support Model Quantization. #12005

matrix97317 · 2023-01-31T09:22:25Z

matrix97317
Jan 31, 2023

Can you support model quantization components (INT8，INT4，etc） on the IREE stack?
Currently existing some projects support this, like https://github.com/sophgo/tpu-mlir

antiagainst · 2023-02-01T18:16:37Z

antiagainst
Feb 1, 2023
Collaborator

Quantization in general is a broad topic; it's touching many layers across the stack, involving model authoring, framework exporting, which hardware we are talking about, etc. So the answer depends on what specifics you are interested with.

In general, IREE compiles input models down to runtime scheduling logic and accelerator (GPU, CPU, etc.) executables. We support a broad range of hardware here---arm/x86_64/risv for CPU, various kinds of GPUs (AMD, Apple, ARM, Intel, NVIDIA, Qualcomm, etc.). Different hardware may have different capabilities regarding int8 / int4 / etc. support, and different API / software stack may further not expose those smaller bitwidth yet (and may expose it in the future). For example, if you drive older generations of NVIDIA GPUs using CUDA, you don't have native int4 support there. If you drive GPUs via Vulkan for any kind of GPUs, no native int4 support too, but certain hardware (e.g., newer generation of NVIDIA) does support it; so it's not exposed yet.

So you can see that it's a diverse landscape to simply answer as a binary yes or no. :) But still coming to answer your question, as long as the input model is exported to MLIR with proper int8/int4/etc. and the target hardware can natively support it, we should support the compilation flow and generate performant code. For targets that don't, we can also emulate it with other bitwidths (int32) to make it runnable at least; not gonna be performant though. Specifically, right now the support for int8 across various CPU/GPU targets are progressing well, esp. for mobile focused architectures like ARM CPU or Vulkan for GPU. Models should be runnable and we are working on adding more accelerated implementations by going through native int8 intrinsics. For int4, we haven't looked into it much yet.

0 replies

matrix97317 · 2023-02-03T02:50:58Z

matrix97317
Feb 3, 2023
Author

Thanks. Therefore, we can use the following workflow to build a quantitative model on top of IREE.
FP32 Model-> INT8 Model-> INT8 Model IR -> IREE.

1 reply

antiagainst Feb 3, 2023
Collaborator

Yup it should be doable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IREE Compiler support Model Quantization. #12005

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

IREE Compiler support Model Quantization. #12005

matrix97317 Jan 31, 2023

Replies: 2 comments · 1 reply

antiagainst Feb 1, 2023 Collaborator

matrix97317 Feb 3, 2023 Author

antiagainst Feb 3, 2023 Collaborator

matrix97317
Jan 31, 2023

Replies: 2 comments 1 reply

antiagainst
Feb 1, 2023
Collaborator

matrix97317
Feb 3, 2023
Author

antiagainst Feb 3, 2023
Collaborator