When is QCDQ performed in between quantized layers? #99
Unanswered
vselhakim1337 asked this question in Q&A
-
Hi all, I've been trying to learn QONNX (and FINN), and I'm trying to understand the QCDQ method a bit better (especially for 8-bit quantization), in particular the Quant operator. This is towards a broader goal of developing custom accelerators that use QONNX. My understanding from the QONNX paper is that if there are two consecutive layers with quantized weights, inputs, and outputs, then the integer output of the first layer can be passed directly to the integer input of the second. However, I wonder if this is actually the case, i.e. is some form of requantization step performed in between, just like in vanilla ONNX? If so, how would the scaling work in an FPGA implementation without an FP32 single-precision IP? Would it be safe to just use a fixed-point approximation and a regular int multiplier for the scaling? Or is there a different method used at the implementation level (e.g. with FINN)?

Thank you in advance!
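For concreteness, here is a minimal NumPy sketch of the kind of fixed-point requantization meant here, where the float scale is approximated by an integer multiplier plus a right shift (the function name and parameters are illustrative only, not taken from QONNX or FINN):

```python
import numpy as np

def requantize_fixed_point(acc, scale, shift_bits=16):
    # Approximate the float rescale factor (e.g. s_in * s_w / s_out) as
    # mult / 2**shift_bits, so only an integer multiply and a right shift
    # are needed at inference time -- no FP32 hardware.
    mult = int(round(scale * (1 << shift_bits)))
    rounding = 1 << (shift_bits - 1)  # round-half-up before the shift
    out = (acc.astype(np.int64) * mult + rounding) >> shift_bits
    return np.clip(out, -128, 127).astype(np.int8)  # clamp to int8 range

# int32 accumulators from a quantized matmul/conv, rescaled to int8
acc = np.array([12345, -6789, 250000], dtype=np.int32)
print(requantize_fixed_point(acc, scale=0.0123))  # e.g. [127 -83 127]
```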
Replies: 1 comment · 8 replies

-
On an FPGA, division is a rather difficult operator to implement (cheaply). However, there is a special case when the divisor is a power of two: in that case, division is just a shift operation, which is much cheaper to implement on an FPGA. So for a direct FPGA implementation, it is common to limit this scaling operation to a power of two, i.e. if you look at QKeras, there is an option to limit the scaling factor by choosing …
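To illustrate the power-of-two special case, here is a minimal NumPy sketch where the combined rescale factor is 2**(-shift), so the division reduces to a plain arithmetic right shift (the names and parameters are illustrative, not taken from QKeras or FINN):

```python
import numpy as np

def requantize_po2(acc, shift):
    # When the rescale factor is exactly 2**(-shift), the "division" is
    # just an arithmetic right shift -- on an FPGA this is essentially
    # free (wiring), no multiplier or divider required.
    rounding = 1 << (shift - 1)  # round-half-up before shifting
    out = (acc.astype(np.int64) + rounding) >> shift
    return np.clip(out, -128, 127).astype(np.int8)  # clamp to int8 range

# int32 accumulators rescaled by 2**-10 (i.e. divided by 1024)
acc = np.array([52345, -130000, 9000], dtype=np.int32)
print(requantize_po2(acc, shift=10))  # e.g. [51 -127 9]
```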