When is QCDQ performed in between quantized layers? #99
Unanswered
vselhakim1337 asked this question in Q&A
-
Hi all, I've been trying to learn QONNX (and FINN), and I'm trying to understand the QCDQ method a bit better (especially for 8-bit quantization), in particular the Quant operator. This is towards a broader goal of developing custom accelerators that use QONNX. My understanding from the QONNX paper is that if there are two consecutive layers with quantized weights, inputs, and outputs, then the integer output of the first layer can be passed directly to the integer input of the second. However, I wonder if this is actually the case, i.e. is some form of requantization step performed in between, just like in vanilla ONNX? If so, how would the scaling work in an FPGA implementation without an FP32 single-precision IP? Would it be safe to just use a fixed-point approximation and a regular int multiplier for the scaling? Or is there a different method used at the implementation level (e.g. with FINN)?

Thank you in advance!
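For concreteness, here is a minimal NumPy sketch of the kind of fixed-point requantization meant here, where the float scale is approximated by an integer multiplier plus a right shift (the function name and parameters are illustrative only, not taken from QONNX or FINN):

```python
import numpy as np

def requantize_fixed_point(acc, scale, shift_bits=16):
    # Approximate the float rescale factor (e.g. s_in * s_w / s_out) as
    # mult / 2**shift_bits, so only an integer multiply and a right shift
    # are needed at inference time -- no FP32 hardware.
    mult = int(round(scale * (1 << shift_bits)))
    rounding = 1 << (shift_bits - 1)  # round-half-up before the shift
    out = (acc.astype(np.int64) * mult + rounding) >> shift_bits
    return np.clip(out, -128, 127).astype(np.int8)  # clamp to int8 range

# int32 accumulators from a quantized matmul/conv, rescaled to int8
acc = np.array([12345, -6789, 250000], dtype=np.int32)
print(requantize_fixed_point(acc, scale=0.0123))  # e.g. [127 -83 127]
```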
Replies: 1 comment · 8 replies

-
On an FPGA, division is a rather difficult operator to implement (cheaply). However, there is a special case when the divisor is a power of two: in that case, division is just a shift operation, which is much cheaper to implement on an FPGA. So for a direct FPGA implementation, it is common to limit this scaling operation to a power of two, i.e. if you look at QKeras, there is an option to limit the scaling factor by choosing …
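To illustrate the power-of-two special case, here is a minimal NumPy sketch where the combined rescale factor is 2**(-shift), so the division reduces to a plain arithmetic right shift (the names and parameters are illustrative, not taken from QKeras or FINN):

```python
import numpy as np

def requantize_po2(acc, shift):
    # When the rescale factor is exactly 2**(-shift), the "division" is
    # just an arithmetic right shift -- on an FPGA this is essentially
    # free (wiring), no multiplier or divider required.
    rounding = 1 << (shift - 1)  # round-half-up before shifting
    out = (acc.astype(np.int64) + rounding) >> shift
    return np.clip(out, -128, 127).astype(np.int8)  # clamp to int8 range

# int32 accumulators rescaled by 2**-10 (i.e. divided by 1024)
acc = np.array([52345, -130000, 9000], dtype=np.int32)
print(requantize_po2(acc, shift=10))  # e.g. [51 -127 9]
```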