Abstract and 1. Introduction

  2. Related work

  3. Method

    3.1. Uniform quantizer

    3.2. IGQ-ViT

    3.3. Group size allocation

  4. Experiments

    4.1. Implementation details and 4.2. Results

    4.3. Discussion

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. More implementation details

B. Compatibility with existing hardware

C. Latency on practical devices

D. Application to DETR

Network quantization. Network quantization aims at reducing the bit-widths of weights and activations of neural networks. QAT methods simulate the quantization process by applying a round function to the weights and activations of the network. Since the derivatives of the round function are either zero or infinite, they approximate the gradients (e.g., using the straight-through estimator [3]) to train the network with backpropagation. These methods also adjust the derivatives of the round function [17, 18] or train quantization parameters jointly with network weights based on task losses [11, 16]. For better convergence of the training process, many heuristics have been introduced, e.g., progressively shrinking bit-widths [44] or freezing parts of the network weights [29, 43]. Networks quantized with QAT show performance comparable to or even better than their full-precision counterparts. However, the quantization process is computationally demanding, requiring a significant amount of training time. PTQ offers an alternative approach to quantizing neural networks. Instead of training full-precision models and simulating the quantization process at training time, PTQ methods calibrate quantization parameters (e.g., quantization intervals) using a subset of training samples. Early efforts focus on optimizing the quantization parameters to minimize the difference between floating-point and quantized values [2, 28]. Another line of research proposes to consider the distributions of weights and/or activations to design quantizers. For instance, the work of [12] observes that network weights follow a bell-shaped distribution. Based on this, it introduces piecewise linear quantizers that assign different quantization intervals according to the magnitudes of activations, performing better than uniform quantizers. Recent PTQ methods learn whether to round network weights up or down by using a reconstruction error of layer outputs [27] or exploiting the Hessian of training losses [19], and they have proven effective on CNN architectures (e.g., ResNet [13], MobileNetV2 [31]).
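As a concrete illustration of the mechanics above, the snippet below is a minimal PyTorch sketch (not code from any of the cited works) of a b-bit uniform quantizer whose round function is backpropagated through with a straight-through estimator, and whose scale is calibrated min-max style from a few samples; the function name uniform_quantize_ste and the calibration rule are assumptions made for the example.

```python
# Minimal sketch (illustrative, not from the cited works): a uniform quantizer whose
# round function is bypassed in the backward pass via a straight-through estimator (STE).
import torch

def uniform_quantize_ste(x: torch.Tensor, scale: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Fake-quantize x with an unsigned b-bit uniform quantizer.

    scale: quantization interval (step size), e.g., calibrated from a few samples.
    The STE is realized with the standard (x_q - x).detach() + x trick, so the round
    function behaves as the identity during backpropagation.
    """
    qmax = 2 ** bits - 1
    x_int = torch.clamp(torch.round(x / scale), 0, qmax)  # map to the integer grid
    x_q = x_int * scale                                    # de-quantize back to floats
    return x + (x_q - x).detach()                          # STE: gradient flows as identity


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.rand(8, requires_grad=True)
    # A simple PTQ-style calibration: choose the scale so the observed max maps to qmax.
    scale = x.detach().max() / (2 ** 4 - 1)
    y = uniform_quantize_ste(x, scale, bits=4)
    y.sum().backward()
    print(x.grad)  # all ones: the round function's gradient is replaced by the identity
```

Because the rounding step is wrapped in the STE trick, the printed gradients are all ones, which is exactly the identity approximation of the round function described above.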

Transformer quantization. While ViTs [10] and their variants [25, 34] have become increasingly popular in computer vision, the unique structure and characteristics of ViT architectures make network quantization challenging. For example, PTQ methods for CNNs [2, 19, 27, 28] do not perform well on quantizing softmax attentions and GELU activations in transformers, suggesting that directly applying them to ViT quantization results in significant performance degradation [26]. To date, only a limited number of PTQ methods have been developed for ViTs. The work of [26] estimates quantization parameters that maximize similarities between full-precision and quantized outputs of linear operations, and proposes to preserve the relative order of attention values after quantization. APQ-ViT [9] introduces a calibration metric to minimize the discrepancies between full-precision and quantized outputs, while maintaining the power-law distribution of softmax attentions. PTQ4ViT [40] introduces twin uniform quantizers to handle the asymmetric distributions of softmax attentions and GELU activations effectively. Most PTQ methods for ViTs exploit a single quantizer for all channels, suggesting that they do not consider the distributions of activation values across channels, which typically exhibit extreme scale variations. Recent works [21, 23] attempt to alleviate the scale variation problem efficiently. FQ-ViT [23] proposes to consider inter-channel scale variations for LayerNorm [1], and exploits channel-wise quantizers with the constraint that the ratios of quantization intervals are power-of-two values. This enables using bit-shift operations, computing the mean and variance of LayerNorm at the integer level. The scale reparameterization technique, introduced by RepQ-ViT [21], makes it possible to use layer-wise quantizers, instead of channel-wise ones, by adjusting the affine factors of LayerNorm and the weights of FC layers. However, this technique applies to the activations of LayerNorm only, and does not fully address the inter-channel scale variations for other layers in transformers.
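To see why such inter-channel scale variations are problematic for a single quantizer, the toy sketch below (an illustration of the problem only, not an implementation of FQ-ViT or RepQ-ViT) compares a layer-wise min-max uniform quantizer with channel-wise ones on activations whose channels have very different dynamic ranges; the helper minmax_uniform_quantize and the chosen channel scales are hypothetical.

```python
# Toy illustration: channels with small dynamic ranges are crushed when a single
# layer-wise scale must also cover the channel with the largest range.
import torch

def minmax_uniform_quantize(x: torch.Tensor, bits: int, dim=None) -> torch.Tensor:
    """Min-max calibrated asymmetric uniform quantizer (toy version).

    dim=None -> one (layer-wise) scale/offset for the whole tensor.
    dim=0    -> statistics reduced over tokens, i.e., a separate (channel-wise)
                scale/offset for every channel.
    """
    if dim is None:
        lo, hi = x.min(), x.max()
    else:
        lo, hi = x.amin(dim=dim, keepdim=True), x.amax(dim=dim, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / (2 ** bits - 1)
    return torch.round((x - lo) / scale) * scale + lo


if __name__ == "__main__":
    torch.manual_seed(0)
    # Four channels with very different ranges, mimicking strong inter-channel scale variations.
    act = torch.randn(1024, 4) * torch.tensor([0.1, 0.5, 1.0, 20.0])
    layer_err = (minmax_uniform_quantize(act, 4) - act).abs().mean(dim=0)
    chan_err = (minmax_uniform_quantize(act, 4, dim=0) - act).abs().mean(dim=0)
    print("layer-wise error per channel:  ", layer_err.tolist())
    print("channel-wise error per channel:", chan_err.tolist())
```

On the small-range channels, the channel-wise quantizers yield a much lower error than the single layer-wise one, which is the effect the methods above try to obtain without paying the full cost of per-channel quantization.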

Similar to ours, the works of [4, 7, 32, 36] adopt group quantization techniques for transformers. For instance, Q-BERT [32] and VS-Quant [7] divide consecutive channels uniformly into a number of groups without considering the dynamic range of each channel, and thus the channels assigned to each group do not follow similar distributions. PEG [4] alleviates this issue by sorting the channels of activations w.r.t. their dynamic ranges during calibration, before grouping them. Quantformer [36] proposes to use a differentiable search [6, 24] for QAT in order to group channels of activation maps. In the group quantization techniques of [4, 7, 32], however, the channels assigned to particular groups are fixed after calibrating pretrained networks for PTQ, which makes them inappropriate for ViTs, whose channel distributions vary significantly across input instances. In contrast, our approach applies group quantization along channels of activation maps and tokens of softmax attentions dynamically at runtime for each input instance, without additional parameters for PTQ.
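The sketch below gives a rough picture of instance-aware grouping in this spirit; it is not the authors' IGQ-ViT algorithm, but a simplified stand-in that, for each input instance, sorts channels by their dynamic range and assigns consecutive channels in that order to a fixed number of groups, each quantized with its own min-max uniform quantizer. The function group_quantize_dynamic, the group count, and the bit-width are illustrative assumptions.

```python
# Simplified sketch of instance-aware group quantization (not the authors' exact algorithm):
# group assignments are recomputed at runtime from the per-channel ranges of each input.
import torch

def group_quantize_dynamic(act: torch.Tensor, bits: int = 4, num_groups: int = 8) -> torch.Tensor:
    """act: (tokens, channels) activation map of a single input instance."""
    qmax = 2 ** bits - 1
    rng = act.amax(dim=0) - act.amin(dim=0)   # per-channel dynamic range for this instance
    order = torch.argsort(rng)                # channels with similar ranges become neighbors
    groups = torch.chunk(order, num_groups)   # split sorted channels into contiguous groups
    out = torch.empty_like(act)
    for idx in groups:                        # one min-max uniform quantizer per group
        g = act[:, idx]
        lo, hi = g.min(), g.max()
        scale = (hi - lo).clamp(min=1e-8) / qmax
        out[:, idx] = torch.round((g - lo) / scale) * scale + lo
    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    act = torch.randn(197, 384) * torch.logspace(-1, 1, 384)  # strong inter-channel scale variation
    q = group_quantize_dynamic(act, bits=4, num_groups=8)
    print((q - act).abs().mean().item())
```

Recomputing the assignment for every input instance is what separates this kind of scheme from the calibration-time grouping of [4, 7, 32], where the channel-to-group mapping is frozen after calibration.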

Authors:

(1) Jaehyeon Moon, Yonsei University and Articron;

(2) Dohyung Kim, Yonsei University;

(3) Junyong Cheon, Yonsei University;

(4) Bumsub Ham, Yonsei University (corresponding author).


This paper is available on arxiv under CC BY-NC-ND 4.0 Deed (Attribution-Noncommercial-Noderivs 4.0 International) license.