Searching for Low-Bit Weights in Quantized Neural Networks
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators. However, the quantization functions used in most conventional quantization methods are non-differentiable, which increases the optimization difficulty of quantized networks. Compared with full-precision parameters (\emph{i.e.}, 32-bit floating-point numbers), low-bit values are selected from a much smaller set. For example, there are only 16 possibilities in the 4-bit space. Thus, we propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search for them accurately. In particular, each weight is represented as a probability distribution over the discrete value set. The probabilities are optimized during training, and the values with the highest probability are selected to establish the desired quantized network. Experimental results on benchmarks demonstrate that the proposed method produces quantized neural networks with higher performance than state-of-the-art methods on both image classification and super-resolution tasks.
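The search formulation in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the softmax-with-temperature relaxation, the names `soft_weight`/`hard_weight`, and the uniform 4-bit value set are assumptions.

```python
import numpy as np

def soft_weight(logits, values, temperature=1.0):
    """Relax a discrete weight to an expectation over the value set.

    logits: unnormalized scores, one per discrete value (the searchable variables)
    values: the low-bit value set, e.g. the 16 levels of 4-bit quantization
    """
    z = logits / temperature
    p = np.exp(z - z.max())          # numerically stable softmax
    p = p / p.sum()
    return float(np.dot(p, values))  # differentiable surrogate used during training

def hard_weight(logits, values):
    """After training, select the value with the highest probability."""
    return float(values[int(np.argmax(logits))])

# 4-bit example: 16 uniformly spaced levels in [-1, 1]
values = np.linspace(-1.0, 1.0, 16)
logits = np.zeros(16)
logits[3] = 5.0                      # in training, gradients would shape these scores
```

Lowering the temperature sharpens the distribution, so the soft expectation approaches the hard argmax selection used at the end of the search.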
Author Feedback for NeurIPS paper: Searching for Low-Bit Weights in Quantized Neural Networks
We sincerely thank the four reviewers for their valuable comments. Reviewers #2 and #4 both raised concerns about the ablation study, so we address these first. We have already conducted the ablation study on the temperature in Sec. We will fix the typos in the updated version and proofread the paper to make it more readable.
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Panferov, Andrei, Chen, Jiale, Tabesh, Soroush, Castro, Roberto L., Nikdan, Mahdi, Alistarh, Dan
One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
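The two quantization ingredients named in the abstract can be sketched in a minimal form. This is not the QuEST implementation: the function names, the grid search over clipping scales, and the signed-integer level range are assumptions for illustration.

```python
import numpy as np

def hadamard_transform(x):
    """Normalized fast Walsh-Hadamard transform (length must be a power of 2).

    Rotating by a Hadamard matrix spreads outliers, making the distribution
    of values closer to Gaussian and thus easier to quantize.
    """
    n = x.shape[0]
    y = x.copy()
    h = 1
    while h < n:
        y = y.reshape(-1, h * 2)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.reshape(n)
        h *= 2
    return y / np.sqrt(n)

def mse_optimal_quantize(x, bits=4, n_grid=64):
    """Symmetric uniform quantization with an MSE-optimal clipping scale,
    found here by a simple grid search over candidate scales."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    max_abs = np.abs(x).max()
    best_scale, best_err = max_abs / levels, np.inf
    for frac in np.linspace(0.1, 1.0, n_grid):
        scale = frac * max_abs / levels
        q = np.clip(np.round(x / scale), -levels - 1, levels) * scale
        err = np.mean((x - q) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale
    return np.clip(np.round(x / best_scale), -levels - 1, levels) * best_scale

# demo: quantize a Hadamard-rotated weight vector
w = np.random.default_rng(0).normal(size=64)
w_q = mse_optimal_quantize(hadamard_transform(w), bits=4)
```

Because the normalized transform is orthonormal (it is its own inverse), the quantized values can be rotated back without changing the quantization error, which is what makes the rotation "free" from an accuracy standpoint.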
Review for NeurIPS paper: Searching for Low-Bit Weights in Quantized Neural Networks
Weaknesses: 1) A similar idea of learning an auxiliary differentiable network has already been introduced in the following paper (IJCAI, 2019). The main difference is that this paper learns multiple bits for each weight, whereas binary weights and representations would undoubtedly be more cost-efficient. More importantly, the authors did not discuss this closely related reference. 2) I am confused by Eq. (1): the values v are discrete numbers, while p is the probability that an element of W takes the i-th discrete value.
Review for NeurIPS paper: Searching for Low-Bit Weights in Quantized Neural Networks
The paper proposes a novel end-to-end gradient-based optimization for searching discrete low-bit weights in quantized networks. After reading the reviews, the rebuttal, and the discussion among reviewers, the paper is clearly recognized as novel and well executed. I would encourage the authors to further improve their work by better clarifying the decay strategy for the temperature in the camera-ready version, and by adding a comparison with SGDR scheduling, as pointed out by one of the reviewers. It would also be nice to mention how the proposed approach relates to "Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization".