Ternary Quantization


Tequila: Trapping-free Ternary Quantization for Large Language Models

Huang, Hong, Wu, Decheng, Cen, Rui, Yu, Guanghua, Li, Zonghang, Liu, Kai, Zhu, Jianchen, Chen, Peng, Liu, Xue, Wu, Dapeng

arXiv.org Artificial Intelligence

Quantization techniques are essential for deploying Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical on such devices. Ternary quantization sidesteps this problem, but such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training on massive data. We identify the core issue as deadzone trapping: a large number of weights become stuck at the deadzone boundary. These weights receive only noisy, uninformative gradients, preventing a stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. The repurposed weights provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. On the ARC benchmark, it achieves a >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within a <1% gap) with a 3.0× inference speedup. Tequila thus offers a highly practical and efficient path to deploying advanced LLMs in resource-constrained environments. Recent advancements in large language models (LLMs) (Wu et al., 2023; Floridi & Chiriatti, 2020; Zhang et al., 2022) have demonstrated remarkable success across a wide range of applications, from conversational chatbots to creative writing.
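The deadzone mechanism the abstract describes can be sketched in a few lines. The bias-repurposing step below is one plausible reading of the abstract, not the paper's implementation; `delta` and `eps` are illustrative hyperparameters.

```python
import numpy as np

def ternary_forward(x, W, delta, eps=1e-3):
    """Sketch: ternary projection plus deadzone-trapped weights
    repurposed as a continuous (bias-like) contribution."""
    Q = np.sign(W) * (np.abs(W) > delta)        # {-1, 0, +1} projection
    trapped = np.abs(np.abs(W) - delta) < eps   # stuck at the deadzone boundary
    y = x @ Q.T                                 # hardware-friendly ternary matmul
    # Trapped weights contribute their full-precision value, so they emit a
    # continuous forward signal and can receive direct gradients in training.
    y += x @ (W * trapped).T
    return y

x = np.random.default_rng(0).normal(size=(4, 8))
W = np.random.default_rng(1).normal(scale=0.05, size=(16, 8))
print(ternary_forward(x, W, delta=0.05).shape)
```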


Binary and Ternary Quantization Can Enhance Feature Discrimination

Lu, Weizhi, Chen, Mingrui, Li, Weiyu

arXiv.org Artificial Intelligence

Quantization is widely applied in machine learning to reduce computational and storage costs for both data and models. Considering that classification tasks are fundamental to the field, it is crucial to investigate how quantization impacts classification performance. Traditional research has focused on quantization errors, assuming that larger errors generally lead to lower classification accuracy. However, this assumption lacks a solid theoretical foundation and often contradicts empirical observations. For example, despite introducing significant errors, $\{0,1\}$-binary and $\{0, \pm1\}$-ternary quantized data have sometimes achieved classification accuracy comparable to, or even superior to, that of full-precision data. To explain this phenomenon convincingly, a more accurate evaluation of classification performance is required. To this end, we propose a direct analysis of the feature discrimination of quantized data, instead of focusing on quantization errors. Our analysis reveals that both binary and ternary quantization can potentially enhance, rather than degrade, the feature discrimination of the original data. This finding is supported by classification experiments on both synthetic and real data.
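The question the paper studies can be illustrated on toy data. The discrimination proxy, class distributions, and threshold below are my own illustrative assumptions, not the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500                      # illustrative sizes
a = rng.normal(+0.1, 1.0, (n, d))    # class 1
b = rng.normal(-0.1, 1.0, (n, d))    # class 2

def ternary(x, t):
    # {0, +/-1}-ternary quantization of the data with threshold t
    return np.sign(x) * (np.abs(x) > t)

def discrimination(u, v):
    # crude proxy: between-class mean distance over within-class spread
    return np.linalg.norm(u.mean(0) - v.mean(0)) / (u.std() + v.std())

print("full precision:", discrimination(a, b))
print("ternary:       ", discrimination(ternary(a, 1.0), ternary(b, 1.0)))
```

Depending on the threshold, the ternary ratio can approach or exceed the full-precision one, since large-magnitude entries carry most of the class signal.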


An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

Steinmetz, Cody, Childress, Gavin, Herbst, Aaron, Jones, Gavin, Singh, Jasdeep, Vang, Eli, Weinstock, Keagan

arXiv.org Artificial Intelligence

Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
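A minimal sketch of the recipe the abstract describes: RMS normalization immediately before a ternary linear projection. The threshold and per-tensor scaling rule here are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm: divide by the root-mean-square over the feature dim
    # (no mean subtraction, no learned gain in this sketch)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def ternary_linear(x, W, delta):
    # 1.58-bit weights: {-1, 0, +1} times a per-tensor scale
    mask = np.abs(W) > delta
    scale = np.abs(W[mask]).mean() if mask.any() else 1.0
    return x @ (scale * np.sign(W) * mask).T

# RMSNorm immediately before the ternary projection keeps activations
# well-scaled, which the abstract credits for stable fine-tuning.
x = np.random.default_rng(0).normal(size=(2, 16))
W = np.random.default_rng(1).normal(scale=0.02, size=(32, 16))
y = ternary_linear(rmsnorm(x), W, delta=0.01)
print(y.shape)
```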


Reviews: HitNet: Hybrid Ternary Recurrent Neural Network

Neural Information Processing Systems

The authors study the problem of quantizing recurrent neural networks. While extremely low-bit quantization (2-bit quantization) has achieved strong results for CNNs, such quantization has so far performed poorly for recurrent neural networks. The goal of this paper is thus to identify the reason for this observation and to propose an extreme quantization scheme better suited to RNNs. First, the authors compare different weight quantizations: 2-bit uniform quantization, thresholded ternary quantization (TTQ), and Bernoulli ternary quantization (BTQ). This comparison is performed using an RNN trained on Penn TreeBank.
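The two ternary weight quantizers the review mentions can be written down compactly. These are generic textbook forms, not necessarily the exact variants evaluated in the paper:

```python
import numpy as np

def ttq(W, t):
    # Thresholded ternary quantization: deterministic, |w| <= t maps to 0
    return np.sign(W) * (np.abs(W) > t)

def btq(W, rng):
    # Bernoulli ternary quantization: keep sign(w) with probability
    # min(|w|, 1), otherwise 0 (a stochastic rounding scheme)
    keep = rng.random(W.shape) < np.clip(np.abs(W), 0.0, 1.0)
    return np.sign(W) * keep

W = np.random.default_rng(0).normal(scale=0.5, size=(4, 4))
print(ttq(W, 0.3))
print(btq(W, np.random.default_rng(1)))
```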



Ternary Quantization: A Survey

Liu, Dan, Liu, Xue

arXiv.org Artificial Intelligence

Inference time, model size, and accuracy are critical for deploying deep neural network models. Numerous research efforts have sought to compress neural network models while achieving faster inference and higher accuracy. Pruning and quantization are the mainstream methods to this end. During model quantization, converting the individual float values of layer weights to low-precision ones can substantially reduce the computational overhead and improve inference speed. Many quantization methods have been studied, for example, vector quantization, low-bit quantization, and binary/ternary quantization. This survey focuses on ternary quantization. We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods from the perspective of projection functions and optimization methods.
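One widely used projection function in the line of work the survey covers is the Ternary Weight Networks rule, where the 0.7 factor is that paper's heuristic for an approximately optimal threshold:

```python
import numpy as np

def twn_project(W):
    """Threshold-based ternary projection in the TWN style:
    delta = 0.7 * mean|W|, alpha = mean magnitude above delta."""
    delta = 0.7 * np.abs(W).mean()
    mask = np.abs(W) > delta
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(W) * mask

W = np.random.default_rng(0).normal(size=(64, 64))
T = twn_project(W)
print(np.unique(np.sign(T)))   # the three ternary levels, scaled by alpha
```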


Hyperspherical Loss-Aware Ternary Quantization

Liu, Dan, Liu, Xue

arXiv.org Artificial Intelligence

Most existing works use projection functions for ternary quantization in discrete space. Scaling factors and thresholds are used in some cases to improve model accuracy. However, the gradients used for optimization are inaccurate and result in a notable accuracy gap between the full-precision and ternary models. To get more accurate gradients, some works gradually increase the discrete portion of the full-precision weights in the forward pass, e.g., using a temperature-based sigmoid function. Instead of directly performing ternary quantization in discrete space, we push full-precision weights close to ternary ones through a regularization term prior to ternary quantization. In addition, inspired by the temperature-based method, we introduce a re-scaling factor to obtain more accurate gradients by simulating the derivative of the sigmoid function. Experimental results show that our method significantly improves the accuracy of ternary quantization in both image classification and object detection tasks.
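The two ingredients the abstract names can be sketched as follows. Both forms are illustrative guesses at the shape of the method, not the paper's exact formulation; `alpha` and `tau` are assumed names:

```python
import numpy as np

def ternary_regularizer(W, alpha):
    # Pull each weight toward its nearest target in {-alpha, 0, +alpha};
    # one illustrative form of "pushing weights close to ternary ones".
    targets = np.array([-alpha, 0.0, alpha])
    d = np.abs(W[..., None] - targets)        # distance to each target
    return (d.min(axis=-1) ** 2).sum()

def rescaled_sigmoid_grad(w, tau):
    # Surrogate gradient simulating the derivative of a temperature-based
    # sigmoid; tau plays the role of the re-scaling factor.
    s = 1.0 / (1.0 + np.exp(-w / tau))
    return s * (1.0 - s) / tau

W = np.linspace(-1.0, 1.0, 5)
print(ternary_regularizer(W, alpha=0.5))   # 0.5
```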


Smart Ternary Quantization

Morin, Grégoire, Razani, Ryan, Nia, Vahid Partovi, Sari, Eyyüb

arXiv.org Machine Learning

Neural network models are resource hungry. Low-bit quantization, such as binary and ternary quantization, is a common approach to alleviating these resource requirements. Ternary quantization provides a more flexible model and often beats binary quantization in accuracy, but it doubles memory and increases computation cost. Mixed quantization-depth models, on the other hand, allow a trade-off between accuracy and memory footprint. In such models, the quantization depth is often chosen manually (a tedious task) or tuned with a separate optimization routine (which requires training a quantized network multiple times). Here, we propose Smart Ternary Quantization (STQ), which modifies the quantization depth directly through an adaptive regularization function, so that the model is trained only once. The method jumps between binary and ternary quantization during training. We show its application to image classification.


Deep Neural Network Compression with Single and Multiple Level Quantization

Xu, Yuhui, Wang, Yongzhuang, Zhou, Aojun, Lin, Weiyao, Xiong, Hongkai

arXiv.org Machine Learning

Network quantization is an effective solution for compressing deep neural networks for practical use. Existing network quantization methods cannot sufficiently exploit depth information to generate low-bit compressed networks. In this paper, we propose two novel network quantization approaches: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit (ternary) quantization. We are the first to consider network quantization at both the width and depth levels. At the width level, parameters are divided into two parts: one for quantization and the other for re-training to eliminate the quantization loss. SLQ leverages the distribution of the parameters to improve the width level. At the depth level, we introduce incremental layer compensation to quantize layers iteratively, which decreases the quantization loss in each iteration. The proposed approaches are validated with extensive experiments on state-of-the-art neural networks, including AlexNet, VGG-16, GoogLeNet, and ResNet-18. Both SLQ and MLQ achieve impressive results.


Extremely Low Bit Neural Network: Squeeze the Last Bit Out With ADMM

Leng, Cong (Alibaba Group) | Dou, Zesheng (Alibaba Group) | Li, Hao (Alibaba Group) | Zhu, Shenghuo (Alibaba Group) | Jin, Rong (Alibaba Group)

AAAI Conferences

Although deep learning models are highly effective for various learning tasks, their high computational costs prohibit deployment in scenarios where either memory or computational resources are limited. In this paper, we focus on compressing and accelerating deep models whose network weights are represented with very small numbers of bits, referred to as extremely low-bit neural networks. We model this problem as a discretely constrained optimization problem. Borrowing the idea of the Alternating Direction Method of Multipliers (ADMM), we decouple the continuous parameters from the discrete constraints of the network and cast the original hard problem into several subproblems. We propose to solve these subproblems using extragradient and iterative quantization algorithms, which lead to considerably faster convergence than conventional optimization methods. Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches for extremely low-bit neural networks.
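The decoupling idea can be sketched in a few lines. The discrete set {-alpha, 0, +alpha} and the single consensus round below are simplifications; the paper's continuous subproblem (an extragradient step on the task loss) is omitted:

```python
import numpy as np

def project_ternary(W, alpha):
    # Euclidean projection onto {-alpha, 0, +alpha}: snap each weight
    # to the nearest element of the discrete set.
    return alpha * np.clip(np.round(W / alpha), -1, 1)

# ADMM-style splitting (one round, illustrative): the continuous weights W
# and their discrete copy G are tied together through a dual variable U.
rng = np.random.default_rng(0)
W = rng.normal(size=10)
U = np.zeros(10)
alpha = 0.5
G = project_ternary(W + U, alpha)   # discrete subproblem: closed-form projection
U = U + W - G                       # dual update keeps W and G in consensus
print(np.unique(G))
```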