Optimal Formats for Weight Quantisation
Douglas Orr, Luka Ribar, Carlo Luschi
arXiv.org Artificial Intelligence
Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large, and formats are often chosen empirically. In this paper, we propose a framework for the systematic design and analysis of quantisation formats. By connecting the question of format design with classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem for which entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they too exploit variable-length encoding. Finally, using the relationship between the Fisher information and the KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when applied to large language models.

Weight quantisation enables large deep learning models to run on low-resource hardware and edge devices by reducing storage and memory-bandwidth usage. It can be seen as an optimisation problem, where the goal is to retain the behaviour of the high-precision reference model while reducing the total number of bits needed to store its parameters. This naturally splits into two sub-problems, format design and quantisation procedure, both of which are highly active areas of research. We focus on the format-design question, i.e., how to choose a representation space for model parameters. This is largely independent of the quantisation procedure, which aims to find an optimal point in that space.
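The squared-error objective mentioned above is easy to make concrete. The following is a minimal NumPy sketch (the function name and block size are illustrative assumptions, not taken from the paper) of a block-scaled fixed-length format: each block of weights shares one absolute-maximum scale, values are rounded to a symmetric integer grid, and the squared quantisation error is measured against the original weights.

```python
import numpy as np

def block_scaled_quantise(weights, block_size=32, bits=4):
    """Quantise a 1-D weight vector with one shared scale per block.

    A sketch of block-scaled symmetric quantisation: each block of
    `block_size` values shares its absolute-maximum as the scale, and
    values are rounded to a symmetric `bits`-bit integer grid.
    """
    n_levels = 2 ** (bits - 1) - 1               # e.g. 7 levels for 4-bit symmetric
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero blocks
    codes = np.round(blocks / scales * n_levels) # integer codes, in [-7, 7] for 4-bit
    dequantised = codes / n_levels * scales      # reconstructed values
    return dequantised.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_hat = block_scaled_quantise(w)
mse = float(np.mean((w - w_hat) ** 2))           # squared quantisation error
```

Under the framework described in the abstract, `mse` is the proxy objective: formats (block size, bit-width, non-linear curve) would be compared by the squared error they achieve per stored bit.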
Sep-26-2025