One-Bit Quantization for Random Features Models
Akhtiamov, Danil, Ghane, Reza, Hassibi, Babak
The success of deep neural networks in tasks such as image recognition, natural language processing, and reinforcement learning has come at the cost of escalating computational and memory requirements. Modern models, often comprising billions of parameters, demand significant resources for training and inference, rendering them impractical for deployment on resource-constrained devices such as mobile phones, embedded systems, or IoT devices. To address this challenge, weight quantization--reducing the precision of neural network weights--has emerged as a promising technique to lower memory footprint and accelerate inference. In particular, one-bit quantization, which restricts weights to {-1, +1}, offers extreme compression (e.g., a 32x memory reduction relative to 32-bit floats) and enables efficient hardware implementations using bitwise operations. Various works have explored network quantization in recent years. In particular, for Large Language Models (LLMs), some post-training quantization methods have been able to reduce the model size via fine-tuning. Examples of such approaches include GPTQ Frantar et al. (2022), which can quantize a 175-billion-parameter GPT model to 4 bits, and QuIP Chee et al. (2023), which compresses Llama 2 70B to 2 and 3 bits. Furthermore, quantization-aware training approaches, such as BitNet Wang et al. (2023) and BitNet 1.58b Ma et al. (2024), have achieved one-bit language models with performance comparable to full-precision models of the same size. For a recent survey on efficient LLMs we refer to Xu et al. (2024).
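To make the one-bit scheme concrete, the following is a minimal sketch of sign-plus-scale quantization, a common choice in the binarization literature; the per-tensor scale `alpha = mean(|w|)` is one standard heuristic and is an illustrative assumption here, not the specific scheme used by any of the papers cited above.

```python
import numpy as np

def one_bit_quantize(w):
    """Quantize a weight tensor to {-1, +1} with a per-tensor scale.

    Stores one bit per weight plus a single float scale, versus
    32 bits per weight for float32 -- roughly a 32x reduction.
    """
    alpha = float(np.mean(np.abs(w)))  # per-tensor scale (illustrative choice)
    q = np.sign(w)
    q[q == 0] = 1.0                    # map exact zeros to +1 so weights stay one-bit
    return alpha, q

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
alpha, q = one_bit_quantize(w)
w_hat = alpha * q                      # dequantized approximation of w
```

At inference time, the `{-1, +1}` matrix can be packed into bitmaps so that matrix-vector products reduce to additions/subtractions (or XNOR-popcount on suitable hardware), with the scale applied once at the end.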
Oct-21-2025