 one-bit quantization


One-Bit Quantization for Random Features Models

arXiv.org Machine Learning

The success of deep neural networks in tasks such as image recognition, natural language processing, and reinforcement learning has come at the cost of escalating computational and memory requirements. Modern models, often comprising billions of parameters, demand significant resources for training and inference, rendering them impractical for deployment on resource-constrained devices like mobile phones, embedded systems, or IoT devices. To address this challenge, weight quantization, which reduces the precision of neural network weights, has emerged as a promising technique to lower memory footprint and accelerate inference. In particular, one-bit quantization, which restricts weights to {+1, -1}, offers extreme compression (e.g., 32× memory reduction for 32-bit floats) and enables efficient hardware implementations using bitwise operations. Various works have explored network quantization in recent years. In particular, for Large Language Models (LLMs), some post-training approaches have been able to reduce the model size via fine-tuning. Examples of such approaches include GPTQ (Frantar et al., 2022), which can quantize a 175-billion-parameter GPT model to 4 bits, and QuIP (Chee et al., 2023), which compresses Llama 2 70B to 2 and 3 bits. Furthermore, quantization-aware training approaches, such as BitNet (Wang et al., 2023) and BitNet b1.58 (Ma et al., 2024), have been able to achieve one-bit language models with performance comparable to that of models in the same weight class. For a recent survey on efficient LLMs we refer to Xu et al. (2024).
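The 32× figure and the role of bitwise operations can be illustrated with a minimal sketch. It assumes plain sign quantization (zero mapped to +1) and a hypothetical bit layout of eight weights per byte; the function names are illustrative and are not taken from the paper.

```python
import random
import struct

def quantize_one_bit(weights):
    """Map each weight to +1 or -1 by sign (zero is sent to +1, an arbitrary choice)."""
    return [1 if w >= 0 else -1 for w in weights]

def pack_signs(signs):
    """Pack signs into bytes, eight per byte: +1 is stored as bit 1, -1 as bit 0."""
    packed = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def binary_dot(packed_a, packed_b, n):
    """Dot product of two length-n {+1, -1} vectors from their packed bits.

    XOR marks the positions where the signs differ. Each differing position
    contributes -1 and every other position +1, so dot = n - 2 * mismatches.
    Padding bits are 0 in both inputs, so they never count as mismatches.
    """
    a = int.from_bytes(packed_a, "little")
    b = int.from_bytes(packed_b, "little")
    mismatches = bin(a ^ b).count("1")
    return n - 2 * mismatches

random.seed(0)
n = 1024
w = [random.gauss(0, 1) for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
w_q, x_q = quantize_one_bit(w), quantize_one_bit(x)
w_p, x_p = pack_signs(w_q), pack_signs(x_q)

float32_bytes = len(struct.pack(f"{n}f", *w))  # 4 bytes per weight
print(float32_bytes // len(w_p))               # 32
assert binary_dot(w_p, x_p, n) == sum(a * b for a, b in zip(w_q, x_q))
```

Here the dot product needs only an XOR and a bit count, which is the kind of operation that hardware implementations of one-bit networks can exploit.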


Autoencoder-Based Error Correction Coding for One-Bit Quantization

arXiv.org Machine Learning

This paper proposes a novel deep learning-based error correction coding scheme for AWGN channels under the constraint of one-bit quantization in the receivers. Specifically, it is first shown that the optimum error correction code that minimizes the probability of bit error can be obtained by perfectly training a special autoencoder, where "perfectly" refers to converging to the global minimum. However, perfect training is not possible in most cases. To approach the performance of a perfectly trained autoencoder with suboptimal training, we propose utilizing turbo codes as an implicit regularization, i.e., using a concatenation of a turbo code and an autoencoder. It is empirically shown that this design gives nearly the same performance as the hypothetically perfectly trained autoencoder, and we also provide a theoretical proof of why that is so. The proposed coding method is as bandwidth efficient as the integrated (outer) turbo code, since the autoencoder exploits the excess bandwidth from pulse shaping and packs signals more intelligently thanks to sparsity in neural networks. Our results show that the proposed coding scheme at finite block lengths outperforms conventional turbo codes even for QPSK modulation. Furthermore, the proposed coding method can make one-bit quantization operational even for 16-QAM.
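Why 16-QAM is a demanding case for a one-bit receiver can be seen in a small noiseless sketch. It assumes one sample per symbol, a sign quantizer on each of the in-phase and quadrature components, and the standard 16-QAM amplitude levels; none of these details are stated in the abstract, and this is not the paper's scheme.

```python
import itertools

LEVELS = (-3, -1, 1, 3)  # standard 16-QAM amplitude levels on each axis (assumption)

def one_bit(value):
    """One-bit quantizer: keep only the sign of a real sample."""
    return 1 if value >= 0 else -1

# All 16 constellation points as complex numbers I + jQ.
constellation = [complex(i, q) for i, q in itertools.product(LEVELS, LEVELS)]

# Quantize the in-phase and quadrature parts separately and collect distinct outputs.
outputs = {(one_bit(s.real), one_bit(s.imag)) for s in constellation}

print(len(constellation), len(outputs))  # 16 4
```

Under these assumptions the 16 points collapse to 4 distinguishable outputs, so amplitude information has to be recovered some other way, for instance from the excess bandwidth the abstract mentions.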