QTIP: Quantization with Trellises and Incoherence Processing

Neural Information Processing Systems

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput.
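
As a point of reference for what weight quantization does, here is a minimal round-to-nearest sketch in NumPy. It is a generic PTQ baseline, not QTIP's trellis-coded quantizer, and the symmetric per-tensor 4-bit scheme is an illustrative assumption.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Round-to-nearest symmetric quantization: a generic PTQ baseline,
    not QTIP's trellis-coded quantizer."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed codes
    scale = np.abs(w).max() / qmax             # per-tensor scale (illustrative)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                            # dequantize as q * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_rtn(w)
# 4-bit codes cut weight memory roughly 8x versus fp32, which is why
# PTQ helps when inference throughput is bound by memory bandwidth.
```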


QuIP: 2-Bit Quantization of Large Language Models With Guarantees

Neural Information Processing Systems

We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight.
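
The incoherence-processing step (2) can be sketched directly: multiply the weights and the proxy Hessian by random orthogonal matrices, quantize in the rotated basis, then undo the rotation. The Gaussian-QR construction below is an illustrative stand-in (QuIP itself uses cheaper structured orthogonal matrices), and the adaptive rounding of step (1) is left abstract.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix
    (an illustrative stand-in for QuIP's structured constructions)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))             # sign fix for uniformity

rng = np.random.default_rng(0)
d_out, d_in = 512, 1024
W = rng.standard_normal((d_out, d_in))         # layer weights
X = rng.standard_normal((4096, d_in))          # calibration activations
H = X.T @ X / len(X)                           # quadratic proxy Hessian

U = random_orthogonal(d_out, rng)
V = random_orthogonal(d_in, rng)
W_inc = U @ W @ V                              # incoherent weights: magnitudes even out
H_inc = V.T @ H @ V                            # matching Hessian transform

# Step (1), left abstract here: quantize W_inc with adaptive rounding,
# then post-process with the inverse rotations:
#   W_hat = U.T @ quantized(W_inc) @ V.T
# which recovers the original layer up to quantization error.
```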


RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm

Yang, Yongyi, Gao, Jianyang, Hu, Wei

arXiv.org Artificial Intelligence

Post-training Quantization (PTQ) has become a widely used technique for improving the inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and an inflexible choice of the target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of the randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA's efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA.
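
The bit-allocation component lends itself to a small worked example. The sketch below is hypothetical: it assumes per-layer sensitivity estimates `sensitivity[l][b]` (estimated error of layer l at b bits) and minimizes their sum under a total bit budget via dynamic programming; the paper's actual objective and solver may differ.

```python
def allocate_bits(sensitivity, total_budget, choices=(2, 3, 4, 8)):
    """Hypothetical dynamic-programming sketch of sensitivity-aware bit
    allocation; not RaanA's exact AllocateBits formulation.
    sensitivity[l][b] = estimated quantization error of layer l at b bits."""
    dp = {0: (0.0, [])}                        # bits used -> (error, per-layer widths)
    for layer in sensitivity:
        nxt = {}
        for used, (err, widths) in dp.items():
            for b in choices:
                if used + b > total_budget:
                    continue
                cand = (err + layer[b], widths + [b])
                if used + b not in nxt or cand[0] < nxt[used + b][0]:
                    nxt[used + b] = cand
        dp = nxt
    return min(dp.values())[1]                 # lowest-error feasible allocation

# Toy sensitivities: layer 0 is the most quantization-sensitive.
sens = [{2: 9.0, 3: 4.0, 4: 1.0, 8: 0.1},
        {2: 2.0, 3: 1.0, 4: 0.5, 8: 0.05},
        {2: 1.0, 3: 0.5, 4: 0.2, 8: 0.02}]
print(allocate_bits(sens, total_budget=12))    # -> [4, 4, 4] on this toy input
```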


Compressing Large Language Models using Low Rank and Low Precision Decomposition

Neural Information Processing Systems

Due to the correlated nature of the language syntax and semantics learned during training, the weight matrices of LLMs often exhibit redundancy, which manifests as a low-rank structure. This redundancy suggests the potential for compression without substantial loss in performance.
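
The low-rank half of such a decomposition is just truncated SVD; the sketch below shows only that half and omits the low-precision residual factor the paper pairs it with.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Truncated SVD: the low-rank half of a low-rank + low-precision
    decomposition (the quantized residual factor is omitted here)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] * s[:rank]                 # (d_out, rank), absorbs singular values
    R = Vt[:rank]                              # (rank, d_in)
    return L, R                                # W ≈ L @ R

W = np.random.randn(1024, 1024)
L, R = low_rank_approx(W, rank=128)
# Storage falls from d_out*d_in to rank*(d_out + d_in) parameters:
# 1024*1024 -> 128*(1024 + 1024), a 4x reduction before any quantization.
```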



PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

Neural Information Processing Systems

There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices.
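
For context on what PV-Tuning moves beyond: the straight-through estimator (STE) rounds in the forward pass and pretends rounding was the identity in the backward pass. Below is a minimal PyTorch sketch of that baseline trick, not of PV-Tuning's own update rule.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Straight-through estimator: round forward, identity backward.
    This is the baseline gradient trick that PV-Tuning improves on,
    not PV-Tuning itself."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                        # gradient passes straight through

w = torch.randn(8, requires_grad=True)
loss = (RoundSTE.apply(w) ** 2).sum()
loss.backward()
print(w.grad)                                  # 2 * round(w): nonzero despite rounding
```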

