Goto

Collaborating Authors

 gptq 0


Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

arXiv.org Artificial Intelligence

The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it can near the accuracy that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.


Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization

arXiv.org Machine Learning

Layer-wise post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining. However, recent progress in this line of research is saturating, underscoring the need to revisit its core limitation and explore further improvements. This study identifies a critical bottleneck in existing layer-wise PTQ methods: the accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this, we propose Quantization Error Propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error which enable compensating for accumulated quantization errors. Additionally, we introduce a tunable propagation mechanism that allows for control over both propagation strength and computational overhead, making the framework adaptable to various architectures and resource constraints. Empirical evaluation on LLaMA2 models (7B, 13B, 70B) demonstrate that incorporating QEP into standard layer-wise PTQ pipelines outperforms standard PTQ methods. Notably, QEP yields substantial performance improvements under extreme low-bit quantization settings.


A Comprehensive Evaluation of Quantization Strategies for Large Language Models

arXiv.org Artificial Intelligence

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge \& capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.