mxfp4
Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration
Cuyckens, Stef, Yi, Xiaoling, Geens, Robin, Dumoulin, Joren, Wiesner, Martin, Fang, Chao, Verhelst, Marian
Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design at both the MAC and system level and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W for MXINT8, MXFP8/6, and MXFP4, respectively, with a throughput of 64, 256, and 512 GOPS.
Block Rotation is All You Need for MXFP4 Quantization
Shao, Yuantian, Wang, Peisong, Chen, Yuanteng, Xu, Chang, Wei, Zhihui, Cheng, Jian
Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 -- a new FP4 format with support from various hardware vendors (NVIDIA, AMD, Intel) -- raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Research Report > Promising Solution (0.68)
- Research Report > New Finding (0.66)
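To make the contrast between global and block rotation in the entry above concrete, here is a minimal pure-Python sketch. It uses an orthonormal Walsh-Hadamard transform as the rotation, which is an assumption for illustration: the abstract does not specify the rotation matrix, and the helper names `hadamard_transform` and `block_rotate` are hypothetical, not taken from the paper. The block size of 32 matches the MX block size.

```python
import math

def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def block_rotate(vec, block_size=32):
    """Rotate each contiguous block independently (hypothetical helper)."""
    assert len(vec) % block_size == 0
    out = []
    for s in range(0, len(vec), block_size):
        out.extend(hadamard_transform(vec[s:s + block_size]))
    return out

# One outlier in the first of two 32-element blocks.
x = [0.0] * 64
x[0] = 8.0

global_rot = hadamard_transform(x)  # outlier energy spread over all 64 entries
block_rot = block_rotate(x)         # outlier energy stays inside its own block

print(global_rot[0], global_rot[63])                 # 1.0 1.0
print(round(block_rot[0], 3), max(map(abs, block_rot[32:])))  # 1.414 0.0
```

With a global rotation the outlier leaks into the second block and changes the range that block's scale has to cover; with block rotation the second block is untouched. This only illustrates the energy-redistribution effect the abstract names, not the paper's full analysis.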
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Chen, Mengzhao, Wu, Meng, Jin, Hui, Yuan, Zhihang, Liu, Jing, Zhang, Chaoyi, Li, Yunshui, Huang, Jie, Ma, Jin, Xue, Zeyue, Liu, Zhiheng, Bin, Xingyan, Luo, Ping
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
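As a rough illustration of what fine-grained MX quantization with block size 32 looks like for the 4-bit case, the following pure-Python sketch quantizes one block to MXFP4: a shared power-of-two scale plus FP4 (E2M1) elements. The function name `quantize_block_mxfp4` is hypothetical, the scale rule (align the block maximum with the top E2M1 binade) is one common choice, and ties are broken toward the smaller magnitude for simplicity rather than round-to-nearest-even.

```python
import math

# Magnitudes representable by the FP4 (E2M1) element type.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MX block size
FP4_EMAX = 2     # exponent of the largest E2M1 binade (4.0 == 2**2)

def quantize_block_mxfp4(block):
    """Return (shared_scale, dequantized_values) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Power-of-two shared scale derived from the block maximum.
    scale = 2.0 ** (math.floor(math.log2(amax)) - FP4_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])        # saturate at 6.0
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q, v) * scale)
    return scale, out

# 31 small values and one outlier sharing a single scale.
block = [0.3] * (BLOCK_SIZE - 1) + [12.0]
scale, deq = quantize_block_mxfp4(block)
print(scale, deq[0], deq[-1])  # 2.0 0.0 12.0
```

The single outlier forces a scale of 2.0, so every 0.3 in the block rounds to zero. This is the outlier sensitivity that several of the entries on this page try to mitigate.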
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Egiazarian, Vage, Castro, Roberto L., Kuznedelev, Denis, Panferov, Andrei, Kurtic, Eldar, Pandit, Shubhra, Marques, Alexandre, Kurtz, Mark, Ashkboos, Saleh, Hoefler, Torsten, Alistarh, Dan
The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by fusing the rotation into the weights and computing the activation rotation quickly online. This leads to speedups vs. FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears the accuracy of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
- Europe > Austria > Vienna (0.14)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
Lee, Jungi, Park, Junyong, Cha, Soohyun, Cho, Jaehoon, Sim, Jaewoong
Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.
- Asia > South Korea > Seoul > Seoul (0.78)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Jordan (0.04)
Pretraining Large Language Models with NVFP4
NVIDIA, Abecassis, Felix, Agrusa, Anjulie, Ahn, Dong, Alben, Jonah, Alborghetti, Stefania, Andersch, Michael, Arayandi, Sivakumar, Bjorlin, Alexis, Blakeman, Aaron, Briones, Evan, Buck, Ian, Catanzaro, Bryan, Choi, Jinhang, Chrzanowski, Mike, Chung, Eric, Cui, Victor, Dai, Steve, Rouhani, Bita Darvish, del Mundo, Carlo, Donia, Deena, Eryilmaz, Burc, Estela, Henry, Goel, Abhinav, Goncharov, Oleg, Guvvala, Yugi, Hesse, Robert, Hewett, Russell, Hum, Herbert, Kapasi, Ujval, Khailany, Brucek, Khona, Mikail, Knight, Nick, Kondratenko, Alex, Krashinsky, Ronny, Lanir, Ben, Layton, Simon, Lightstone, Michael, Lo, Daniel, Micikevicius, Paulius, Mishra, Asit, Moon, Tim, Narayanan, Deepak, Ni, Chao, Paithankar, Abhijit, Pasumarthi, Satish, Patel, Ankit, Patwary, Mostofa, Poojary, Ashwin, Prasad, Gargi, Priyadarshi, Sweta, Qin, Yigong, Ren, Xiaowei, Rybakov, Oleg, Sakr, Charbel, Satheesh, Sanjeev, Sergienko, Stas, Shamis, Pasha, Shankar, Kirthi, Sharma, Nishant, Shoeybi, Mohammad, Siu, Michael, Smelyanskiy, Misha, Stosic, Darko, Stosic, Dusan, Su, Bor-Yiing, Sun, Frank, Tajbakhsh, Nima, Thomas, Shelby, Tredak, Przemek, Tsykunov, Evgeny, Vaithilingam, Gandhi, Vavre, Aditya, Venkatesan, Rangharajan, Waleffe, Roger, Wan, Qiyu, Wang, Hexin, Wang, Mengdi, Wei, Lizzie, Wu, Hao, Wu, Evan, Wyss, Keith, Xu, Ning, Xue, Jinze, Yang, Charlene, Zhai, Yujia, Zhang, Ruoxi, Zhu, Jingyang, Zhu, Zhongbo
Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
- North America > United States (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Training LLMs with MXFP4
Tseng, Albert, Yu, Tao, Park, Youngsuk
Low precision (LP) datatypes such as MXFP4 can accelerate matrix multiplications (GEMMs) and reduce training costs. However, directly using MXFP4 instead of BF16 during training significantly degrades model quality. In this work, we present the first near-lossless training recipe that uses MXFP4 GEMMs, which are $2\times$ faster than FP8 on supported hardware. Our key insight is to compute unbiased gradient estimates with stochastic rounding (SR), resulting in more accurate model updates. However, directly applying SR to MXFP4 can result in high variance from block-level outliers, harming convergence. To overcome this, we use the random Hadamard transform to theoretically bound the variance of SR. We train GPT models up to 6.7B parameters and find that our method induces minimal degradation over mixed-precision BF16 training. Our recipe computes $>1/2$ the training FLOPs in MXFP4, enabling an estimated speedup of $>1.3\times$ over FP8 and $>1.7\times$ over BF16 during backpropagation.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Thailand (0.04)
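The role of stochastic rounding (SR) in the entry above can be sketched in a few lines of pure Python. This is a generic illustration of why SR is unbiased, not the paper's training recipe: the function name `stochastic_round`, the signed FP4 (E2M1) grid, and the sample count are all assumptions chosen for the example.

```python
import random

# Signed FP4 (E2M1) grid, used here only as an example set of levels.
_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_SIGNED = sorted({-g for g in _POS} | set(_POS))

def stochastic_round(x, grid, rng):
    """Round x to a neighbouring grid point so that the expected result equals x."""
    x = min(max(x, grid[0]), grid[-1])      # clamp into the representable range
    lo = max(g for g in grid if g <= x)     # nearest level at or below x
    hi = min(g for g in grid if g >= x)     # nearest level at or above x
    if lo == hi:
        return lo
    p_up = (x - lo) / (hi - lo)             # closer to hi means more likely to round up
    return hi if rng.random() < p_up else lo

rng = random.Random(0)  # fixed seed so the experiment is repeatable
x = 2.3                 # lies between the levels 2.0 and 3.0
samples = [stochastic_round(x, FP4_SIGNED, rng) for _ in range(20000)]
mean_sr = sum(samples) / len(samples)
nearest = min(FP4_SIGNED, key=lambda g: abs(g - x))

print(nearest)  # 2.0
print(abs(mean_sr - x) < 0.02)  # the SR average stays close to 2.3
```

Round-to-nearest always returns 2.0 here, a fixed bias of -0.3, whereas the SR average converges to the input. The price is variance per sample, which grows with the gap between neighbouring levels; that is the quantity the abstract says the random Hadamard transform is used to bound.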
Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models
Lo, Yun-Chen, Wei, Gu-Yeon, Brooks, David
As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges. Recently, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm have proposed a Microscaling standard (Mx), which augments block floating-point with microexponents to achieve promising perplexity-to-footprint trade-offs. However, Microscaling suffers from significant perplexity degradation on modern LLMs at fewer than six bits. This paper profiles modern LLMs and identifies three main challenges of the low-bit Microscaling format, i.e., inaccurate tracking of outliers, vacant quantization levels, and wasted binary code. In response, Nanoscaling (NxFP) proposes three techniques, i.e., NanoMantissa, Adaptive Microexponent, and Code Recycling, to enable better accuracy and a smaller memory footprint than state-of-the-art MxFP. Experimental results on direct-cast inference across various modern LLMs demonstrate that our proposed methods outperform state-of-the-art MxFP by up to 0.64 in perplexity and by up to 30% in accuracy on MMLU benchmarks. Furthermore, NxFP reduces memory footprint by up to 16% while achieving perplexity comparable to MxFP.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Jordan (0.04)