Formal that "Floats" High: Formal Verification of Floating Point Arithmetic
Mohanty, Hansa, Viswambharan, Vaisakh Naduvodi, Gadde, Deepak Narayan
Formal verification of floating-point arithmetic remains challenging due to non-linear arithmetic behavior and the tight coupling between control and datapath logic. Existing approaches often rely on high-level C models for equivalence checking against Register Transfer Level (RTL) designs, but this introduces abstraction gaps and translation overhead, and limits scalability at the RTL level. To address these challenges, this paper presents a scalable methodology for verifying floating-point arithmetic using direct RTL-to-RTL model checking against a golden reference model. The approach adopts a divide-and-conquer strategy that decomposes verification into modular stages, each captured by helper assertions and lemmas that collectively prove a main correctness theorem. Counterexample (CEX)-guided refinement is used to iteratively localize and resolve implementation defects, while targeted fault injection validates the robustness of the verification process against precision-critical datapath errors. To assess scalability and practicality, the methodology is extended with agentic AI-based formal property generation, integrating large language model (LLM)-driven automation with Human-in-the-Loop (HITL) refinement. Coverage analysis evaluates the effectiveness of the approach by comparing handwritten and AI-generated properties in both RTL-to-RTL model checking and standalone RTL verification settings. Results show that direct RTL-to-RTL model checking achieves higher coverage efficiency and requires fewer assertions than standalone verification, especially when combined with AI-generated properties refined through HITL guidance.
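A toy illustration of the verification pattern described above (a sketch under assumptions, not the paper's tool flow): exhaustive equivalence checking of a "DUT" adder against a golden adder over an assumed 8-bit float format, plus fault injection to confirm the check yields a counterexample. The format parameters, function names, and the stuck-at fault site are all illustrative.

```python
import bisect

# Assumed toy format: 1 sign, 4 exponent, 3 mantissa bits; no inf/NaN.
EXP_BITS, MAN_BITS, BIAS = 4, 3, 7

def decode(bits: int) -> float:
    """Value of an 8-bit pattern in the toy format."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> MAN_BITS) & 0xF
    man = bits & 0x7
    if exp == 0:                                   # subnormal
        return sign * man * 2.0 ** (1 - BIAS - MAN_BITS)
    return sign * (1 + man / 8.0) * 2.0 ** (exp - BIAS)

TABLE = [decode(b) for b in range(256)]            # every representable value
ORDER = sorted(range(256), key=lambda b: TABLE[b])
VALUES = [TABLE[b] for b in ORDER]

def nearest_bits(x: float) -> int:
    """Encoding of the nearest representable value (saturating at the ends)."""
    i = bisect.bisect_left(VALUES, x)
    cands = ORDER[max(0, i - 1): i + 1]
    return min(cands, key=lambda b: (abs(TABLE[b] - x), abs(TABLE[b])))

def golden_add(a: int, b: int) -> int:
    """Golden reference: exact sum, rounded to the format."""
    return nearest_bits(TABLE[a] + TABLE[b])

def dut_add(a: int, b: int, faulty: bool = False) -> int:
    r = golden_add(a, b)                           # stand-in for the DUT's RTL
    return (r & ~1) if faulty else r               # fault: mantissa LSB stuck at 0

# Exhaustive "model check" over all input pairs, with and without the fault.
for faulty in (False, True):
    cex = next(((a, b) for a in range(256) for b in range(256)
                if golden_add(a, b) != dut_add(a, b, faulty)), None)
    print(f"faulty={faulty}: " + ("PASS" if cex is None else f"CEX inputs {cex}"))
```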
Empirical and computer-aided robustness analysis of long-step and accelerated methods in smooth convex optimization
Vernimmen, Pierre, Glineur, François
This work assesses both empirically and theoretically, using the performance estimation methodology, how robust different first-order optimization methods are when subject to relative inexactness in their gradient computations. Relative inexactness occurs, for example, when compressing the gradient using fewer bits of information, which happens when dealing with large-scale problems on GPUs. Three major families of methods are analyzed: constant step gradient descent, long-step methods, and accelerated methods. The latter two are first shown to be theoretically not robust to inexactness. Then, a semi-heuristic shortening factor is introduced to improve their theoretical guarantees. All methods are subsequently tested on a concrete inexact problem, with two different types of relative inexactness, and it is observed that both accelerated methods are much more robust than expected, and that the shortening factor significantly helps the long-step methods. In the end, all shortened methods appear to be promising, even in this inexact setting.
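A minimal numerical sketch of the setting (illustrative; the paper's analysis uses the performance estimation methodology, not simulation): gradient descent on a random smooth convex quadratic with a gradient oracle subject to bounded relative error, comparing the standard step 1/L against a shortened step. The shortening factor (1 - eps) used here is an assumption for illustration, not the paper's semi-heuristic factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
H = A.T @ A / n + 0.1 * np.eye(n)       # SPD Hessian -> smooth, strongly convex
L = float(np.linalg.eigvalsh(H).max())  # smoothness constant
x_star = rng.standard_normal(n)         # chosen minimizer
b = H @ x_star

def f(x):
    return 0.5 * x @ H @ x - b @ x

def inexact_grad(x, eps):
    """Gradient with relative inexactness: ||g_tilde - g|| = eps * ||g||."""
    g = H @ x - b
    noise = rng.standard_normal(n)
    noise *= eps * np.linalg.norm(g) / np.linalg.norm(noise)
    return g + noise

def gd(step, eps, iters=500):
    x = np.zeros(n)
    for _ in range(iters):
        x -= step * inexact_grad(x, eps)
    return f(x) - f(x_star)             # optimality gap (>= 0)

eps = 0.3
print("step 1/L      :", gd(1.0 / L, eps))
print("shortened step:", gd((1.0 - eps) / L, eps))
```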
MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation
Natesh, Vikas, Kung, H. T., Kong, David
We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical errors in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point), as the mantissa has a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks. In contrast to traditional sequential summation, our method significantly lowers numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.
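A simplified stand-in for the idea (this is not the MGS algorithm itself, which is more involved): partial sums kept in a low-precision float swamp small terms, while grouping terms by exponent and summing their integer mantissas exactly in a bounded-width accumulator, flushing before overflow, preserves them. The widths and the flush policy below are illustrative assumptions.

```python
import math
from collections import defaultdict

M_BITS = 3        # assumed mantissa width of the low-precision format
ACC_BITS = 16     # assumed width of the integer mantissa accumulator

def rnd(x: float) -> float:
    """Round x to M_BITS mantissa bits (round-to-nearest)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                        # x = m * 2**e, 0.5 <= |m| < 1
    return math.ldexp(round(m * (1 << M_BITS)) / (1 << M_BITS), e)

def seq_sum(terms):
    """Sequential accumulation with low-precision partial sums:
    small terms are swamped once the accumulator grows large."""
    acc = 0.0
    for t in terms:
        acc = rnd(acc + t)
    return acc

def split(x):
    """(integer mantissa incl. hidden bit, exponent) of a low-precision value."""
    m, e = math.frexp(x)
    return round(m * (1 << (M_BITS + 1))), e - (M_BITS + 1)

def grouped_sum(terms):
    """MGS-flavored accumulation: group by exponent, add integer mantissas
    exactly, and flush before the ACC_BITS accumulator could overflow."""
    groups = defaultdict(list)
    for t in terms:
        man, e = split(t)
        groups[e].append(man)
    total, limit = 0.0, (1 << (ACC_BITS - 1)) - 1
    for e, mans in groups.items():
        acc = 0
        for man in mans:
            if abs(acc + man) > limit:          # rare flush, at high precision
                total, acc = total + math.ldexp(acc, e), 0
            acc += man
        total += math.ldexp(acc, e)
    return total

terms = [rnd(t) for t in [1.0] + [1e-3] * 10_000]   # one big term, many small
print(f"exact {math.fsum(terms):.4f}  sequential {seq_sum(terms):.4f}  "
      f"grouped {grouped_sum(terms):.4f}")
```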
Scaling Laws for Floating Point Quantization Training
Sun, Xingwu, Li, Shuaipeng, Xie, Ruobing, Han, Weidong, Wu, Kan, Yang, Zhen, Li, Yixing, Wang, An, Li, Shuai, Xue, Jinbao, Cheng, Yu, Tao, Yangyu, Kang, Zhanhui, Xu, Chengzhong, Wang, Di, Jiang, Jie
Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, paying less attention to the constituents of floating-point quantization, and thus cannot fit LLM losses well in this scenario. In contrast, while floating-point quantization training is more commonly implemented in production, research on it has been relatively superficial. In this paper, we thoroughly explore the effects of floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the floating-point quantization training performance of LLMs. In addition to presenting an accurate unified scaling law for floating-point quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit widths, which is available for future reference by hardware manufacturers; (2) We discover the formation of a critical data size in low-precision LLM training. Training data beyond the critical data size degrades LLM performance; (3) The optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4 and 8 bits.
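A small fake-quantization sketch of the knobs the abstract studies (exponent bits, mantissa bits, and per-block scaling granularity), under assumptions: a toy ExMy format with no inf/NaN encodings, and Gaussian data rather than LLM tensors, so the best bit split here need not match the paper's LLM-scale findings.

```python
import numpy as np

def fake_quant_fp(x, e_bits, m_bits):
    """Round x to a toy ExMy float format (no inf/NaN; subnormals kept;
    magnitudes saturate at the largest normal)."""
    bias = 2 ** (e_bits - 1) - 1
    emax, emin = 2 ** e_bits - 1 - bias, 1 - bias
    max_rep = (2 - 2.0 ** -m_bits) * 2.0 ** emax
    mag = np.clip(np.abs(x), 0.0, max_rep)
    safe = np.where(mag > 0, mag, 1.0)              # avoid log2(0)
    e = np.clip(np.floor(np.log2(safe)), emin, emax)
    step = 2.0 ** (e - m_bits)                      # quantization step at exponent e
    return np.sign(x) * np.where(mag > 0, np.round(mag / step) * step, 0.0)

def block_quant(x, e_bits, m_bits, block=32):
    """Fake-quantize with one scale per block of `block` values
    (the scaling-factor granularity knob)."""
    bias = 2 ** (e_bits - 1) - 1
    max_rep = (2 - 2.0 ** -m_bits) * 2.0 ** (2 ** e_bits - 1 - bias)
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / max_rep
    scale = np.where(scale > 0, scale, 1.0)
    return (fake_quant_fp(xb / scale, e_bits, m_bits) * scale).reshape(x.shape)

x = np.random.default_rng(0).standard_normal(1 << 14)
for e_bits, m_bits in [(5, 2), (4, 3), (3, 4)]:     # same total bits, split differs
    mse = np.mean((x - block_quant(x, e_bits, m_bits)) ** 2)
    print(f"E{e_bits}M{m_bits}: MSE = {mse:.3e}")
```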
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
Hao, Yongchang, Cao, Yanshuai, Mou, Lili
The performance of neural networks improves when more parameters are used. However, model sizes are constrained by the available on-device memory during training and inference. Although techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
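A quick sketch of why entropy-coding floating-point components can be lossless yet effective, under assumptions: zlib stands in for a dedicated entropy coder, and Gaussian random values stand in for trained weights. Exponent bytes of float32 values are highly skewed and compress well; high mantissa bits are close to uniform and barely compress.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 16).astype(np.float32)   # stand-in for NN weights

bits = w.view(np.uint32)
exp_bytes = ((bits >> 23) & 0xFF).astype(np.uint8)    # the 8 exponent bits
man_bytes = ((bits >> 15) & 0xFF).astype(np.uint8)    # top 8 mantissa bits

for name, arr in (("exponent", exp_bytes), ("mantissa", man_bytes)):
    raw = arr.tobytes()
    ratio = len(zlib.compress(raw, 9)) / len(raw)     # lossless compression
    print(f"{name} bytes compress to {ratio:.2f}x of original size")
```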
Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks
Köster, Urs, Webb, Tristan, Wang, Xin, Nassar, Marcel, Bansal, Arjun K., Constable, William, Elibol, Oguz, Gray, Scott, Hall, Stewart, Hornof, Luke, Khosrowshahi, Amir, Kloss, Carey, Pai, Ruby J., Rao, Naveen
Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited-precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of the 32-bit floating point format for training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize the available dynamic range. We validate Flexpoint by training AlexNet [1], a deep residual network [2, 3], and a generative adversarial network [4], using a simulator implemented with the neon deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.
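A minimal sketch of the shared-exponent encoding (an illustration, not the Flexpoint implementation): int16 mantissas plus one exponent per tensor. Here the exponent is chosen from the current maximum magnitude, whereas Flexpoint adjusts it dynamically from tensor history to avoid overflows.

```python
import numpy as np

MANT_BITS = 16                      # flex16+5-style: 16-bit integer mantissas

def to_flex(x):
    """Encode a tensor as int16 mantissas plus one shared exponent,
    chosen so the current maximum magnitude just fits."""
    max_mag = float(np.max(np.abs(x)))
    e = int(np.ceil(np.log2(max_mag))) - (MANT_BITS - 1) if max_mag > 0 else 0
    lo, hi = -(1 << (MANT_BITS - 1)), (1 << (MANT_BITS - 1)) - 1
    mant = np.clip(np.round(x / 2.0 ** e), lo, hi).astype(np.int16)  # clip guards
    return mant, e                                  # the power-of-two edge case

def from_flex(mant, e):
    return mant.astype(np.float64) * 2.0 ** e

x = np.random.default_rng(0).standard_normal(4096)
mant, e = to_flex(x)
err = np.max(np.abs(x - from_flex(mant, e)))
print(f"shared exponent {e}, max abs error {err:.2e}")
```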