overflow
NeuralSteiner: Learning Steiner Tree for Overflow-avoiding Global Routing in Chip Design
Global routing plays a critical role in modern chip design. The routing paths generated by global routers often form a rectilinear Steiner tree (RST). Recent advances from the machine learning community have shown the power of learning-based route generation; however, the yielded routing paths by the existing approaches often suffer from considerable overflow, thus greatly hindering their application in practice.We propose NeuralSteiner, an accurate approach to overflow-avoiding global routing in chip design. The key idea of NeuralSteiner approach is to learn Steiner trees: we first predict the locations of highly likely Steiner points by adopting a neural network considering full-net spatial and overflow information, then select appropriate points by running a graph-based post-processing algorithm, and finally connect these points with the input pins to yield overflow-avoiding RSTs. NeuralSteiner offers two advantages over previous learning-based models. First, by using the learning scheme, NeuralSteiner ensures the connectivity of generated routes while significantly reducing congestion. Second, NeuralSteiner can effectively scale to large nets and transfer to unseen chip designs without any modifications or fine-tuning. Extensive experiments over public large-scale benchmarks reveal that, compared with the state-of-the-art deep generative methods, NeuralSteiner achieves up to a 99.8\% reduction in overflow while speeding up the generation and maintaining a slight wirelength loss within only 1.8\%.
- Asia > China > Beijing > Beijing (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > China > Henan Province > Zhengzhou (0.04)
- (2 more...)
- Asia > China > Beijing > Beijing (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > China > Henan Province > Zhengzhou (0.04)
- (2 more...)
Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs
Kim, Bumjun, Jeon, Dongjae, Kim, Dueun, Jeung, Wonje, No, Albert
Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks. However, instruction-tuned dLLMs exhibit a critical vulnerability we term \texttt{
Power Transform Revisited: Numerically Stable, and Federated
Power transforms are popular parametric techniques for making data more Gaussian-like, and are widely used as preprocessing steps in statistical analysis and machine learning. However, we find that direct implementations of power transforms suffer from severe numerical instabilities, which can lead to incorrect results or even crashes. In this paper, we provide a comprehensive analysis of the sources of these instabilities and propose effective remedies. We further extend power transforms to the federated learning setting, addressing both numerical and distributional challenges that arise in this context. Experiments on real-world datasets demonstrate that our methods are both effective and robust, substantially improving stability compared to existing approaches.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Texas > Dallas County > Dallas (0.04)
- North America > United States > New York (0.04)
- (3 more...)
Overflow Prevention Enhances Long-Context Recurrent LLMs
Ben-Kish, Assaf, Zimerman, Itamar, Mirza, M. Jehanzeb, Wolf, Lior, Glass, James, Karlinsky, Leonid, Giryes, Raja
A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.30)
MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation
Natesh, Vikas, Kung, H. T., Kong, David
We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical errors in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point) as the mantissa has a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks. In contrast to traditional sequential summation, our method has significantly lowered numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1\% relative to conventional MAC units.
- Europe > Austria > Vienna (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- (5 more...)