AITopics | logit

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
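The absorption effect described in this abstract can be reproduced in a few lines. The following is a minimal sketch, not the paper's code: the helper name `softmax_ce_grad` and the logit values are assumptions chosen for illustration. It computes the softmax cross-entropy gradient with respect to the logits (`p - onehot`) in float32 and float64 for a high-confidence example.

```python
import numpy as np

def softmax_ce_grad(logits, target, dtype):
    """Gradient of softmax cross-entropy w.r.t. the logits, computed in `dtype`."""
    z = np.asarray(logits, dtype=dtype)
    e = np.exp(z - z.max())      # numerically stable softmax numerator
    p = e / e.sum()              # the small exp terms can be absorbed by the 1.0 in the sum
    onehot = np.zeros_like(p)
    onehot[target] = 1
    return p - onehot

# Illustrative logits (an assumption): the correct class leads by a margin of 20.
logits = [20.0, 0.0, 0.0]
g32 = softmax_ce_grad(logits, 0, np.float32)
g64 = softmax_ce_grad(logits, 0, np.float64)

print("float32 grad:", g32, "sum:", g32.sum())
print("float64 grad:", g64, "sum:", g64.sum())
```

In float32, `exp(-20)` is about 2e-9, which is below half the spacing of floats just above 1.0 (about 6e-8), so the softmax denominator rounds to exactly 1 and the correct-class gradient becomes exactly zero. The incorrect-class gradients stay positive, so the gradient entries no longer sum to zero. In float64 the same logits still give a slightly negative correct-class gradient and a sum that is zero up to rounding.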

artificial intelligence, machine learning, spike, (17 more...)

arXiv.org Machine Learning

2605.06152

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Kiyohara, Haruka, Curmei, Mihaela, Evnine, Ariel, Kalyanaraman, Shankar, Nir, Israel, Pop, Ana-Roxana, Razin, Nitzan, Dean, Sarah, Joachims, Thorsten, Weinsberg, Udi

arXiv.org Machine Learning, May-27-2026

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
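To make the distinction in this abstract concrete, the sketch below contrasts the two quantities a policy gradient could differentiate under a Plackett-Luce model: the joint probability of one specific ordered candidate set, and the marginal probability that a target item appears in any candidate set. It is a toy illustration under stated assumptions, not the paper's estimator: the function names, the scores, and the brute-force enumeration (feasible only for a handful of items) are all hypothetical.

```python
import itertools
import math

def pl_prob(scores, ordering):
    """Plackett-Luce probability of drawing `ordering` (item ids) without replacement."""
    weights = [math.exp(s) for s in scores]
    remaining = sum(weights)
    prob = 1.0
    for item in ordering:
        prob *= weights[item] / remaining
        remaining -= weights[item]   # drawn items leave the pool
    return prob

def inclusion_prob(scores, item, k):
    """Marginal probability that `item` is in a size-k candidate set.

    Sums the joint probability over every ordered candidate set containing `item`.
    """
    n = len(scores)
    return sum(
        pl_prob(scores, o)
        for o in itertools.permutations(range(n), k)
        if item in o
    )

scores = [1.0, 0.5, 0.0, -0.5]   # hypothetical early-stage ranker scores
k = 2                            # candidate-set size

joint = pl_prob(scores, (0, 1))  # probability of one specific candidate set
marginals = [inclusion_prob(scores, i, k) for i in range(len(scores))]
total = sum(pl_prob(scores, o) for o in itertools.permutations(range(len(scores)), k))
print(joint, marginals)
```

The joint probability depends on the full composition of the candidate set, while each marginal depends only on whether the item was selected. The marginals sum to `k` because every candidate set contains exactly `k` items.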

large language model, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2605.26385

Country: Asia (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

Sabbaghi, Mahdi, Pappas, George, Javanmard, Adel, Hassani, Hamed

arXiv.org Machine Learning, May-15-2026

Supervised fine-tuning (SFT) provides the standard approach for teaching LLMs new behaviors from offline expert demonstrations. However, standard SFT uniformly fits all samples -- including those with low likelihood under the base model -- which can disproportionately drive training updates toward overfitting specific samples rather than learning the target behavior. Moreover, adapting to these unlikely samples induces substantial policy shifts that degrade prior capabilities. Existing methods mitigate this by filtering, regenerating, or down-weighting low-likelihood data. In doing so, they often suppress precisely the novel behaviors the base model has yet to learn. We propose InfoSFT, a principled weighting scheme for the SFT objective that concentrates learning signals on maximally informative, medium-confidence tokens -- those neither overly familiar to the base model nor too unlikely to cause instability. Requiring only a one-line modification to the standard token-wise loss, InfoSFT demonstrably improves generalization over vanilla SFT and likelihood-weighted baselines across math, code, and chain-of-thought tasks with diverse model families, while better preserving pre-existing capabilities.
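The idea of reweighting the token-wise loss can be sketched as follows. The abstract does not state the InfoSFT weight, so the weight used here, `4 * p * (1 - p)`, is a stand-in chosen only because it peaks at medium confidence; the function names and token probabilities are likewise hypothetical. In a real training loop the weights would typically be treated as constants when differentiating.

```python
import numpy as np

def weighted_token_nll(token_probs, weight_fn):
    """Weighted average of per-token negative log-likelihoods."""
    p = np.asarray(token_probs, dtype=np.float64)
    nll = -np.log(p)
    w = weight_fn(p)
    return float(np.sum(w * nll) / np.sum(w))

def bump_weight(p):
    # Hypothetical weight, NOT the InfoSFT formula: largest at p = 0.5,
    # near zero for very familiar (p -> 1) and very unlikely (p -> 0) tokens.
    return 4.0 * p * (1.0 - p)

# Base-model probabilities of three target tokens (illustrative values).
probs = [0.99, 0.5, 0.01]

uniform_loss = weighted_token_nll(probs, np.ones_like)   # standard SFT average
weighted_loss = weighted_token_nll(probs, bump_weight)   # medium-confidence focus
print(uniform_loss, weighted_loss)
```

Under uniform weighting the loss is dominated by the very unlikely token; with the stand-in weight, most of the signal comes from the medium-confidence token.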

large language model, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2605.14967

Country: North America > United States > California (0.28)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

e271e30de7a2e462ca1f85cefa816380-Paper-Conference.pdf

Neural Information Processing Systems, May-1-2026, 04:57:21 GMT

artificial intelligence, calibration, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

b956d55b4d15eb3f024c67f8415822e4-Paper-Conference.pdf

Neural Information Processing Systems, Apr-30-2026, 01:35:41 GMT

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Asia (0.93)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

93712c59f6a81bd92040facf04c8b308-Paper-Conference.pdf

Neural Information Processing Systems, Apr-29-2026, 01:32:34 GMT

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

477bdb55b231264bb53a7942fd84254d-Paper.pdf

Neural Information Processing Systems, Apr-25-2026, 16:55:31 GMT

artificial intelligence, machine learning, regression, (19 more...)

Neural Information Processing Systems

Country: Asia (0.14)

Genre: Research Report > New Finding (0.47)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

Imitating Deep Learning Dynamics via Locally Elastic Stochastic Differential Equations

Neural Information Processing Systems, Apr-25-2026, 09:39:50 GMT

Understanding the training dynamics of deep learning models is perhaps a necessary step toward demystifying the effectiveness of these models. In particular, how do data from different classes gradually become separable in their feature spaces when training neural networks using stochastic gradient descent?

artificial intelligence, machine learning, neural network, (13 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

[Figure text: CNN + RPN; Classification, Regression, Mask; Input Image; Long-Tailed Object Detection; Post-Processing Calibration; example classes Bulldozer, School bus, Truck; NORCAL]

Neural Information Processing Systems, Apr-24-2026, 19:55:18 GMT

Vanilla models for object detection and instance segmentation suffer from the heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this issue mostly during training, e.g., by re-sampling or reweighting. In this paper, we investigate a largely overlooked approach -- postprocessing calibration of confidence scores. We propose NORCAL, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweighs the predicted scores of each class by its training sample size. We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NORCAL can effectively improve nearly all the baseline models not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into various modeling choices and mechanisms of our approach. Our code is publicly available at https://github.com/tydpan/NorCal.
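The recipe summarized in this abstract (reweigh each class score by its training sample size, treat the background class separately, then normalize over classes per proposal) can be sketched as below. This is an illustration under assumptions, not the released NorCal code: the function name, the exponent `gamma`, the choice to place background in the last column, and the example numbers are all hypothetical.

```python
import numpy as np

def calibrate_scores(scores, class_counts, gamma=0.5):
    """Post-hoc calibration of per-proposal class scores.

    scores: array of shape (num_proposals, num_classes + 1); the last column
            is assumed to be the background score.
    class_counts: training sample size of each foreground class.
    gamma: assumed strength of the reweighting (0 leaves scores unchanged).
    """
    scores = np.asarray(scores, dtype=np.float64)
    counts = np.asarray(class_counts, dtype=np.float64)
    fg = scores[:, :-1] / counts ** gamma        # down-weight frequent classes
    bg = scores[:, -1:]                          # background is not reweighted
    denom = fg.sum(axis=1, keepdims=True) + bg   # normalize over classes per proposal
    return np.concatenate([fg, bg], axis=1) / denom

# One proposal, two foreground classes (frequent, rare) plus background.
scores = [[0.5, 0.3, 0.2]]
class_counts = [1000, 10]
calibrated = calibrate_scores(scores, class_counts, gamma=0.5)
print(calibrated)
```

Before calibration the frequent class outscores the rare one (0.5 vs 0.3); afterwards the order is reversed, and each row still sums to 1 because of the per-proposal normalization.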

artificial intelligence, machine learning, segmentation, (19 more...)

Neural Information Processing Systems

Technology: