AITopics | decay

Collaborating Authors

decay

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Anytime Training with Schedule-Free Spectral Optimization

Apte, Anuj, Deshpande, Pranav, Kumar, Niraj, Chakrabarti, Shouvanik, Kim, Junhyung Lyle

arXiv.org Machine LearningMay-25-2026

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

decay, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2605.23061

Country: North America > United States (0.92)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

Young, Robin

arXiv.org Machine LearningMay-22-2026

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.

artificial intelligence, machine learning, variance, (20 more...)

arXiv.org Machine Learning

2605.21798

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (1.00)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Modeling & Simulation (0.70)

Add feedback

Supplementary materials for Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing Anonymous Author(s) Affiliation Address email AAdditional graphs from outlier analysis1

Neural Information Processing SystemsApr-30-2026, 05:24:42 GMT

Figure 1: A summary of several outlier statistics recorded from ImageNet validation set on ViT. We use zero-based indexing for dimensions. BERTRecall from Figure 1 that all the outliers are only present in hidden dimensions #123, #180,4 #225, #308, #381, #526, #720 (with the majority of them in #180, #720). In Figures 9 and 10 we show more6 examples of the discovered self-attention patterns for attention heads #3 and #12 ( hidden dim #1807 and #720, respectively). We also show self-attention patterns in attention heads and layers which are8 not associated with the outliers in Figures 11 and 12, respectively.9

artificial intelligence, attention layer, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Supplementary material to Generalization Error Rates in Kernel Ridge Regression The Crossover from the Noiseless to Noisy Regime of the decays

Neural Information Processing SystemsApr-25-2026, 22:47:04 GMT

A.1 Equations for Gaussian design In this Appendix we discuss the derivation of eqs. Exact asymptotic formulas for the excess prediction error of least-squares and ridge regression are a classic result in high-dimensional statistics, and have been derived in many different works [23, 32, 52, 53]. In this manuscript, we follow the presentation given in [25], which is particularly adapted to our derivation and has the advantage to hold rigorously at large but finite number of samples nand features p. We start by reviewing the formulas in [25]. Note that the risk considered in eq.

artificial intelligence, machine learning, regime, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

543bec10c8325987595fcdc492a525f4-Paper.pdf

Neural Information Processing SystemsApr-25-2026, 22:47:00 GMT

artificial intelligence, machine learning, regime, (16 more...)

Neural Information Processing Systems

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

How Data Augmentation affects Optimization for Linear Regression

Neural Information Processing SystemsApr-25-2026, 15:44:40 GMT

Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent.

artificial intelligence, augmentation, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.67)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

Add feedback

Self-Adaptable Point Processes with Nonparametric Time Decays

Neural Information Processing SystemsApr-25-2026, 03:46:42 GMT

Many applications involve multi-type event data. Understanding the complex influences of the events on each other is critical to discover useful knowledge and to predict future events and their types. Existing methods either ignore or partially account for these influences. Recent works use recurrent neural networks to model the event rate. While being highly expressive, they couple all the temporal dependencies in a black-box and can hardly extract meaningful knowledge. More important, most methods assume an exponential time decay of the influence strength, which is over-simplified and can miss many important strength varying patterns.

artificial intelligence, event type, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > Utah (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Communications (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

0d5bd023a3ee11c7abca5b42a93c4866-Supplemental.pdf

Neural Information Processing SystemsApr-24-2026, 16:30:18 GMT

To compute the discrepancy term dst, we add a per-location domain classifier h tw ˆ . It W consti semantic tutes map corresponds to the either source or target domain. On the other hand, hˆ predicts the Bird-Eye View binary segmentation map. In figure 9.1 we show the Lift-Splat Adapt diagram. Our training strategy requires little modification to the original architecture, e.g.

agent, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.32)

Add feedback

Filters

Collaborating Authors

decay

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Anytime Training with Schedule-Free Spectral Optimization

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

Supplementary materials for Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing Anonymous Author(s) Affiliation Address email AAdditional graphs from outlier analysis1

Supplementary material to Generalization Error Rates in Kernel Ridge Regression The Crossover from the Noiseless to Noisy Regime of the decays

543bec10c8325987595fcdc492a525f4-Paper.pdf

How Data Augmentation affects Optimization for Linear Regression

Self-Adaptable Point Processes with Nonparametric Time Decays

0f580c1ace3b857a390575ca42de7938-Paper-Conference.pdf

0d5bd023a3ee11c7abca5b42a93c4866-Supplemental.pdf

03474669b759f6d38cdca6fb4eb905f4-Paper-Conference.pdf