AITopics | validation loss

Collaborating Authors

validation loss

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Influence Functions for Edge Edits in Non-Convex Graph Neural Networks

Neural Information Processing SystemsJun-23-2026, 02:38:23 GMT

Understanding how individual edges influence the behavior of graph neural networks (GNNs) is essential for improving their interpretability and robustness. Graph influence functions have emerged as promising tools to efficiently estimate the effects of edge deletions without retraining. However, existing influence prediction methods rely on strict convexity assumptions, exclusively consider the influence of edge deletions while disregarding edge insertions, and fail to capture changes in message propagation caused by these modifications. In this work, we propose a proximal Bregman response function specifically tailored for GNNs, relaxing the convexity requirement and enabling accurate influence prediction for standard neural network architectures. Furthermore, our method explicitly accounts for message propagation effects and extends influence prediction to both edge deletions and insertions in a principled way. Experiments with real-world datasets demonstrate accurate influence predictions for different characteristics of GNNs. We further demonstrate that the influence function is versatile in applications such as graph rewiring and adversarial attacks.

artificial intelligence, influence function, machine learning, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology > Security & Privacy (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Scaling Law with Learning Rate Annealing

Neural Information Processing SystemsJun-19-2026, 04:46:34 GMT

We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: L(s) = L0 +A S α1 C S2, where L(s)is the validation loss at step s, S1 is the area under the LR curve, S2 is the LR annealing area, and L0, A, C, αare constant parameters.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Workflow (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Scale-invariant attention

Neural Information Processing SystemsJun-15-2026, 02:16:47 GMT

One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective longcontext attention mechanisms to have: scale-invariant total attention, and scaleinvariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Diffusion Beats Autoregressive in Data-Constrained Settings

Neural Information Processing SystemsJun-14-2026, 19:01:52 GMT

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings--where training involves repeated passes over limited data--and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closedform expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-toright factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.

diffusion model, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Lau, Tim Tsz-Kit, Su, Weijie

arXiv.org Machine LearningMay-27-2026

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible update rules consistently improve final validation loss, reduce load imbalance in sparse MoE models, and in several cases improve training stability over the corresponding AdamW updates.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2605.18106

Country: North America > United States > Pennsylvania (0.27)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Anytime Training with Schedule-Free Spectral Optimization

Apte, Anuj, Deshpande, Pranav, Kumar, Niraj, Chakrabarti, Shouvanik, Kim, Junhyung Lyle

arXiv.org Machine LearningMay-25-2026

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

decay, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2605.23061

Country: North America > United States (0.92)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

159f7fe5b51ecd663b85337e8e28ce65-Supplemental-Conference.pdf

Neural Information Processing SystemsMay-1-2026, 01:42:34 GMT

artificial intelligence, machine learning, nre, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

3953630da28e5181cffca1278517e3cf-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 12:06:15 GMT

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Forecasting Multivariate Time Series under Predictive Heterogeneity: A Validation-Driven Clustering Framework

Ma, Ziling, Oriona, Ángel López, Ombao, Hernando, Sun, Ying

arXiv.org Machine LearningApr-16-2026

We study adaptive pooling under predictive heterogeneity in high-dimensional multivariate time series forecasting, where global models improve statistical efficiency but may fail to capture heterogeneous predictive structure, while naive specialization can induce negative transfer. We formulate adaptive pooling as a statistical decision problem and propose a validation-driven framework that determines when and how specialization should be applied. Rather than grouping series based on representation similarity, we define partitions through out-of-sample predictive performance, thereby aligning data organization with predictive risk, defined as expected out-of-sample loss and approximated via validation error. Cluster assignments are iteratively updated using validation losses for both point (Huber) and probabilistic (pinball) forecasting, improving robustness to heavy-tailed errors and local anomalies. To ensure reliability, we introduce a leakage-free fallback mechanism that reverts to a global model whenever specialization fails to improve validation performance, providing a safeguard against performance degradation under a strict training-validation-test protocol. Experiments on large-scale traffic datasets demonstrate consistent improvements over strong baselines while avoiding degradation when heterogeneity is weak. Overall, the proposed framework provides a principled and practically reliable approach to adaptive pooling in high-dimensional forecasting problems.

data mining, forecasting, machine learning, (19 more...)

arXiv.org Machine Learning

2604.13748

Country:

North America > United States > California (0.04)
Asia > Middle East > Saudi Arabia (0.04)
Asia > India (0.04)

Genre: Research Report (0.50)

Industry: Energy (0.93)

Technology:

Information Technology > Modeling & Simulation (0.89)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
(2 more...)

Add feedback

Filters

Collaborating Authors

validation loss

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Influence Functions for Edge Edits in Non-Convex Graph Neural Networks

Scaling Law with Learning Rate Annealing

Scale-invariant attention

Diffusion Beats Autoregressive in Data-Constrained Settings

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Anytime Training with Schedule-Free Spectral Optimization

159f7fe5b51ecd663b85337e8e28ce65-Supplemental-Conference.pdf

3953630da28e5181cffca1278517e3cf-Supplemental.pdf

2130eb640e0a272898a51da41363542d-Supplemental.pdf

Forecasting Multivariate Time Series under Predictive Heterogeneity: A Validation-Driven Clustering Framework