AITopics

2605.17177

Country: North America > United States > New York (0.27)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Kim, Beomchang, Roy, Subhrajyoty, Das, Priyam

BOOOM: Loss-Function-Agnostic Black-Box Optimization over Orthonormal Manifolds for Machine Learning and Statistical Inference

arXiv.org Machine LearningMay-7-2026

Optimization over the Stiefel manifold $\mathrm{St}(p,d)$, the set of $p \times d$ column-orthonormal matrices, is fundamental in statistics, machine learning, and scientific computing, yet remains challenging in the presence of non-convex, non-smooth, or black-box objectives. Existing methods largely rely on either convex relaxations or gradient-based Riemannian optimization, limiting applicability in derivative-free and highly multimodal settings. We propose \textsc{BOOOM} (Black-box Optimization Over Orthonormal Manifolds), a general-purpose framework for loss-function-agnostic optimization on $\mathrm{St}(p,d)$. The key idea is a global Givens rotation-based parametrization that maps the manifold to an unconstrained Euclidean angle space while preserving feasibility exactly. Building on this representation, BOOOM employs a structured, parallelizable, derivative-free search based on Recursive Modified Pattern Search, enabling systematic exploration through plane-wise rotations without requiring gradient information and facilitating escape from poor local optima. We establish a unified theoretical framework showing equivalence between angle-space and manifold optimization, transfer of stationarity, and global convergence in probability under mild conditions. Empirical results across diverse problems, including heterogeneous quadratic optimization, low-rank and sparse matrix decomposition, independent component analysis, and orthogonal joint diagonalization, among other widely studied settings, demonstrate strong performance relative to state-of-the-art methods, particularly in non-smooth and highly multimodal regimes. We further illustrate its practical utility through a novel supervised PCA formulation applied to metabolomics data in colorectal cancer.

artificial intelligence, machine learning, optimization problem, (14 more...)

2605.04087

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Shigida, Boris, Hanin, Boris, Gromov, Andrey

Learning Rate Transfer in Normalized Transformers

arXiv.org Machine LearningMay-4-2026

The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.

artificial intelligence, initialization, machine learning, (14 more...)

2604.27077

Genre: Research Report (0.41)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Raviola, Matteo, Peherstorfer, Benjamin

A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions

arXiv.org Machine LearningMay-4-2026

Dirac-Frenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the parameter dynamics non-unique. We interpret this non-uniqueness as a gauge freedom: nullspace directions that leave the time derivative unchanged can be used to select better-conditioned parameter velocities. Building on Onsager's minimum-dissipation principle, we introduce a history variable -- interpretable as momentum -- and inject it only along the nullspace directions. The resulting Dirac-Frenkel-Onsager dynamics preserve instantaneous residual minimization, in contrast to standard regularization that can introduce bias, while promoting temporally smooth parameter evolutions. Examples demonstrate that the approach leads to increased robustness in singular and near-singular regimes.

artificial intelligence, machine learning, parameter velocity, (18 more...)

2605.00284

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsApr-30-2026, 19:56:40 GMT

00ac8ed3b4327bdd4ebbebcb2ba10a00-Paper.pdf

data mining, machine learning, neural information processing system, (16 more...)

Country:

North America (0.28)
Europe > Germany (0.28)

Industry:

Health & Medicine (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.94)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Communications (0.70)
(3 more...)

Neural Information Processing SystemsApr-29-2026, 10:56:40 GMT

Beyond Average Return in Markov Decision Processes

What are the functionals of the reward that can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation, and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL). DistRL permits, however, to evaluate other functionals approximately. We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations. These results contribute to advancing the theory of Markov Decision Processes by examining overall characteristics of the return, and particularly risk-conscious strategies.

dynamic programming, machine learning, reinforcement learning, (14 more...)

Country: Europe (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.90)

Neural Information Processing SystemsApr-24-2026, 12:11:10 GMT

Understanding End-to-End Model-Based Reinforcement Learning Methods as Implicit Parameterization

Estimating the per-state expected cumulative rewards is a critical aspect of reinforcement learning approaches, however the experience is obtained, but standard deep neural-network function-approximation methods are often inefficient in this setting. An alternative approach, exemplified by value iteration networks, is to learn transition and reward models of a latent Markov decision process whose value predictions fit the data. This approach has been shown empirically to converge faster to a more robust solution in many cases, but there has been little theoretical study of this phenomenon. In this paper, we explore such implicit representations of value functions via theory and focused experimentation. We prove that, for a linear parametrization, gradient descent converges to global optima despite nonlinearity and non-convexity introduced by the implicit representation. Furthermore, we derive convergence rates for both cases which allow us to identify conditions under which stochastic gradient descent (SGD) with this implicit representation converges substantially faster than its explicit counterpart. Finally, we provide empirical results in some simple domains that illustrate the theoretical findings.

machine learning, parameterization, reinforcement learning, (14 more...)

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Alexandr, Yulia, Duan, Hao, Montúfar, Guido

Algebraic Invariants of Lightning Self-Attention

arXiv.org Machine LearningApr-21-2026

We study the polynomial coefficients of lightning self-attention as coordinates of an algebraic variety. We identify linear and nonlinear families of algebraic invariants, including Chow-type, low-rank, Veronese-type, and Sylvester resultant-based constraints.

artificial intelligence, coefficient, machine learning, (18 more...)

2604.15632

Country: Europe > Germany > Saxony > Leipzig (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)

Zekri, Oussama, Uscidda, Théo, Boullé, Nicolas, Korba, Anna

Generalized Discrete Diffusion from Snapshots

arXiv.org Machine LearningMar-24-2026

We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page : \href{https://oussamazekri.fr/gdds}{https://oussamazekri.fr/gdds}.

large language model, machine learning, natural language, (20 more...)

2603.21342

Country:

Asia > Middle East > Saudi Arabia (0.04)
Asia > Middle East > Syria (0.04)
North America > United States > Illinois (0.04)
(11 more...)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law (0.92)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.87)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsMar-21-2026, 07:25:28 GMT

Group and Shuffle: Efficient Structured Orthogonal Parametrization

The increasing size of neural networks has led to a growing demand for methods of efficient finetuning. Recently, an orthogonal finetuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal finetuning framework, improving parameter efficiency.

artificial intelligence, machine learning, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.32)