AITopics

2606.28242

Country: North America > United States (0.14)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.64)

Neural Information Processing SystemsJun-23-2026, 04:47:54 GMT

Optimal Spectral Transitions in High-Dimensional Multi-Index Models

We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite its increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik-Ben Arous-Peche (BBP) transition in spiked models arising in random matrix theory.

artificial intelligence, eigenvalue, machine learning, (15 more...)

Country:

North America > United States (0.28)
Europe > France (0.28)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.24)

Genre: Research Report > Experimental Study (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)

Neural Information Processing SystemsJun-20-2026, 04:02:43 GMT

Bayes optimal learning of attention-indexed models

We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, that are key components of modern architectures.

artificial intelligence, machine learning, natural language, (19 more...)

Country: North America > United States (0.27)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Neural Information Processing SystemsJun-19-2026, 01:34:58 GMT

The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the ℓ2-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.

artificial intelligence, machine learning, regularization, (15 more...)

Country: Europe (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsJun-18-2026, 12:26:07 GMT

Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions

The Restricted Boltzmann Machine (RBM) is one of the simplest generative neural networks capable of learning input distributions. Despite its simplicity, the analysis of its performance in learning from the training data is only well understood in cases that essentially reduce to singular value decomposition of the data. Here, we consider the limit of a large dimension of the input space and a constant number of hidden units. In this limit, we simplify the standard RBM training objective into a form that is equivalent to the multi-index model with non-separable regularization. This opens a path to analyze training of the RBM using methods that are established for multi-index models, such as Approximate Message Passing (AMP) and its state evolution, and the analysis of Gradient Descent (GD) via the dynamical mean-field theory. We then give rigorous asymptotics of the training dynamics of RBMs on data generated by the spiked covariance model as a prototype of a structure suitable for unsupervised learning. We show in particular that RBMs reach the optimal computational weak recovery threshold, aligning with the Baik-Ben Arous-Péché (BBP) transition, in the spiked covariance model.

artificial intelligence, machine learning, rbm, (18 more...)

Country: Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.24)

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)

Moniri, Behrad, Hassani, Hamed

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

arXiv.org Machine LearningMay-25-2026

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent on the first layer weights in this regime, such one-step update schemes are fundamentally limited: the update to the weights is approximately rank-one, captures only a single direction, and requires the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the \textit{second step} of gradient descent with step-sizes $η_1\asymp N^{α_1}$ and $η_2 \asymp N^{α_2}$ for $α_1, α_2 \in [0,0.5)$, where $N$ is the number of hidden neurons. We derive a spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of the outliers is determined by the parameters $α_1, α_2$ through $\lfloor \frac{α_2}{1/2 - α_1} \rfloor$. Furthermore, by analyzing the alignment between the learned directions and the target function, we identify a gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, provided that $α_1, α_2$ are chosen properly. This shows that the benefits of batch reuse, previously observed in narrow-width regimes, persist in the linear-width limit as well. By characterizing these early-phase evolutions, our work proposes a tractable framework for studying optimization and feature learning phenomenology in modern overparameterized networks.

artificial intelligence, machine learning, neural network, (13 more...)

2605.17767

Country: North America > United States > Pennsylvania (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Wortsman-Zurich, Arie, Tabanelli, Hugo, Dandi, Yatin, Krzakala, Florent, Loureiro, Bruno

Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

arXiv.org Machine LearningMay-15-2026

We propose a simple mechanism by which scaling laws emerge from feature learning in multi-layer networks. We study a high-dimensional hierarchical target that is a globally high-degree function, but that can be represented by a combination of latent compositional features whose weights decrease as a power law. We show that a layer-wise spectral algorithm adapted to this compositional structure achieves improved scaling relative to shallow, non-adaptive methods, and recovers the latent directions sequentially: strong features become detectable at small sample sizes, while weaker features require more data. We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error. Technically, the analysis relies on random matrix methods and a resolvent-based perturbation argument, which gives matching upper and lower bounds for individual eigenvector recovery beyond what standard gap-based perturbation bounds provide. Numerical experiments confirm the predicted sequential recovery, finite-size smoothing of the thresholds, and separation from non-hierarchical kernel baselines. Together, these results show how smooth scaling laws can emerge from a cascade of sharp feature-learning transitions.

artificial intelligence, equation, machine learning, (15 more...)

2605.14567

Country:

Europe (0.67)
North America > United States (0.67)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Dandi, Yatin, Vilucchio, Matteo, Arnaboldi, Luca, Tabanelli, Hugo, Krzakala, Florent

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

arXiv.org Machine LearningMay-14-2026

Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity,and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery with real datasets.

artificial intelligence, machine learning, representation, (14 more...)

2605.13612

Country: North America (0.45)

Genre: Research Report > New Finding (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Nguyen, Minh-Toan, Barbier, Jean

Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

arXiv.org Machine LearningMay-13-2026

We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $ n^{1/(2β)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $β>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=Θ(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.

artificial intelligence, machine learning, regime, (15 more...)

2605.10395

Country: Europe (0.28)

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)