AITopics

2606.17665

Country:

North America > Canada (0.45)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.45)
Europe > France (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Neural Information Processing SystemsJun-15-2026, 11:42:25 GMT

Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks

We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models. Stochastic Gradient Descent (SGD) is the core optimization tool driving modern machine learning. Recent years have seen substantial progress in understanding its dynamics, particularly in two-layer networks [Saad and Solla, 1995, Mei et al., 2018, Chizat and Bach, 2018, Rotskoff and VandenEijnden, 2022, Sirignano and Spiliopoulos, 2020, Arnaboldi et al., 2023a]. While global convergence is qualitatively well-understood when the network is wide enough, quantitative results are scarcer. A particularly fruitful body of recent theoretical work addressing this gap has focused on deriving precise convergence rates for particular model classes on synthetic data, such as high-dimensional Gaussian single and multi-index models [Ben Arous et al., 2021, Abbe et al., 2022, 2023].

artificial intelligence, machine learning, sie, (17 more...)

Country:

Europe > France (0.46)
North America > United States (0.45)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.25)

Genre: Research Report > Experimental Study (1.00)

Industry: Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Ricci, Fabiola, Merger, Claudia, Goldt, Sebastian

A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights

arXiv.org Machine LearningMay-19-2026

Neural networks trained with gradient-based methods exhibit a strong simplicity bias: they learn simpler statistical features of their data before moving to more complex features. Previous analyses of this phenomenon have largely focused on settings with (quasi-)isotropic inputs. In this work, we study the simplicity bias from a Fourier perspective, which allows us to include two key features of natural images in the analysis: approximate translation-invariance and power-law spectra. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information -- related to pair-wise correlations between pixels -- before exploiting phase information, which encodes edges and higher-order correlations. In view of this, we introduce a synthetic data model for translation-invariant inputs that allows precise control over amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online stochastic gradient descent (SGD) cannot distinguish the structured inputs from noise within $n \ll N^3$ steps, but needs at least $n \gg N^3 \log^2{N}$ steps. In contrast, we show both experimentally and theoretically that power-law spectra can dramatically accelerate the speed of learning phase information, even if the spectra do not help with classification. Simulations with two-layer networks trained on textures and with deep convolutional networks on ImageNet and CIFAR100 confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insights into how deep neural networks can learn natural image distributions efficiently.

artificial intelligence, fourier data model, machine learning, (15 more...)

2605.16913

Country: Europe (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

Awano, Ryoya, Suzuki, Taiji

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

arXiv.org Machine LearningMay-14-2026

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $κ$. We prove that the strong model efficiently learns task $κ$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.

high probability, machine learning, natural language, (17 more...)

2605.12908

Country: Asia (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsApr-24-2026, 05:31:37 GMT

Smoothing the Landscape Boosts the Signal for SGD Optimal Sample Complexity for Learning Single Index Models

We focus on the task of learning a single index model σ(w x) with respect to the isotropic Gaussian distribution in d dimensions. Prior work has shown that the sample complexity of learning w is governed by the information exponent k of the link function σ, which is defined as the index of the first nonzero Hermite coefficient of σ.

artificial intelligence, gradient descent, machine learning, (13 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

Neural Information Processing SystemsDec-27-2025, 16:54:36 GMT

Smoothing the Landscape Boosts the Signal for SGD Optimal Sample Complexity for Learning Single Index Models

The outline of our paper is as follows.

algorithm, gradient descent, sample complexity, (11 more...)

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
North America > United States (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsOct-8-2025, 06:31:37 GMT

On Single Index Models beyond Gaussian Data

In essence, these tools exploit the stability and spherical symmetry of Gaussian distributions.

assumption 4, landscape, proposition 4, (14 more...)

Country: Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

Arnaboldi, Luca, Loureiro, Bruno, Stephan, Ludovic, Krzakala, Florent, Zdeborova, Lenka

Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks

arXiv.org Machine LearningJun-5-2025

artificial intelligence, machine learning, natural language, (16 more...)

2506.02651

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
Europe > Switzerland > Vaud > Lausanne (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Jones-McCormick, Taj, Jagannath, Aukosh, Sen, Subhabrata

Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

arXiv.org Machine LearningFeb-24-2025

Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.

initialization, sample complexity, theorem 3, (14 more...)

2502.16849

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Machine LearningJun-4-2024

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Arnaboldi, Luca, Dandi, Yatin, Krzakala, Florent, Loureiro, Bruno, Pesce, Luca, Stephan, Ludovic

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, \textit{Correlation loss SGD}, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.

batch size, correlation loss sgd, sgd, (12 more...)

2406.02157

Country:

North America > United States (0.14)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
Europe > Switzerland > Vaud > Lausanne (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Education > Educational Setting > Online (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)