AITopics

2511.02773

Country: North America > United States > Minnesota (0.27)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

arXiv.org Machine LearningNov-5-2025

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

Rangriz, Parsa

This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of the population loss, we focus on the critical scaling regime of the step size. Below this critical scale, the effective dynamics are governed by ballistic (ODE) limits, but at the critical scale, new correction term appears that changes the phase diagram. In this regime, near the fixed points, the corresponding diffusive (SDE) limits of the effective dynamics reduces to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrates the limitations of deterministic scaling limit in capturing the stochastic fluctuations of high-dimensional learning dynamics.

artificial intelligence, information exponent, machine learning, (13 more...)

2511.02258

Country: North America > United States > California (0.28)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Machine LearningNov-4-2025

Gradient Flow Sampler-based Distributionally Robust Optimization

Xu, Zusen, Zhu, Jia-Jie

We propose a mathematically principled PDE gradient flow framework for distributionally robust optimization (DRO). Exploiting the recent advances in the intersection of Markov Chain Monte Carlo sampling and gradient flow theory, we show that our theoretical framework can be implemented as practical algorithms for sampling from worst-case distributions and, consequently, DRO. While numerous previous works have proposed various reformulation techniques and iterative algorithms, we contribute a sound gradient flow view of the distributional optimization that can be used to construct new algorithms. As an example of applications, we solve a class of Wasserstein and Sinkhorn DRO problems using the recently-discovered Wasserstein Fisher-Rao and Stein variational gradient flows. Notably, we also show some simple reductions of our framework recover exactly previously proposed popular DRO methods, and provide new insights into their theoretical limit and optimization dynamics. Numerical studies based on stochastic gradient descent provide empirical backing for our theoretical findings.

algorithm, artificial intelligence, machine learning, (17 more...)

2510.25956

Country:

Asia > Middle East > Jordan (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

arXiv.org Artificial IntelligenceNov-4-2025

Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Wang, Fali, Chen, Jihai, Yang, Shuhua, Bao, Runxue, Zhao, Tianxiang, Zhang, Zhiwei, Tang, Xianfeng, Liu, Hui, He, Qi, Wang, Suhang

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.

large language model, machine learning, natural language, (19 more...)

2511.00086

Country:

Asia (0.92)
North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceNov-4-2025

LSHFed: Robust and Communication-Efficient Federated Learning with Locally-Sensitive Hashing Gradient Mapping

Cheng, Guanjie, Yang, Mengzhen, Zhao, Xinkui, Yu, Shuyi, Du, Tianyu, Wu, Yangyang, Zhu, Mengying, Deng, Shuiguang

Federated learning (FL) enables collaborative model training across distributed nodes without exposing raw data, but its decentralized nature makes it vulnerable in trust-deficient environments. Inference attacks may recover sensitive information from gradient updates, while poisoning attacks can degrade model performance or induce malicious behaviors. Existing defenses often suffer from high communication and computation costs, or limited detection precision. To address these issues, we propose LSHFed, a robust and communication-efficient FL framework that simultaneously enhances aggregation robustness and privacy preservation. At its core, LSHFed incorporates LSHGM, a novel gradient verification mechanism that projects high-dimensional gradients into compact binary representations via multi-hyperplane locally-sensitive hashing. This enables accurate detection and filtering of malicious gradients using only their irreversible hash forms, thus mitigating privacy leakage risks and substantially reducing transmission overhead. Extensive experiments demonstrate that LSHFed maintains high model performance even when up to 50% of participants are collusive adversaries while achieving up to a 1000x reduction in gradient verification communication compared to full-gradient methods.

artificial intelligence, federated learning, machine learning, (16 more...)

2511.01296

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Thomsen, Daniel Berg, Taylor, Adrien, Dieuleveut, Aymeric

Tight analyses of first-order methods with error feedback

arXiv.org Artificial IntelligenceNov-4-2025

Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in the simplified single-agent setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.

artificial intelligence, lyapunov function, machine learning, (16 more...)

2506.05271

Country: Europe > France (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.46)
(2 more...)

Busbib, Daniel, Wiesel, Ami

Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent

arXiv.org Machine LearningNov-4-2025

We consider covariance estimation under Toeplitz structure. Numerous sophisticated optimization methods have been developed to maximize the Gaussian log-likelihood under Toeplitz constraints. In contrast, recent advances in deep learning demonstrate the surprising power of simple gradient descent (GD) applied to overparameterized models. Motivated by this trend, we revisit Toeplitz covariance estimation through the lens of overparameterized GD. We model the $P\times P$ covariance as a sum of $K$ complex sinusoids with learnable parameters and optimize them via GD. We show that when $K = P$, GD may converge to suboptimal solutions. However, mild overparameterization ($K = 2P$ or $4P$) consistently enables global convergence from random initializations. We further propose an accelerated GD variant with separate learning rates for amplitudes and frequencies. When frequencies are fixed and only amplitudes are optimized, we prove that the optimization landscape is asymptotically benign and any stationary point recovers the true covariance. Finally, numerical experiments demonstrate that overparameterized GD can match or exceed the accuracy of state-of-the-art methods in challenging settings, while remaining simple and scalable.

artificial intelligence, estimation, machine learning, (17 more...)

2511.01605

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Genans, Ferdinand, Godichon-Baggioni, Antoine, Vialard, François-Xavier, Wintenberger, Olivier

Decreasing Entropic Regularization Averaged Gradient for Semi-Discrete Optimal Transport

arXiv.org Machine LearningNov-3-2025

Adding entropic regularization to Optimal Transport (OT) problems has become a standard approach for designing efficient and scalable solvers. However, regularization introduces a bias from the true solution. To mitigate this bias while still benefiting from the acceleration provided by regularization, a natural solver would adaptively decrease the regularization as it approaches the solution. Although some algorithms heuristically implement this idea, their theoretical guarantees and the extent of their acceleration compared to using a fixed regularization remain largely open. In the setting of semi-discrete OT, where the source measure is continuous and the target is discrete, we prove that decreasing the regularization can indeed accelerate convergence. To this end, we introduce DRAG: Decreasing (entropic) Regularization Averaged Gradient, a stochastic gradient descent algorithm where the regularization decreases with the number of optimization steps. We provide a theoretical analysis showing that DRAG benefits from decreasing regularization compared to a fixed scheme, achieving an unbiased $\mathcal{O}(1/t)$ sample and iteration complexity for both the OT cost and the potential estimation, and a $\mathcal{O}(1/\sqrt{t})$ rate for the OT map. Our theoretical findings are supported by numerical experiments that validate the effectiveness of DRAG and highlight its practical advantages.

artificial intelligence, machine learning, optimization problem, (16 more...)

2510.2734

Country:

North America > United States (0.04)
Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Le, Tuan, Moothedath, Shana

Byzantine Resilient Federated Multi-Task Representation Learning

arXiv.org Artificial IntelligenceNov-3-2025

In this paper, we propose BR-MTRL, a Byzantine-resilient multi-task representation learning framework that handles faulty or malicious agents. Our approach leverages representation learning through a shared neural network model, where all clients share fixed layers, except for a client-specific final layer. This structure captures shared features among clients while enabling individual adaptation, making it a promising approach for leveraging client data and computational power in heterogeneous federated settings to learn personalized models. To learn the model, we employ an alternating gradient descent strategy: each client optimizes its local model, updates its final layer, and sends estimates of the shared representation to a central server for aggregation. To defend against Byzantine agents, we employ two robust aggregation methods for client-server communication, Geometric Median and Krum. Our method enables personalized learning while maintaining resilience in distributed settings. We implemented the proposed algorithm in a federated testbed built using Amazon Web Services (AWS) platform and compared its performance with various benchmark algorithms and their variations. Through experiments using real-world datasets, including CIFAR-10 and FEMNIST, we demonstrated the effectiveness and robustness of our approach and its transferability to new unseen clients with limited data, even in the presence of Byzantine adversaries.

artificial intelligence, machine learning, representation, (16 more...)

2503.19209

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.47)
Information Technology > Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Vastola, John J., Gershman, Samuel J., Rajan, Kanaka

Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules

arXiv.org Artificial IntelligenceNov-3-2025

Learning rules -- prescriptions for updating model parameters to improve performance -- are typically assumed rather than derived. Why do some learning rules work better than others, and under what assumptions can a given rule be considered optimal? We propose a theoretical framework that casts learning rules as policies for navigating (partially observable) loss landscapes, and identifies optimal rules as solutions to an associated optimal control problem. A range of well-known rules emerge naturally within this framework under different assumptions: gradient descent from short-horizon optimization, momentum from longer-horizon planning, natural gradients from accounting for parameter space geometry, non-gradient rules from partial controllability, and adaptive optimizers like Adam from online Bayesian inference of loss landscape shape. We further show that continual learning strategies like weight resetting can be understood as optimal responses to task uncertainty. By unifying these phenomena under a single objective, our framework clarifies the computational structure of learning and offers a principled foundation for designing adaptive algorithms.

artificial intelligence, machine learning, objective, (17 more...)

2510.26997

Country: North America > United States (0.67)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)