AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Parameter Estimation in Stochastic Differential Equations via Wiener Chaos Expansion and Stochastic Gradient Descent

Delgado-Vences, Francisco, Pavón-Español, José Julián, Ornelas, Arelly

arXiv.org Machine LearningMar-31-2026

This study addresses the inverse problem of parameter estimation for Stochastic Differential Equations (SDEs) by minimizing a regularized discrepancy functional via Stochastic Gradient Descent (SGD). To achieve computational efficiency, we leverage the Wiener Chaos Expansion (WCE), a spectral decomposition technique that projects the stochastic solution onto an orthogonal basis of Hermite polynomials. This transformation effectively maps the stochastic dynamics into a hierarchical system of deterministic functions, termed the \textit{propagator}. By reducing the stochastic inference task to a deterministic optimization problem, our framework circumvents the heavy computational burden and sampling requirements of traditional simulation-based methods like MCMC or MLE. The robustness and scalability of the proposed approach are demonstrated through numerical experiments on various non-linear SDEs, including models for individual biological growth. Results show that the WCE-SGD framework provides accurate parameter recovery even from discrete, noisy observations, offering a significant paradigm shift in the efficient modeling of complex stochastic systems.

artificial intelligence, machine learning, optimization problem, (13 more...)

arXiv.org Machine Learning

2603.27019

Country:

North America > Mexico (0.04)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.84)

Industry: Banking & Finance (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Minimax Generalized Cross-Entropy

Bondugula, Kartheek, Mazuelas, Santiago, Pérez, Aritz, Liu, Anqi

arXiv.org Machine LearningMar-26-2026

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performances with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradient computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.

accuracy, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2603.19874

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
North America > United States (0.04)
Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)
Africa > Middle East > Morocco > Tanger-Tetouan-Al Hoceima Region > Tangier (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

Borghi, Giacomo, Im, Hyesung, Pareschi, Lorenzo

arXiv.org Machine LearningMar-26-2026

Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection--mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection--mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann--Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator--mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.

artificial intelligence, evolutionary algorithm, machine learning, (17 more...)

arXiv.org Machine Learning

2603.19808

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Italy (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.34)

Add feedback

SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning

Ju, Y. Sungtaek

arXiv.org Machine LearningMar-24-2026

Probabilistic circuit (PC) structure learning is hampered by greedy algorithms that make irreversible, locally optimal decisions. We propose SymCircuit, which replaces greedy search with a learned generative policy trained via entropy-regularized reinforcement learning. Instantiating the RL-as-inference framework in the PC domain, we show the optimal policy is a tempered Bayesian posterior, recovering the exact posterior when the regularization temperature is set inversely proportional to the dataset size. The policy is implemented as SymFormer, a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits at every generation step. We introduce option-level REINFORCE, restricting gradient updates to structural decisions rather than all tokens, yielding an SNR (signal to noise ratio) improvement and >10 times sample efficiency gain on the NLTCS dataset. A three-layer uncertainty decomposition (structural via model averaging, parametric via the delta method, leaf via conjugate Dirichlet-Categorical propagation) is grounded in the multilinear polynomial structure of PC outputs. On NLTCS, SymCircuit closes 93% of the gap to LearnSPN; preliminary results on Plants (69 variables) suggest scalability.

machine learning, posterior, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2603.20392

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Add feedback

Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate

Yang, Yingzhen, Li, Ping

arXiv.org Machine LearningMar-24-2026

We study the problem of learning a low-degree spherical polynomial of degree $k_0 = Θ(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network with augmented feature in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0, Θ(d^{-k_0})]$, an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) requires a sample complexity of $n \asymp Θ( \log(4/δ) \cdot d^{k_0}/\eps)$ with probability $1-δ$ for $δ\in (0,1)$, in contrast with the representative sample complexity $Θ(d^{k_0} \max\set{\eps^{-2},\log d})$. Moreover, such sample complexity is nearly unimprovable since the trained network renders a nearly optimal rate of the nonparametric regression risk of the order $\log({4}/δ) \cdot Θ(d^{k_0}/{n})$ with probability at least $1-δ$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $Θ(d^{k_0})$ is $Θ(d^{k_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. In the case that the ground truth degree $k_0$ is unknown, we present a novel and provable adaptive degree selection algorithm which identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound is obtained by training an over-parameterized neural network with a popular activation function (ReLU) and algorithmic guarantee for learning low-degree spherical polynomials. Due to the feature learning capability of GDP, our results are beyond the regular Neural Tangent Kernel (NTK) limit.

artificial intelligence, machine learning, probability, (18 more...)

arXiv.org Machine Learning

2603.21062

Country:

Europe > Austria > Vienna (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Hard labels sampled from sparse targets mislead rotation invariant algorithms

Ghosh, Avrajit, Yu, Bin, Warmuth, Manfred, Bartlett, Peter

arXiv.org Machine LearningMar-24-2026

One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $σ(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e. the number of samples $n$ exceeds the input dimension $d$) with examples $(\mathbf{x}_i,σ(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$, it is sufficient to recover $\mathbf{w}^{\star}$ and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels $y_i$ sampled from the same conditional distribution $σ(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk $Ω\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation invariant algorithms with excess risk $O(\frac{s\log d}{n})$. The simplest rotation invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bounds uses gradient descent on the weights $u_i,v_i$, where now the linear weight $w_i$ is reparameterized as $u_iv_i$.

artificial intelligence, machine learning, regression, (19 more...)

arXiv.org Machine Learning

2603.20967

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre:

Research Report > New Finding (0.89)
Research Report > Experimental Study (0.75)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Add feedback

Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

Chandak, Siddharth, Yadav, Anuj, Ozgur, Ayfer, Bambos, Nicholas

arXiv.org Machine LearningMar-23-2026

Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2603.19648

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > New York (0.04)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Improving the Privacy and Practicality of Objective Perturbation for Differentially Private Linear Learners

Neural Information Processing SystemsMar-19-2026, 06:19:45 GMT

In the arena of privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) has outstripped the objective perturbation mechanism in popularity and interest. Though unrivaled in versatility, DP-SGD requires a non-trivial privacy overhead (for privately tuning the model's hyperparameters) and a computational complexity which might be extravagant for simple models such as linear and logistic regression. This paper revamps the objective perturbation mechanism with tighter privacy analyses and new computational tools that boost it to perform competitively with DP-SGD on unconstrained convex generalized linear problems.

artificial intelligence, machine learning, proceedings, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Add feedback

Self-Regularized Learning Methods

Schölpple, Max, Fanghui, Liu, Steinwart, Ingo

arXiv.org Machine LearningMar-19-2026

We introduce a general framework for analyzing learning algorithms based on the notion of self-regularization, which captures implicit complexity control without requiring explicit regularization. This is motivated by previous observations that many algorithms, such as gradient-descent based learning, exhibit implicit regularization. In a nutshell, for a self-regularized algorithm the complexity of the predictor is inherently controlled by that of the simplest comparator achieving the same empirical risk. This framework is sufficiently rich to cover both classical regularized empirical risk minimization and gradient descent. Building on self-regularization, we provide a thorough statistical analysis of such algorithms including minmax-optimal rates, where it suffices to show that the algorithm is self-regularized -- all further requirements stem from the learning problem itself. Finally, we discuss the problem of data-dependent hyperparameter selection, providing a general result which yields minmax-optimal rates up to a double logarithmic factor and covers data-driven early stopping for RKHS-based gradient descent.

artificial intelligence, assumption, machine learning, (17 more...)

arXiv.org Machine Learning

2603.1716

Country:

Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)

Add feedback

Mirror Descent on Riemannian Manifolds

Jiang, Jiaxin, Shi, Lei, Tan, Jiyuan

arXiv.org Machine LearningMar-19-2026

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.

artificial intelligence, machine learning, manifold, (14 more...)

arXiv.org Machine Learning

2603.17527

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback