AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Neural Information Processing SystemsMay-26-2025, 15:23:35 GMT

HyperLogic: Enhancing Diversity and Accuracy in Rule Learning with HyperNets

Exploring the integration of if-then logic rules within neural network architectures presents an intriguing area. This integration seamlessly transforms the rule learning task into neural network training using backpropagation and stochastic gradient descent. From a well-trained sparse and shallow neural network, one can interpret each layer and neuron through the language of logic rules, and a global explanatory rule set can be directly extracted. However, ensuring interpretability may impose constraints on the flexibility, depth, and width of neural networks. In this paper, we propose HyperLogic: a novel framework leveraging hypernetworks to generate weights of the main network.

artificial intelligence, hyperlogic, machine learning, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.64)

Neural Information Processing SystemsMay-26-2025, 15:21:32 GMT

Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization

This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of \mathbf{U} and \mathbf{V}). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance the \mathbf{U} and \mathbf{V} .This paper presents the first proof that shows randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem with a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives.

artificial intelligence, machine learning, mathbf, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Neural Information Processing SystemsMay-26-2025, 15:02:24 GMT

Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays

The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay. On the contrary, we prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices used to implement the algorithm. Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider. For our analysis, we introduce a novel recursion based on virtual iterates'' and delay-adaptive stepsizes, which allow us to derive state-of-the-art guarantees for both convex and non-convex objectives.

artificial intelligence, asynchronous sgd beat minibatch sgd, machine learning, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Neural Information Processing SystemsMay-26-2025, 14:58:07 GMT

Understanding End-to-End Model-Based Reinforcement Learning Methods as Implicit Parameterization

Estimating the per-state expected cumulative rewards is a critical aspect of reinforcement learning approaches, however the experience is obtained, but standard deep neural-network function-approximation methods are often inefficient in this setting. An alternative approach, exemplified by value iteration networks, is to learn transition and reward models of a latent Markov decision process whose value predictions fit the data. This approach has been shown empirically to converge faster to a more robust solution in many cases, but there has been little theoretical study of this phenomenon. In this paper, we explore such implicit representations of value functions via theory and focused experimentation. We prove that, for a linear parametrization, gradient descent converges to global optima despite non-linearity and non-convexity introduced by the implicit representation.

artificial intelligence, end-to-end model-based reinforcement learning method, machine learning, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Mishra, Abhiti, Patel, Yash, Tewari, Ambuj

Continuum Transformers Perform In-Context Learning by Operator Gradient Descent

arXiv.org Machine LearningMay-26-2025

Transformers robustly exhibit the ability to perform in-context learning, whereby their predictive accuracy on a task can increase not by parameter updates but merely with the placement of training samples in their context windows. Recent works have shown that transformers achieve this by implementing gradient descent in their forward passes. Such results, however, are restricted to standard transformer architectures, which handle finite-dimensional inputs. In the space of PDE surrogate modeling, a generalization of transformers to handle infinite-dimensional function inputs, known as "continuum transformers," has been proposed and similarly observed to exhibit in-context learning. Despite impressive empirical performance, such in-context learning has yet to be theoretically characterized. We herein demonstrate that continuum transformers perform in-context operator learning by performing gradient descent in an operator RKHS. We demonstrate this using novel proof strategies that leverage a generalized representer theorem for Hilbert spaces and gradient flows over the space of functionals of a Hilbert space. We additionally show the operator learned in context is the Bayes Optimal Predictor in the infinite depth limit of the transformer. We then provide empirical validations of this optimality result and demonstrate that the parameters under which such gradient descent is performed are recovered through the continuum transformer training.

artificial intelligence, machine learning, operator, (15 more...)

2505.17838

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Carter, Selina, Kuchibhotla, Arun K

Statistical Inference for Online Algorithms

arXiv.org Machine LearningMay-26-2025

Construction of confidence intervals and hypothesis tests for functionals based on asymptotically normal estimators is a classical topic in statistical inference. The simplest and in many cases optimal inference procedure is the Wald interval or the likelihood ratio test, both of which require an estimator and an estimate of the asymptotic variance of the estimator. Estimators obtained from online/sequential algorithms forces one to consider the computational aspects of the inference problem, i.e., one cannot access all of the data as many times as needed. Several works on this topic explored the online estimation of asymptotic variance. In this article, we propose computationally efficient, rate-optimal, and asymptotically valid confidence regions based on the output of online algorithms {\em without} estimating the asymptotic variance. As a special case, this implies inference from any algorithm that yields an asymptotically normal estimator. We focus our efforts on stochastic gradient descent with Polyak averaging to understand the practical performance of the proposed method.

artificial intelligence, machine learning, regression, (17 more...)

2505.173

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)

Cortild, Daniel, Ketels, Lucas, Peypouquet, Juan, Garrigos, Guillaume

New Tight Bounds for SGD without Variance Assumption: A Computer-Aided Lyapunov Analysis

arXiv.org Machine LearningMay-26-2025

The analysis of Stochastic Gradient Descent (SGD) often relies on making some assumption on the variance of the stochastic gradients, which is usually not satisfied or difficult to verify in practice. This paper contributes to a recent line of works which attempt to provide guarantees without making any variance assumption, leveraging only the (strong) convexity and smoothness of the loss functions. In this context, we prove new theoretical bounds derived from the monotonicity of a simple Lyapunov energy, improving the current state-of-the-art and extending their validity to larger step-sizes. Our theoretical analysis is backed by a Performance Estimation Problem analysis, which allows us to claim that, empirically, the bias term in our bounds is tight within our framework.

artificial intelligence, assumption, machine learning, (17 more...)

2505.17965

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Netherlands (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

arXiv.org Machine LearningMay-23-2025

Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks

Hashimoto, Ichiro

In this paper, we prove directional convergence of network parameters of fixed width leaky ReLU two-layer neural networks optimized by gradient descent with exponential loss, which was previously only known for gradient flow. By a careful analysis of the convergent direction, we establish sufficient conditions of benign overfitting and discover a new phase transition in the test error bound. All of these results hold beyond the nearly orthogonal data setting which was studied in prior works. As an application, we demonstrate that benign overfitting occurs with high probability in sub-Gaussian mixture models.

artificial intelligence, assumption 4, machine learning, (14 more...)

2505.16204

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Bombari, Simone, Seroussi, Inbar, Mondelli, Marco

Better Rates for Private Linear Regression in the Proportional Regime via Aggressive Clipping

arXiv.org Machine LearningMay-23-2025

Differentially private (DP) linear regression has received significant attention in the recent theoretical literature, with several works aimed at obtaining improved error rates. A common approach is to set the clipping constant much larger than the expected norm of the per-sample gradients. While simplifying the analysis, this is however in sharp contrast with what empirical evidence suggests to optimize performance. Our work bridges this gap between theory and practice: we provide sharper rates for DP stochastic gradient descent (DP-SGD) by crucially operating in a regime where clipping happens frequently. Specifically, we consider the setting where the data is multivariate Gaussian, the number of training samples $n$ is proportional to the input dimension $d$, and the algorithm guarantees constant-order zero concentrated DP. Our method relies on establishing a deterministic equivalent for the trajectory of DP-SGD in terms of a family of ordinary differential equations (ODEs). As a consequence, the risk of DP-SGD is bounded between two ODEs, with upper and lower bounds matching for isotropic data. By studying these ODEs when $n / d$ is large enough, we demonstrate the optimality of aggressive clipping, and we uncover the benefits of decaying learning rate and private noise scheduling.

artificial intelligence, machine learning, probability, (16 more...)

2505.16329

Country:

North America > United States (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Austria (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)