Gradient Descent
On the Power of Context-Enhanced Learning in LLMs
Zhu, Xingyu, Panigrahi, Abhishek, Arora, Sanjeev
We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works. Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.
No Plan but Everything Under Control: Robustly Solving Sequential Tasks with Dynamically Composed Gradient Descent
We introduce a novel gradient-based approach for solving sequential tasks by dynamically adjusting the underlying myopic potential field in response to feedback and the world's regularities. This adjustment implicitly considers subgoals encoded in these regularities, enabling the solution of long sequential tasks, as demonstrated by solving the traditional planning domain of Blocks World - without any planning. Unlike conventional planning methods, our feedback-driven approach adapts to uncertain and dynamic environments, as demonstrated by one hundred real-world trials involving drawer manipulation. These experiments highlight the robustness of our method compared to planning and show how interactive perception and error recovery naturally emerge from gradient descent without explicitly implementing them. This offers a computationally efficient alternative to planning for a variety of sequential tasks, while aligning with observations on biological problem-solving strategies.
Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks
Do, Thang, Jentzen, Arnulf, Riekert, Adrian
Stochastic gradient descent (SGD) optimization methods are the method of choice to train deep artificial neural networks (ANNs) in data-driven learning problems (see, for instance, [1, 7, 33, 36, 38, 41] and the references therein) as well as scientific computing problems (see, for example, [3, 13, 19, 22, 26, 40] and the references therein). However, often not the plain vanilla standard SGD optimization method is the employed optimizer but instead more sophisticated accelerated and adaptive variants of the standard SGD method such as the adaptive moment estimation SGD (Adam) optimizer (see [32]) are used in practically relevant deep ANN training problems. We also refer, for instance, to [2, 20, 23, 29, 35, 39] for monographs and surveys treating SGD optimization methods for the training of ANNs. The considered SGD optimization method is used with the aim to minimize the true risk function (the objective function) of the considered ANN learning problem so that, roughly speaking, the realization function of the deep ANN minimizing the true risk function approximates as best as possible the output data given the input data. Despite the omnipresent use of SGD optimization methods in the training of ANNs, it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical description and explanation for the convergence (and non-convergence) properties of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal/infimal true risk value in the training of deep ANNs (cf., for example, [11, 18, 31, 34] and the literature review in Subsection 1.4 below). In this work we contribute to this open problem of research in two aspects.
Gradients can train reward models: An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model
Kang, Enoch H., Yoganarasimhan, Hema, Jain, Lalit
We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
Aligned Multi Objective Optimization
Efroni, Yonathan, Kretzu, Ben, Jiang, Daniel, Bhandari, Jalaj, Zheqing, null, Zhu, null, Ullrich, Karen
To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance tradeoffs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLMs training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, such phenomenon has not been examined from an optimization perspective. This leads to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of their superior performance compared to naive approaches.
Online Inference for Quantiles by Constant Learning-Rate Stochastic Gradient Descent
Wei, Ziyang, Li, Jiaqi, Chen, Likai, Wu, Wei Biao
This paper proposes an online inference method of the stochastic gradient descent (SGD) with a constant learning rate for quantile loss functions with theoretical guarantees. Since the quantile loss function is neither smooth nor strongly convex, we view such SGD iterates as an irreducible and positive recurrent Markov chain. By leveraging this interpretation, we show the existence of a unique asymptotic stationary distribution, regardless of the arbitrarily fixed initialization. To characterize the exact form of this limiting distribution, we derive bounds for its moment generating function and tail probabilities, controlling over the first and second moments of SGD iterates. By these techniques, we prove that the stationary distribution converges to a Gaussian distribution as the constant learning rate $\eta\rightarrow0$. Our findings provide the first central limit theorem (CLT)-type theoretical guarantees for the last iterate of constant learning-rate SGD in non-smooth and non-strongly convex settings. We further propose a recursive algorithm to construct confidence intervals of SGD iterates in an online manner. Numerical studies demonstrate strong finite-sample performance of our proposed quantile estimator and inference method. The theoretical tools in this study are of independent interest to investigate general transition kernels in Markov chains.
Vector Copula Variational Inference and Dependent Block Posterior Approximations
Fu, Yu, Smith, Michael Stanley, Panagiotelis, Anastasios
Variational inference (VI) is a popular method to estimate statistical and econometric models. The key to VI is the selection of a tractable density to approximate the Bayesian posterior. For large and complex models a common choice is to assume independence between multivariate blocks in a partition of the parameter space. While this simplifies the problem it can reduce accuracy. This paper proposes using vector copulas to capture dependence between the blocks parsimoniously. Tailored multivariate marginals are constructed using learnable cyclically monotone transformations. We call the resulting joint distribution a ``dependent block posterior'' approximation. Vector copula models are suggested that make tractable and flexible variational approximations. They allow for differing marginals, numbers of blocks, block sizes and forms of between block dependence. They also allow for solution of the variational optimization using fast and efficient stochastic gradient methods. The efficacy and versatility of the approach is demonstrated using four different statistical models and 16 datasets which have posteriors that are challenging to approximate. In all cases, our method produces more accurate posterior approximations than benchmark VI methods that either assume block independence or factor-based dependence, at limited additional computational cost.
Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo
Kim, Hyunsu, Nam, Giung, Yun, Chulhee, Yang, Hongseok, Lee, Juho
Bayesian Neural Networks (BNNs) provide a promising framework for modeling predictive uncertainty and enhancing out-of-distribution robustness (OOD) by estimating the posterior distribution of network parameters. Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) is one of the most powerful methods for scalable posterior sampling in BNNs, achieving efficiency by combining stochastic gradient descent with second-order Langevin dynamics. However, SGMCMC often suffers from limited sample diversity in practice, which affects uncertainty estimation and model performance. We propose a simple yet effective approach to enhance sample diversity in SGMCMC without the need for tempering or running multiple chains. Our approach reparameterizes the neural network by decomposing each of its weight matrices into a product of matrices, resulting in a sampling trajectory that better explores the target parameter space. This approach produces a more diverse set of samples, allowing faster mixing within the same computational budget. Notably, our sampler achieves these improvements without increasing the inference cost compared to the standard SGMCMC. Extensive experiments on image classification tasks, including OOD robustness, diversity, loss surface analyses, and a comparative study with Hamiltonian Monte Carlo, demonstrate the superiority of the proposed approach.
On the Saturation Effects of Spectral Algorithms in Large Dimensions
Lu, Weihao, Zhang, Haobo, Li, Yicheng, Lin, Qian
The saturation effects, which originally refer to the fact that kernel ridge regression (KRR) fails to achieve the information-theoretical lower bound when the regression function is over-smooth, have been observed for almost 20 years and were rigorously proved recently for kernel ridge regression and some other spectral algorithms over a fixed dimensional domain. The main focus of this paper is to explore the saturation effects for a large class of spectral algorithms (including the KRR, gradient descent, etc.) in large dimensional settings where $n \asymp d^{\gamma}$. More precisely, we first propose an improved minimax lower bound for the kernel regression problem in large dimensional settings and show that the gradient flow with early stopping strategy will result in an estimator achieving this lower bound (up to a logarithmic factor). Similar to the results in KRR, we can further determine the exact convergence rates (both upper and lower bounds) of a large class of (optimal tuned) spectral algorithms with different qualification $\tau$'s. In particular, we find that these exact rate curves (varying along $\gamma$) exhibit the periodic plateau behavior and the polynomial approximation barrier. Consequently, we can fully depict the saturation effects of the spectral algorithms and reveal a new phenomenon in large dimensional settings (i.e., the saturation effect occurs in large dimensional setting as long as the source condition $s>\tau$ while it occurs in fixed dimensional setting as long as $s>2\tau$).
Convergence of energy-based learning in linear resistive networks
Huijzer, Anne-Men, Chaffey, Thomas, Besselink, Bart, van Waarde, Henk J.
-- Energy-based learning algorithms are alternatives to backpropagation and are well-suited to distributed implementations in analog electronic devices. However, a rigorous theory of convergence is lacking. We make a first step in this direction by analysing a particular energy-based learning algorithm, Contrastive Learning, applied to a network of linear adjustable resistors. It is shown that, in this setup, Contrastive Learning is equivalent to projected gradient descent on a convex function, for any step size, giving a guarantee of convergence for the algorithm. Backpropagation is the most popular method of training artificial neural networks. However, while artificial neural networks are inspired by biological nervous systems, it has long been observed that backpropagation is not biologically plausible [1]-[3]. Several biologically plausible alternatives to backpropagation have been proposed in the literature, among them so-called energy-based learning algorithms [4]- [11]. These algorithms apply to energy-based models, which come equipped with some generalized notion of energy, and associate to each input a minimum of this energy. The basic idea is to probe the system in two states, one free and one clamped, or dictated by the training data, and use the energy difference between these states as a cost function. An iterative procedure is then applied to minimise this cost function. Several clamping mechanisms and iterative procedures have been defined, among them Contrastive Learning [4], [5], [12], Equilibrium Propagation [7], Coupled Learning [9] and Temporal Contrastive Learning [13]. These algorithms all resemble gradient descent, where the gradient of the cost function is replaced by a gradient-like quantity which may be computed in a distributed manner across a network. The energy-based learning paradigm is particularly suited to learning in analog electronic devices, as they have a natural notion of generalized energy: the heat dissipated by electrical resistance (in this case, a power rather than energy). M. A. Huijzer, B. Besselink, and H.J. van Waarde are with the Bernoulli Institute for Mathematics, Computer Science, and Artificial Intelligence, University of Groningen, Groningen, The Netherlands; email: m.a.huijzer@rug.nl; Chaffey was with the Control Group, Department of Engineering, University of Cambridge, UK, and is now with the School of Electrical and Computer Engineering, University of Sydney, Australia; email: thomas.chaffey@sydney.edu.au. This is, in part, due to the ability of analog circuits to perform inference many times faster than conventional neural networks [20]-[22].