Gradient Descent
ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method
In this paper, we propose a novel accelerated gradient method called ANITA for solving fundamental finite-sum optimization problems. Concretely, we consider both general convex and strongly convex settings: i) For general convex finite-sum problems, ANITA improves on the previous state-of-the-art result given by Varag (Lan et al., 2019). In particular, for large-scale problems or when the target convergence error is not very small, i.e., $n \geq \frac{1}{\epsilon^2}$, ANITA obtains the \emph{first} optimal result $O(n)$, matching the lower bound $\Omega(n)$ provided by Woodworth and Srebro (2016), whereas previous results are $O(n \log \frac{1}{\epsilon})$ for Varag (Lan et al., 2019) and $O(\frac{n}{\sqrt{\epsilon}})$ for Katyusha (Allen-Zhu, 2017). ii) For strongly convex finite-sum problems, we also show that ANITA can achieve the optimal convergence rate $O\big((n+\sqrt{\frac{nL}{\mu}})\log\frac{1}{\epsilon}\big)$, matching the lower bound $\Omega\big((n+\sqrt{\frac{nL}{\mu}})\log\frac{1}{\epsilon}\big)$ provided by Lan and Zhou (2015). Besides, ANITA enjoys a simpler loopless algorithmic structure, unlike previous accelerated algorithms such as Varag (Lan et al., 2019) and Katyusha (Allen-Zhu, 2017), which use double-loop structures. Moreover, we provide a novel \emph{dynamic multi-stage convergence analysis}, which is the key technical part for improving previous results to the optimal rates. We believe that our new theoretical rates and novel convergence analysis for the fundamental finite-sum problem will directly lead to key improvements for many other related problems, such as distributed/federated/decentralized optimization problems (e.g., Li and Richtárik, 2021). Finally, the numerical experiments show that ANITA converges faster than the previous state-of-the-art Varag (Lan et al., 2019), validating our theoretical results and confirming the practical superiority of ANITA.
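To make the "loopless" idea concrete, here is a minimal sketch of a loopless SVRG-style variance-reduced gradient step on a toy scalar least-squares problem. It is not ANITA itself: it omits ANITA's acceleration entirely, and the function name `loopless_svrg`, the toy data, the step size, and the snapshot probability are all assumptions chosen for illustration. The point is only that a coin flip can replace the outer loop used by double-loop methods.

```python
import random


def loopless_svrg(a, b, step_size, snapshot_prob, num_steps, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2 over a scalar x.

    Illustrative only: a plain (non-accelerated) loopless variance-reduced method.
    """
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        # Gradient of the i-th component 0.5 * (a[i] * x - b[i])**2.
        return a[i] * (a[i] * x - b[i])

    def full_grad(x):
        # Gradient of the averaged objective (costs n component gradients).
        return sum(grad_i(i, x) for i in range(n)) / n

    x = 0.0
    snapshot = x
    snapshot_grad = full_grad(snapshot)
    for _ in range(num_steps):
        i = rng.randrange(n)
        # Variance-reduced estimate: unbiased for the full gradient at x.
        g = grad_i(i, x) - grad_i(i, snapshot) + snapshot_grad
        x -= step_size * g
        # Loopless step: with small probability, refresh the snapshot.
        # This coin flip plays the role of the outer loop in double-loop methods.
        if rng.random() < snapshot_prob:
            snapshot = x
            snapshot_grad = full_grad(snapshot)
    return x


# Hypothetical toy data; the largest component smoothness constant is 3**2 = 9.
a = [1.0, 2.0, 3.0]
b = [1.0, 3.0, 2.0]
x_hat = loopless_svrg(a, b, step_size=1.0 / 54.0, snapshot_prob=1.0 / 3.0, num_steps=2000)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)  # closed-form minimizer
```

The snapshot probability of roughly $1/n$ keeps the expected cost per step at a constant number of component gradients, which is the usual motivation for this design.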
Manifold Free Riemannian Optimization
Shustin, Boris, Avron, Haim, Sober, Barak
Riemannian optimization is a principled framework for solving optimization problems where the desired optimum is constrained to a smooth manifold $\mathcal{M}$. Algorithms designed in this framework usually require some geometrical description of the manifold, which typically includes tangent spaces, retractions, and gradients of the cost function. However, in many cases, only a subset (or none at all) of these elements can be accessed due to lack of information or intractability. In this paper, we propose a novel approach that can perform approximate Riemannian optimization in such cases, where the constraining manifold is a submanifold of $\mathbb{R}^{D}$. At the bare minimum, our method requires only a noiseless sample set of the cost function $(\mathbf{x}_{i}, y_{i})\in {\mathcal{M}} \times \mathbb{R}$ and the intrinsic dimension of the manifold $\mathcal{M}$. Using the samples, and utilizing the Manifold-MLS framework (Sober and Levin 2020), we construct approximations of the missing components that enjoy provable guarantees, and we analyze their computational costs. In case some of the components are given analytically (e.g., if the cost function and its gradient are given explicitly, or if the tangent spaces can be computed), the algorithm can be easily adapted to use the accurate expressions instead of the approximations. We analyze the global convergence of Riemannian gradient-based methods using our approach, and we demonstrate empirically the strength of this method, together with a conjugate-gradients type method based upon similar principles.
Concept-modulated model-based offline reinforcement learning for rapid generalization
Ketz, Nicholas A., Pilly, Praveen K.
The robustness of any machine learning solution is fundamentally bound by the data it was trained on. One way to generalize beyond the original training is through human-informed augmentation of the original dataset; however, it is impossible to specify all possible failure cases that can occur during deployment. To address this limitation we combine model-based reinforcement learning and model-interpretability methods to propose a solution that self-generates simulated scenarios constrained by environmental concepts and dynamics learned in an unsupervised manner. In particular, an internal model of the agent's environment is conditioned on low-dimensional concept representations of the input space that are sensitive to the agent's actions. We demonstrate this method within a standard realistic driving simulator in a simple point-to-point navigation task, where we show dramatic improvements in one-shot generalization to different instances of specified failure cases as well as zero-shot generalization to similar variations compared to model-based and model-free approaches.
How the Adam Optimization technique works (Artificial Intelligence)
Abstract: A common way to train neural networks is backpropagation. This algorithm includes a gradient descent method, which needs an adaptive step size. In the area of neural networks, the ADAM optimizer is one of the most popular adaptive step size methods. Its 5865 citations in only three years further show the importance of the original paper. We discovered that the convergence proof given for the optimizer contains some mistakes, so the proof is incorrect.
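For readers who want to see what the adaptive step size looks like in practice, the following is a minimal scalar sketch of the commonly stated Adam update (exponential moving averages of the gradient and its square, with bias correction). The function name `adam_minimize`, the toy objective, and the chosen hyperparameters are assumptions for illustration; this sketch says nothing about the convergence proof discussed in the abstract.

```python
import math


def adam_minimize(grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
    """Run the standard Adam update on a scalar parameter (illustrative sketch)."""
    x = x0
    m = 0.0  # moving average of gradients (first moment)
    v = 0.0  # moving average of squared gradients (second moment)
    for t in range(1, num_steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v being initialized at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The effective step size adapts to the recent gradient magnitude.
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
def toy_grad(x):
    return 2.0 * (x - 3.0)


x_one_step = adam_minimize(toy_grad, x0=0.0, lr=0.1, num_steps=1)
x_final = adam_minimize(toy_grad, x0=0.0, lr=0.1, num_steps=2000)
```

Note that the very first step has size close to `lr` regardless of the gradient's scale, because the bias-corrected ratio `m_hat / sqrt(v_hat)` equals the sign of the gradient at `t = 1`.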
Finite-Time Error Bounds for Greedy-GQ
Wang, Yue, Zhou, Yi, Zou, Shaofeng
Greedy-GQ with linear function approximation, originally proposed by Maei et al. (2010), is a value-based off-policy algorithm for optimal control in reinforcement learning, and it has a non-linear two timescale structure with a non-convex objective function. This paper develops its finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as $\mathcal{O}({1}/{\sqrt{T}})$ under the i.i.d.\ setting and $\mathcal{O}({\log T}/{\sqrt{T}})$ under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is $\mathcal{O}({\log(1/\epsilon)\epsilon^{-2}})$, which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of stochastic gradient descent algorithms for general smooth non-convex optimization problems. Our finite-sample analysis provides theoretical guidance on choosing step-sizes for faster convergence in practice and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques in this paper provide a general approach for finite-sample analysis of non-convex two timescale value-based reinforcement learning algorithms.
Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems
Ye, Mao, Jiang, Ruichen, Wang, Haoxiang, Choudhary, Dhruv, Du, Xiaocong, Bhushanam, Bhargav, Mokhtari, Aryan, Kejariwal, Arun, Liu, Qiang
One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes a mismatch between the training and testing data distributions and hence domain generalization error. To overcome this, we propose to learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training, so that the recommendation model can be trained as if we were able to look ahead at the future of its deployment. Compared with Batch Update, a widely used paradigm, our theory suggests that the proposed algorithm achieves a smaller temporal domain generalization error, measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines.
Learning Stochastic Graph Neural Networks with Constrained Variance
Stochastic graph neural networks (SGNNs) are information processing architectures that learn representations from data over random graphs. SGNNs are trained with respect to the expected performance, which comes with no guarantee about deviations of particular output realizations around the optimal expectation. To overcome this issue, we propose a variance-constrained optimization problem for SGNNs, balancing the expected performance and the stochastic deviation. An alternating primal-dual learning procedure is undertaken that solves the problem by updating the SGNN parameters with gradient descent and the dual variable with gradient ascent. To characterize the explicit effect of the variance-constrained learning, we conduct a theoretical analysis on the variance of the SGNN output and identify a trade-off between the stochastic robustness and the discrimination power. We further analyze the duality gap of the variance-constrained optimization problem and the converging behavior of the primal-dual learning procedure. The former indicates the optimality loss induced by the dual transformation and the latter characterizes the limiting error of the iterative algorithm, both of which guarantee the performance of the variance-constrained learning. Through numerical simulations, we corroborate our theoretical findings and observe a strong expected performance with a controllable standard deviation.
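The alternating primal-dual procedure described above can be illustrated on a toy scalar problem. The sketch below is not the SGNN training procedure: the objective $(x-2)^2$, the constraint $x - 1 \le 0$ (standing in for a variance constraint), the function name `primal_dual`, and the step size are all assumptions chosen so that the answer is known in closed form ($x^* = 1$, $\lambda^* = 2$).

```python
def primal_dual(step_size=0.05, num_steps=2000):
    """Alternating gradient descent (primal) and projected gradient ascent (dual).

    Toy problem: minimize (x - 2)^2 subject to x - 1 <= 0, with Lagrangian
    L(x, lam) = (x - 2)^2 + lam * (x - 1).
    """
    x, lam = 0.0, 0.0
    for _ in range(num_steps):
        # Primal step: descend on the Lagrangian with respect to x.
        x -= step_size * (2.0 * (x - 2.0) + lam)
        # Dual step: ascend with respect to lam, then project onto lam >= 0.
        lam = max(0.0, lam + step_size * (x - 1.0))
    return x, lam


x_opt, lam_opt = primal_dual()
```

The dual variable grows only while the constraint is violated, which is how it balances the objective against the constraint.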
Gradient Descent for Machine Learning - A Beginners Playbook
Gradient descent is the most widely used optimization strategy in machine learning and deep learning. When it comes to training models, gradient descent is combined with other algorithms, and it is easy to implement and understand. It is commonly understood that anyone who wants to work in machine learning must understand these concepts in detail. This article curates information from different sources so that you can learn the basics. This week, I was given a task on gradient descent in my MSc AI course. If you are new to this journal, Open Tech Talks is your weekly sandbox for technology insights, experimentation, and inspiration with the primary objective of learning and sharing.
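As a quick illustration of the basics, here is a minimal sketch of gradient descent with a fixed step size on a one-dimensional function. The function name `gradient_descent` and the toy objective are assumptions made for this example, not something taken from the article.

```python
def gradient_descent(grad, x0, step_size=0.1, num_steps=100):
    """Minimize a 1-D function given its gradient, using a fixed step size."""
    x = x0
    for _ in range(num_steps):
        # Move a small step against the gradient, i.e. downhill.
        x = x - step_size * grad(x)
    return x


# Hypothetical example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With this step size, each iteration shrinks the distance to the minimum by a factor of 0.8, so 100 steps are more than enough here; a step size that is too large would instead make the iterates diverge.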
Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization
Xie, Zeke, Yuan, Li, Zhu, Zhanxing, Sugiyama, Masashi
It is well known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential for both optimization and generalization of deep networks. Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning. However, it turned out that simple injected random noise does not work as well as SGN, which is anisotropic and parameter-dependent. For simulating SGN at low computational cost and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach, a powerful alternative to conventional Momentum in classic optimizers. The introduced PNM method maintains two approximately independent momentum terms. Then, we can control the magnitude of SGN explicitly by adjusting the momentum difference. We theoretically prove the convergence guarantee and the generalization advantage of PNM over Stochastic Gradient Descent (SGD). By incorporating PNM into two conventional optimizers, SGD with Momentum and Adam, our extensive experiments empirically verify the significant advantage of the PNM-based variants over the corresponding conventional Momentum-based optimizers.
TAG: Task-based Accumulated Gradients for Lifelong learning
Malviya, Pranshu, Ravindran, Balaraman, Chandar, Sarath
When an agent encounters a continual stream of new tasks in the lifelong learning setting, it leverages the knowledge it gained from the earlier tasks to help learn the new tasks better. In such a scenario, identifying an efficient knowledge representation becomes a challenging problem. Most research works propose to either store a subset of examples from the past tasks in a replay buffer, dedicate a separate set of parameters to each task or penalize excessive updates over parameters by introducing a regularization term. While existing methods employ the general task-agnostic stochastic gradient descent update rule, we propose a task-aware optimizer that adapts the learning rate based on the relatedness among tasks. We utilize the directions taken by the parameters during the updates by additively accumulating the gradients specific to each task. These task-based accumulated gradients act as a knowledge base that is maintained and updated throughout the stream. We empirically show that our proposed adaptive learning rate not only accounts for catastrophic forgetting but also exhibits knowledge transfer. We also show that our method performs better than several state-of-the-art methods in lifelong learning on complex datasets. Moreover, our method can also be combined with the existing methods and achieve substantial improvement in performance. Lifelong learning (LLL), also known as continual learning, is a setting where an agent continuously learns from data belonging to different tasks (Parisi et al., 2019). Here, the goal is to maximize performance on all the tasks arriving in a stream without replaying the entire datasets from past tasks (Riemer et al., 2018). 
Approaches proposed in this setting involve investigating the stability-plasticity dilemma (Mermillod et al., 2013) in different ways, where stability refers to preventing the forgetting of past knowledge and plasticity refers to accumulating new knowledge by learning new tasks (Mermillod et al., 2013; Delange et al., 2021). Unlike human beings, who can efficiently assess the correctness and applicability of past knowledge (Chen & Liu, 2018), neural networks and other machine learning models often face various issues in this setting. Whenever data from a new task arrives, these models often tend to forget the previously obtained knowledge due to dependency on the input data distribution, limited capacity, diversity among tasks, etc.