Gradient Descent
Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Optimization
Liao, Luofeng, Shen, Li, Duan, Jia, Kolar, Mladen, Tao, Dacheng
Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks. Despite their wide applicability, solving such problems efficiently and effectively is challenging in the presence of large amounts of data using existing stochastic minimax methods. We study a class of stochastic minimax methods and develop a communication-efficient distributed stochastic extragradient algorithm, LocalAdaSEG, with an adaptive learning rate suitable for solving convex-concave minimax problems in the Parameter-Server model. LocalAdaSEG has three main features: (i) a periodic communication strategy that reduces the communication cost between workers and the server; (ii) an adaptive learning rate that is computed locally and allows for tuning-free implementation; and (iii) theoretically, a nearly linear speed-up with respect to the dominant variance term, arising from the estimation of the stochastic gradient, is proven in both the smooth and nonsmooth convex-concave settings. LocalAdaSEG is used to solve a stochastic bilinear game, and train a generative adversarial network. We compare LocalAdaSEG against several existing optimizers for minimax problems and demonstrate its efficacy through several experiments in both homogeneous and heterogeneous settings.
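The two ingredients named in this abstract, an extragradient update and a step size computed from observed gradients, can be illustrated on a small bilinear game. The following is a minimal single-worker sketch, not LocalAdaSEG itself: there is no periodic communication, and the AdaGrad-norm step size rule, the function name `adaptive_extragradient`, and the parameters `D` and `noise_std` are assumptions made for illustration.

```python
import numpy as np

def adaptive_extragradient(A, x0, y0, steps=1000, D=1.0, noise_std=0.0, seed=0):
    """Extragradient with an AdaGrad-norm step size for min_x max_y x^T A y.

    Illustrative only: the step size rule is a generic AdaGrad-norm choice,
    not the rule analysed in the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    y = np.asarray(y0, dtype=float).copy()
    accum = 0.0  # running sum of squared operator norms

    def operator(x, y):
        # F(x, y) = (grad_x f, -grad_y f) for f(x, y) = x^T A y, optionally
        # perturbed by Gaussian noise to mimic stochastic gradients.
        gx = A @ y + noise_std * rng.standard_normal(x.shape)
        gy = -(A.T @ x) + noise_std * rng.standard_normal(y.shape)
        return gx, gy

    for _ in range(steps):
        gx, gy = operator(x, y)
        accum += gx @ gx + gy @ gy
        eta = D / np.sqrt(1.0 + accum)  # step size from observed gradients only
        # Extrapolation step: look ahead from the current point.
        x_half, y_half = x - eta * gx, y - eta * gy
        # Update step: move from the current point using the look-ahead operator.
        gx_h, gy_h = operator(x_half, y_half)
        x, y = x - eta * gx_h, y - eta * gy_h
    return x, y
```

For a noiseless game such as `A = np.array([[1.0, 0.5], [0.0, 1.0]])` started at `x0 = y0 = [0.5, 0.5]`, the iterates should contract toward the saddle point at the origin. No learning rate needs to be tuned by hand beyond the scale `D`.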
Error Mitigation-Aided Optimization of Parameterized Quantum Circuits: Convergence Analysis
Jose, Sharu Theresa, Simeone, Osvaldo
Variational quantum algorithms (VQAs) offer the most promising path to obtaining quantum advantages via noisy intermediate-scale quantum (NISQ) processors. Such systems leverage classical optimization to tune the parameters of a parameterized quantum circuit (PQC). The goal is minimizing a cost function that depends on measurement outputs obtained from the PQC. Optimization is typically implemented via stochastic gradient descent (SGD). On NISQ computers, gate noise due to imperfections and decoherence affects the stochastic gradient estimates by introducing a bias. Quantum error mitigation (QEM) techniques can reduce the estimation bias without requiring any increase in the number of qubits, but they in turn cause an increase in the variance of the gradient estimates. This work studies the impact of quantum gate noise on the convergence of SGD for the variational eigensolver (VQE), a fundamental instance of VQAs. The main goal is ascertaining conditions under which QEM can enhance the performance of SGD for VQEs. It is shown that quantum gate noise induces a non-zero error-floor on the convergence error of SGD (evaluated with respect to a reference noiseless PQC), which depends on the number of noisy gates, the strength of the noise, as well as the eigenspectrum of the observable being measured and minimized. In contrast, with QEM, any arbitrarily small error can be obtained. Furthermore, for error levels attainable with or without QEM, QEM can reduce the number of required iterations, but only as long as the quantum noise level is sufficiently small, and a sufficiently large number of measurements is allowed at each SGD iteration. Numerical examples for a max-cut problem corroborate the main theoretical findings.
STEADY: Simultaneous State Estimation and Dynamics Learning from Indirect Observations
Wei, Jiayi, Holtz, Jarrett, Dillig, Isil, Biswas, Joydeep
Accurate kinodynamic models play a crucial role in many robotics applications such as off-road navigation and high-speed driving. Many state-of-the-art approaches in learning stochastic kinodynamic models, however, require precise measurements of robot states as labeled input/output examples, which can be hard to obtain in outdoor settings due to limited sensor capabilities and the absence of ground truth. In this work, we propose a new technique for learning neural stochastic kinodynamic models from noisy and indirect observations by performing simultaneous state estimation and dynamics learning. The proposed technique iteratively improves the kinodynamic model in an expectation-maximization loop, where the E Step samples posterior state trajectories using particle filtering, and the M Step updates the dynamics to be more consistent with the sampled trajectories via stochastic gradient ascent. We evaluate our approach on both simulation and real-world benchmarks and compare it with several baseline techniques. Our approach not only achieves significantly higher accuracy but is also more robust to observation noise, thereby showing promise for boosting the performance of many other robotics applications.
Two-terminal source coding with common sum reconstruction
Adikari, Tharindu, Draper, Stark
We present the problem of two-terminal source coding with Common Sum Reconstruction (CSR). Both terminals want to reconstruct the sum of the two sources under some average distortion constraint, and the reconstructions at the two terminals must be identical with high probability. We employ existing achievability results for Steinberg's common reconstruction and Wyner-Ziv's source … We obtain … is a Doubly Symmetric Binary Source (DSBS) and d(,) is the Hamming distortion measure. For example, let the two terminals in Figure 1 be two compute nodes optimizing some function with synchronous SGD. The two stochastic gradients are correlated since they are noisy estimates of the gradient of the function. The butterfly all-reduce algorithm employs the two-terminal communication setup as a basic building block. Figure 1: The dashed line separates the two terminals.
Improved Binary Forward Exploration: Learning Rate Scheduling Method for Stochastic Optimization
A gradient-based optimization approach that automatically schedules the learning rate, called Binary Forward Exploration (BFE), was proposed recently, and an adaptive version of BFE has been discussed since. In this paper, we investigate improved algorithms based on both, with the aim of increasing the efficiency and robustness of the methodology. The improved approach offers a new perspective on scheduling learning rate updates and is compared with stochastic gradient descent (SGD) with momentum or Nesterov momentum, and with successful adaptive learning rate algorithms such as Adam. The goal is not to beat other methods but to provide a different viewpoint on optimizing the gradient descent process. The approach combines the advantages of first-order and second-order optimization in terms of speed and efficiency.
Theoretical Analysis of Primal-Dual Algorithm for Non-Convex Stochastic Decentralized Optimization
Takezawa, Yuki, Niwa, Kenta, Yamada, Makoto
In recent years, decentralized learning has emerged as a powerful tool not only for large-scale machine learning, but also for preserving privacy. One of the key challenges in decentralized learning is that the data distributions held by the nodes are statistically heterogeneous. To address this challenge, a primal-dual algorithm called Edge-Consensus Learning (ECL) was proposed and was experimentally shown to be robust to the heterogeneity of data distributions. However, the convergence rate of the ECL is known only when the objective function is convex, and has not been established in the standard machine learning setting where the objective function is non-convex. Furthermore, the intuitive reason why the ECL is robust to the heterogeneity of data distributions has not been investigated. In this work, we first investigate the relationship between the ECL and the Gossip algorithm and show that the update formulas of the ECL can be regarded as correcting the local stochastic gradient in the Gossip algorithm. Then, we propose the Generalized ECL (G-ECL), which contains the ECL as a special case, and provide convergence rates for the G-ECL in both (strongly) convex and non-convex settings, which do not depend on the heterogeneity of data distributions. Through synthetic experiments, we demonstrate that the numerical results of both the G-ECL and ECL coincide with the convergence rate of the G-ECL.
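For readers unfamiliar with the Gossip algorithm that this abstract uses as its reference point, the following is a minimal sketch of gossip-style decentralized gradient descent on a ring. It is plain gossip averaging, not the ECL or G-ECL. The quadratic local objectives, the ring topology, and the names `ring_mixing_matrix` and `gossip_gd` are assumptions chosen for illustration, and exact local gradients stand in for stochastic ones.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic matrix: each node averages itself and its two ring neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W

def gossip_gd(targets, lr=0.1, steps=500):
    """Gossip-style decentralized gradient descent.

    Node i holds the local objective f_i(x) = 0.5 * ||x - targets[i]||^2,
    so heterogeneous targets model heterogeneous local data. Each round,
    every node takes a local gradient step and then averages with its neighbours.
    """
    n, d = targets.shape
    W = ring_mixing_matrix(n)
    X = np.zeros((n, d))  # row i is the parameter vector held by node i
    for _ in range(steps):
        local_grads = X - targets          # gradient of each f_i at its own iterate
        X = W @ (X - lr * local_grads)     # local step followed by neighbour averaging
    return X
```

In this toy setting the network average of the node parameters converges to the minimizer of the summed objective (the mean of the targets), while the individual nodes generally keep a residual disagreement under a constant step size when the targets differ. That residual is one simple way to see why heterogeneity matters for gossip-type methods.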
Projected Gradient Descent Algorithms for Solving Nonlinear Inverse Problems with Generative Priors
In this paper, we propose projected gradient descent (PGD) algorithms for signal estimation from noisy nonlinear measurements. We assume that the unknown $p$-dimensional signal lies near the range of an $L$-Lipschitz continuous generative model with bounded $k$-dimensional inputs. In particular, we consider two cases when the nonlinear link function is either unknown or known. For unknown nonlinearity, similarly to \cite{liu2020generalized}, we make the assumption of sub-Gaussian observations and propose a linear least-squares estimator. We show that when there is no representation error and the sensing vectors are Gaussian, roughly $O(k \log L)$ samples suffice to ensure that a PGD algorithm converges linearly to a point achieving the optimal statistical rate using arbitrary initialization. For known nonlinearity, we assume monotonicity as in \cite{yang2016sparse}, and make much weaker assumptions on the sensing vectors and allow for representation error. We propose a nonlinear least-squares estimator that is guaranteed to enjoy an optimal statistical rate. A corresponding PGD algorithm is provided and is shown to also converge linearly to the estimator using arbitrary initialization. In addition, we present experimental results on image datasets to demonstrate the performance of our PGD algorithms.
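The projected gradient descent iteration described above can be sketched in a heavily simplified setting. The example below assumes a linear "generative model" G(z) = Wz with orthonormal columns (so it is 1-Lipschitz and the projection onto its range has a closed form), an identity link function, Gaussian sensing vectors, and no representation error. The names `project_onto_range`, `pgd`, and `demo`, and all default parameter values, are hypothetical; this is not the authors' estimator or experimental setup.

```python
import numpy as np

def project_onto_range(x, W, radius):
    """Project x onto {W z : ||z||_2 <= radius}; valid when W has orthonormal columns."""
    z = W.T @ x
    norm = np.linalg.norm(z)
    if norm > radius:
        z = z * (radius / norm)
    return W @ z

def pgd(A, y, W, radius, step=1.0, iters=100):
    """Projected gradient descent on the least-squares loss ||A x - y||^2 / (2 m)."""
    m, p = A.shape
    x = np.zeros(p)  # any starting point could be used here
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / m
        x = project_onto_range(x - step * grad, W, radius)
    return x

def demo(p=50, k=5, m=200, noise_std=0.01, seed=0):
    """Recover a signal in the range of a toy linear generator; returns relative error."""
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((p, k)))  # orthonormal columns
    z_true = rng.standard_normal(k)
    z_true *= 0.5 / np.linalg.norm(z_true)            # latent input inside the unit ball
    x_true = W @ z_true
    A = rng.standard_normal((m, p))                   # Gaussian sensing vectors
    y = A @ x_true + noise_std * rng.standard_normal(m)
    x_hat = pgd(A, y, W, radius=1.0)
    return np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

Here the number of measurements (`m = 200`) is well above the latent dimension (`k = 5`), so the iteration should contract quickly toward the true signal. For a neural generative model, the projection step would itself require an inner optimization and is not covered by this sketch.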
MAC: A Meta-Learning Approach for Feature Learning and Recombination
Tiwari, S., Gogoi, M., Verma, S., Singh, K. P.
Optimization-based meta-learning aims to learn an initialization so that a new unseen task can be learned within a few gradient updates. Model Agnostic Meta-Learning (MAML) is a benchmark algorithm comprising two optimization loops. The inner loop is dedicated to learning a new task and the outer loop leads to meta-initialization. However, the ANIL (almost no inner loop) algorithm shows that feature reuse is an alternative to rapid learning in MAML. Thus, the meta-initialization phase makes MAML primed for feature reuse and obviates the need for rapid learning. Contrary to ANIL, we hypothesize that there may be a need to learn new features during meta-testing. A new unseen task from a non-similar distribution would necessitate rapid learning in addition to reuse and recombination of existing features. In this paper, we invoke the width-depth duality of neural networks, wherein we increase the width of the network by adding extra computational units (ACUs). The ACUs enable the learning of new atomic features in the meta-testing task, and the associated increased width facilitates information propagation in the forward pass. The newly learnt features combine with existing features in the last layer for meta-learning. Experimental results show that our proposed MAC method outperformed the existing ANIL algorithm for non-similar task distributions by approximately 13% (5-shot task setting).
Gradient Norm Minimization of Nesterov Acceleration: $o(1/k^3)$
Chen, Shuo, Shi, Bin, Yuan, Ya-xiang
In the history of first-order algorithms, Nesterov's accelerated gradient descent (NAG) is one of the milestones. However, the cause of the acceleration remained a mystery for a long time; the role of gradient correction was not revealed until the high-resolution differential equation framework was proposed in [Shi et al., 2021]. In this paper, we continue to investigate the acceleration phenomenon. First, we provide a significantly simplified proof based on precise observation and a tighter inequality for $L$-smooth functions. Then, a new implicit-velocity high-resolution differential equation framework, as well as the corresponding implicit-velocity version of phase-space representation and Lyapunov function, is proposed to investigate the convergence behavior of the iterative sequence $\{x_k\}_{k=0}^{\infty}$ of NAG. Furthermore, from two kinds of phase-space representations, we find that the role played by gradient correction is equivalent to that played by velocity included implicitly in the gradient, where the only difference comes from the iterative sequence $\{y_{k}\}_{k=0}^{\infty}$ being replaced by $\{x_k\}_{k=0}^{\infty}$. Finally, for the open question of whether the gradient norm minimization of NAG has a faster rate $o(1/k^3)$, we give a positive answer together with its proof. Meanwhile, a faster rate of objective value minimization $o(1/k^2)$ is shown for the case $r > 2$.
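The two sequences $\{x_k\}$ and $\{y_k\}$ referred to in this abstract can be made concrete with a short sketch of NAG for a smooth convex function. The momentum coefficient $(k-1)/(k+r)$ used below is an assumption about the parameterization behind the abstract's $r$ (with $r = 2$ giving the standard scheme), and the function name `nag` and the recorded quantity (the gradient norm at $y_k$) are illustrative choices rather than the paper's definitions.

```python
import numpy as np

def nag(grad, x0, step, iters, r=2.0):
    """Nesterov's accelerated gradient descent for a smooth convex objective.

    grad : callable returning the gradient at a point
    step : step size s, typically at most 1/L for an L-smooth function
    r    : momentum parameter (assumed parameterization; r = 2 is the standard choice)
    """
    x = np.asarray(x0, dtype=float).copy()
    y = x.copy()
    grad_norms = []
    for k in range(1, iters + 1):
        g = grad(y)
        grad_norms.append(float(np.linalg.norm(g)))   # gradient norm at y_k
        x_next = y - step * g                          # gradient step from the extrapolated point
        y = x_next + (k - 1) / (k + r) * (x_next - x)  # momentum extrapolation
        x = x_next
    return x, grad_norms
```

A simple way to try it is on a diagonal quadratic, for example `nag(lambda v: Q * v, np.ones(3), step=0.01, iters=500)` with `Q = np.array([1.0, 10.0, 100.0])`, and then to inspect how the recorded gradient norms decay with the iteration count.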
Sobolev Acceleration and Statistical Optimality for Learning Elliptic Equations via Gradient Descent
Lu, Yiping, Blanchet, Jose, Ying, Lexing
In this paper, we study the statistical limits in terms of Sobolev norms of gradient descent for solving inverse problems from randomly sampled noisy observations using a general class of objective functions. Our class of objective functions includes Sobolev training for kernel regression, Deep Ritz Methods (DRM), and Physics Informed Neural Networks (PINN) for solving elliptic partial differential equations (PDEs) as special cases. We consider a potentially infinite-dimensional parameterization of our model using a suitable Reproducing Kernel Hilbert Space and a continuous parameterization of problem hardness through the definition of kernel integral operators. We prove that gradient descent over this objective function can also achieve statistical optimality and that the optimal number of passes over the data increases with sample size. Based on our theory, we explain an implicit acceleration of using a Sobolev norm as the objective function for training, inferring that the optimal number of epochs of DRM becomes larger than that of PINN when both the data size and the hardness of tasks increase, although both DRM and PINN can achieve statistical optimality.