
Asymptotic Convergence



Neural Information Processing Systems

We also establish new convergence complexities for reaching an approximate KKT solution when the objective can be smooth or nonsmooth, deterministic or stochastic, and convex or nonconvex, with complexity on a par with that of gradient descent for unconstrained optimization problems in the respective cases. To the best of our knowledge, this is the first study of first-order methods with complexity guarantees for nonconvex sparse-constrained problems.
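For intuition, first-order optimization under a hard sparsity constraint can be sketched in a few lines as projected gradient descent with a hard-thresholding projection (iterative hard thresholding). This is a generic illustration on assumed least-squares data, not necessarily the method analyzed in the paper:

```python
import numpy as np

def hard_threshold(x, k):
    """Projection onto {x : ||x||_0 <= k}: keep the k largest magnitudes."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht(grad, x0, k, step, iters=200):
    """Projected gradient descent under a sparsity constraint."""
    x = hard_threshold(x0, k)
    for _ in range(iters):
        x = hard_threshold(x - step * grad(x), k)
    return x

# Least-squares objective with a 2-sparse ground truth (illustrative data);
# with enough well-conditioned measurements the true support is typically found.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.0, -2.0]
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L for f(x) = 0.5||Ax - b||^2
x_hat = iht(lambda x: A.T @ (A @ x - b), np.zeros(10), k=2, step=step)
```

With step size 1/L, each iteration is monotone in the objective, which is the kind of descent property the complexity analysis builds on.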



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

The scheme finds a target point for each block in a chosen subset of blocks, in parallel, by minimizing the sum of a strongly convex approximation to the smooth part on that block (with matching gradients) and the non-smooth part. Each block in the subset is then updated (in parallel) to a convex combination of its previous value and its target point. A parallel proximal gradient scheme can be obtained as a special case, though using a convex combination of the iterates yields a slightly different scheme than in previous work. The suggested algorithm is very similar to [9], except that [9] chose the subset with a greedy scheme (which can be expensive), whereas this submission explores both a randomized and a cyclic scheme. For these, the authors prove asymptotic convergence of the algorithm to a stationary point under standard Lipschitz-gradient conditions.
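A rough sketch of this kind of randomized parallel block scheme, with assumed lasso data, soft-thresholding as the nonsmooth prox, a quadratic model as the strongly convex approximation, and illustrative parameters (not the submission's exact algorithm):

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1, the nonsmooth part in this sketch."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def block_prox_iteration(x, grad_f, blocks, step, lam, theta, rng):
    """Pick a random subset of blocks; for each, compute a proximal target
    from a quadratic model of the smooth part plus the nonsmooth part, then
    move to a convex combination of the old block value and its target."""
    g = grad_f(x)
    chosen = rng.choice(len(blocks), size=len(blocks) // 2, replace=False)
    x_new = x.copy()
    for i in chosen:
        b = blocks[i]
        target = soft_threshold(x[b] - step * g[b], step * lam)
        x_new[b] = (1 - theta) * x[b] + theta * target
    return x_new

# Lasso instance: f(x) = 0.5 ||Ax - b||^2, nonsmooth part lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
lam, step = 0.1, 1.0 / np.linalg.norm(A, 2) ** 2
blocks = [np.arange(i, i + 2) for i in range(0, 8, 2)]   # 4 blocks of size 2
obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
x = np.zeros(8)
obj0 = obj(x)
for _ in range(300):
    x = block_prox_iteration(x, lambda z: A.T @ (A @ z - b),
                             blocks, step, lam, theta=0.8, rng=rng)
obj_final = obj(x)
```

With step size 1/L and theta in (0, 1], each iteration decreases the composite objective, which is the descent property underlying the asymptotic convergence proof.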


Understanding the Role of Momentum in Stochastic Gradient Methods

Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao

Neural Information Processing Systems

The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks.
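The QHM update interpolates between a plain gradient step (nu = 0) and a step along the exponential moving average of gradients alone (nu = 1); a minimal sketch on a toy quadratic, with illustrative parameter values:

```python
import numpy as np

def qhm(grad, x0, lr=0.05, beta=0.9, nu=0.7, iters=500):
    """Quasi-hyperbolic momentum: step along a nu-weighted mix of the raw
    gradient and the exponential moving average (EMA) of past gradients."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(iters):
        g = grad(x)
        m = beta * m + (1 - beta) * g          # gradient EMA
        x = x - lr * ((1 - nu) * g + nu * m)   # quasi-hyperbolic step
    return x

# Minimize the ill-conditioned quadratic f(x) = 0.5 * x^T diag(1, 10) x.
D = np.array([1.0, 10.0])
x_star = qhm(lambda x: D * x, np.array([5.0, 5.0]))
```

Setting nu = 0 recovers plain (stochastic) gradient descent, which is one reason QHM is a convenient single parameterization for comparing momentum variants.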




Supplementary Materials A Proof of Theorem 2: Asymptotic Convergence of Robust Q-Learning

Neural Information Processing Systems

Equation (15) is the expectation of the estimated update in line 5 of Algorithm 1.

A.1 The robust Bellman operator is a contraction. It was shown in [Iyengar, 2005, Roy et al., 2017] that the robust Bellman operator is a contraction; here, for completeness, we include the proof for our R-contamination uncertainty set.

Appendix B develops the finite-time analysis of Algorithm 1. B.1 introduces notation; the concentration bounds follow from the Bernstein inequality ([Li et al., 2020]) and culminate in Lemma 4.

Appendix C proves Theorem 4.

Appendix D develops the finite-time analysis of the robust TDC algorithm. For convenience of the proof, a projection step is added to the algorithm, adapting the approach of [Kaledin et al., 2020]. D.1 (Lipschitz smoothness) first shows that J(θ) is Lipschitz.
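A minimal numerical sketch of the contraction claim, assuming the usual R-contamination support function (worst case mixes the nominal next-state expectation with the worst state value); the MDP data here is randomly generated for illustration only:

```python
import numpy as np

def robust_bellman(Q, P, r, gamma, rho):
    """Robust Bellman operator for Q-learning under the R-contamination
    uncertainty set {(1 - rho) * p + rho * q : q any distribution}."""
    V = Q.max(axis=1)                          # V(s) = max_a Q(s, a)
    nominal = np.einsum('sat,t->sa', P, V)     # E_{s' ~ p(.|s,a)} V(s')
    return r + gamma * ((1 - rho) * nominal + rho * V.min())

# Check the sup-norm contraction on a random MDP.
rng = np.random.default_rng(0)
S, A, gamma, rho = 5, 3, 0.9, 0.2
P = rng.dirichlet(np.ones(S), size=(S, A))     # nominal kernel, shape (S, A, S)
r = rng.uniform(size=(S, A))
Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))
lhs = np.abs(robust_bellman(Q1, P, r, gamma, rho)
             - robust_bellman(Q2, P, r, gamma, rho)).max()
rhs = gamma * np.abs(Q1 - Q2).max()
```

Since both the nominal expectation and the min are 1-Lipschitz in the sup norm, the operator contracts with modulus gamma, which is what the check above verifies numerically.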


Review for NeurIPS paper: Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

Neural Information Processing Systems

Weaknesses: My main concern about the paper is whether the proposed algorithm is actually implementable, due to the specific expression of the (constant) learning rate. I have two concerns: 1. The learning rate depends on t_{mix} in Theorem 1 and on the universal constant c_1 in both Theorem 1 and Theorem 2. How can we compute/approximate t_{mix} in advance? If we cannot, is it sufficient to employ a lower bound on t_{mix}? 2. Looking at the proofs, c_1 is a function of the constant c (Equation 55), which in turn derives from Bernstein's inequality (Equation 81) and subsequently \tilde{c} (Equation 84), but its value is never explicitly computed. I am aware that in [33], too, the learning-rate schedule (which is not constant) depends on \mu_{min} and t_{mix}, but I think the authors should elaborate more on this and explain how to deal with it in practice, if possible.
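For context on the t_{mix} question: when the transition matrix is known, the mixing time can be computed directly from total-variation distances to the stationary distribution; the difficulty raised above is precisely that in RL the kernel is unknown, so only bounds or estimates are available. A sketch (not part of the reviewed paper):

```python
import numpy as np

def mixing_time(P, eps=0.25):
    """Smallest t with max_s TV(P^t(s, .), pi) <= eps, for a known
    row-stochastic transition matrix P."""
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    Pt = np.eye(P.shape[0])
    for t in range(1, 10_000):
        Pt = Pt @ P
        if 0.5 * np.abs(Pt - pi).sum(axis=1).max() <= eps:
            return t
    return None

# Two-state chain with stationary distribution (2/3, 1/3).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
t_mix = mixing_time(P)
```

Shrinking eps tightens the requirement and can only increase the reported mixing time, mirroring the standard t_{mix}(eps) definition.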


A Subsampling Based Neural Network for Spatial Data

Thakur, Debjoy

arXiv.org Machine Learning

The application of deep neural networks to geospatial data has become a prominent research problem. A significant body of statistical research already exists, such as generalized least squares optimization incorporating the spatial variance-covariance matrix, placing basis functions in the input nodes of the neural network, and so on. However, for lattice data, there is no available literature on the asymptotic analysis of neural networks for spatial regression. This article proposes a consistent localized two-layer deep neural network-based regression for spatial data. We prove the consistency of this deep neural network for bounded and unbounded spatial domains under a fixed sampling design of mixed-increasing spatial regions. We prove that its asymptotic convergence rate is faster than that of \cite{zhan2024neural}'s neural network, and that it is an improved generalization of \cite{shen2023asymptotic}'s neural network structure. We empirically observe that the rate of convergence of discrepancy measures between the empirical probability distributions of the observed and predicted data becomes faster for a less smooth spatial surface. We apply our asymptotic analysis of deep neural networks to the estimation of the monthly average temperature of major cities in the USA from satellite images. This application is an effective showcase of non-linear spatial regression. We also demonstrate our methodology on simulated lattice data in various scenarios.
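As a toy stand-in for two-layer regression on lattice data, a generic random-feature network fit on a synthetic spatial surface (this is not the paper's localized architecture; all data and settings here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# A 20x20 lattice of coordinates with a smooth synthetic response surface
# (a stand-in for, e.g., gridded temperature values).
gx, gy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
X = np.column_stack([gx.ravel(), gy.ravel()])
y = np.sin(2 * np.pi * X[:, 0]) * np.cos(2 * np.pi * X[:, 1])

# Two-layer network: a random ReLU hidden layer plus a linear output
# layer fitted by least squares (random-feature regression).
H = 64
W1 = rng.standard_normal((2, H))
b1 = rng.uniform(-1.0, 1.0, H)
Phi = np.column_stack([np.maximum(X @ W1 + b1, 0.0), np.ones(len(X))])
w2, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ w2 - y) ** 2)
```

Rougher surfaces need more hidden units for the same fit, which is the kind of smoothness dependence the abstract's convergence-rate observation is about.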


Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis

Jin, Ruinan, Wang, Xiaoyu, Wang, Baoxiang

arXiv.org Machine Learning

Adaptive optimizers have emerged as powerful tools in deep learning, dynamically adjusting the learning rate based on iterative gradients. These adaptive methods have seen significant success on various deep learning tasks, often outperforming stochastic gradient descent (SGD). However, although AdaGrad is a cornerstone adaptive optimizer, its theoretical analysis is inadequate with respect to asymptotic convergence and non-asymptotic convergence rates in non-convex optimization. This study aims to provide a comprehensive analysis and a complete picture of AdaGrad. We first introduce a novel stopping-time technique from probability theory to establish stability of the norm version of AdaGrad under milder conditions. We then derive two forms of asymptotic convergence: almost sure and mean-square. Furthermore, under mild assumptions, we establish a near-optimal non-asymptotic convergence rate measured by the average squared gradient in expectation, a metric that is rarely explored and stronger than existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
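The norm version of AdaGrad analyzed here replaces AdaGrad's per-coordinate accumulators with a single scalar accumulator of squared gradient norms; a minimal sketch on a deterministic toy quadratic, with illustrative parameters:

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, eps=1e-8, iters=500):
    """AdaGrad-Norm: one scalar accumulator of squared gradient norms
    scales the step, instead of per-coordinate accumulators."""
    x = np.asarray(x0, dtype=float)
    acc = 0.0
    for _ in range(iters):
        g = grad(x)
        acc += float(g @ g)
        x = x - (eta / (np.sqrt(acc) + eps)) * g
    return x

# Toy problem: f(x) = ||x||^2, gradient 2x.
x_star = adagrad_norm(lambda x: 2.0 * x, np.array([3.0, -4.0]))
```

Because the effective step eta / sqrt(acc) shrinks automatically as gradients accumulate, no Lipschitz constant needs to be known in advance, which is part of why stability is the delicate point in the analysis.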