Goto

Collaborating Authors

 Mathematical & Statistical Methods


Exploiting Local Convergence of Quasi-Newton Methods Globally: Adaptive Sample Size Approach

Neural Information Processing Systems

In this paper, we study the application of quasi-Newton methods for solving empirical risk minimization (ERM) problems defined over a large dataset. Traditional deterministic and stochastic quasi-Newton methods can be executed to solve such problems; however, it is known that their global convergence rate may not be better than first-order methods, and their local superlinear convergence only appears towards the end of the learning process. In this paper, we use an adaptive sample size scheme that exploits the superlinear convergence of quasi-Newton methods globally and throughout the entire learning process. The main idea of the proposed adaptive sample size algorithms is to start with a small subset of data points and solve their corresponding ERM problem within its statistical accuracy, and then enlarge the sample size geometrically and use the optimal solution of the problem corresponding to the smaller set as an initial point for solving the subsequent ERM problem with more samples. We show that if the initial sample size is sufficiently large and we use quasi-Newton methods to solve each subproblem, the subproblems can be solved superlinearly fast (after at most three iterations), as we guarantee that the iterates always stay within a neighborhood that quasi-Newton methods converge superlinearly.


Random Quadratic Forms with Dependence: Applications to Restricted Isometry and Beyond

Neural Information Processing Systems

Several important families of computational and statistical results in machine learning and randomized algorithms rely on uniform bounds on quadratic forms of random vectors or matrices. Such results include the Johnson-Lindenstrauss (J-L) Lemma, the Restricted Isometry Property (RIP), randomized sketching algorithms, and approximate linear algebra. The existing results critically depend on statistical independence, e.g., independent entries for random vectors, independent rows for random matrices, etc., which prevent their usage in dependent or adaptive modeling settings. In this paper, we show that such independence is in fact not needed for such results which continue to hold under fairly general dependence structures. In particular, we present uniform bounds on random quadratic forms of stochastic processes which are conditionally independent and sub-Gaussian given another (latent) process.


SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

Stein Variational Gradient Descent (SVGD), a popular sampling algorithm, is often described as the kernelized gradient flow for the Kullback-Leibler divergence in the geometry of optimal transport. We introduce a new perspective on SVGD that instead views SVGD as the kernelized gradient flow of the chi-squared divergence. Motivated by this perspective, we provide a convergence analysis of the chi-squared gradient flow. We also show that our new perspective provides better guidelines for choosing effective kernels for SVGD.


Neural Lyapunov Control for Discrete-Time Systems

Neural Information Processing Systems

While ensuring stability for linear systems is well understood, it remains a major challenge for nonlinear systems. A general approach in such cases is to compute a combination of a Lyapunov function and an associated control policy. However, finding Lyapunov functions for general nonlinear systems is a challenging task. To address this challenge, several methods have been proposed that represent Lyapunov functions using neural networks. However, such approaches either focus on continuous-time systems, or highly restricted classes of nonlinear dynamics.


Learning Erdos-Renyi Random Graphs via Edge Detecting Queries

Neural Information Processing Systems

In this paper, we consider the problem of learning an unknown graph via queries on groups of nodes, with the result indicating whether or not at least one edge is present among those nodes. While learning arbitrary graphs with n nodes and k edges is known to be hard in the sense of requiring \Omega( \min\{ k 2 \log n, n 2\}) tests (even when a small probability of error is allowed), we show that learning an Erd\H{o}s-R\'enyi random graph with an average of \kbar edges is much easier; namely, one can attain asymptotically vanishing error probability with only O(\kbar \log n) tests. We establish such bounds for a variety of algorithms inspired by the group testing problem, with explicit constant factors indicating a near-optimal number of tests, and in some cases asymptotic optimality including constant factors. In addition, we present an alternative design that permits a near-optimal sublinear decoding time of O(\kbar \log 2 \kbar \kbar \log n) .


Stochastic Cubic Regularization for Fast Nonconvex Optimization

Neural Information Processing Systems

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only \mathcal{\tilde{O}}(\epsilon {-3.5}) stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the \mathcal{\tilde{O}}(\epsilon {-4}) rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.


Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss

Neural Information Processing Systems

Uncertainty sampling, a popular active learning algorithm, is used to reduce the amount of data required to learn a classifier, but it has been observed in practice to converge to different parameters depending on the initialization and sometimes to even better parameters than standard training on all the data. In this work, we give a theoretical explanation of this phenomenon, showing that uncertainty sampling on a convex (e.g., logistic) loss can be interpreted as performing a preconditioned stochastic gradient step on the population zero-one loss. Experiments on synthetic and real datasets support this connection.


Reviews: A Stein variational Newton method

Neural Information Processing Systems

Summary: SVGD iteratively moves a set of particles toward the target by choosing a perturbative direction to maximumly decrease the KL divergence with the target distribution in RKHS. The paper proposes to add second-order information into SVGD updates, preliminary empirical results show that their method converges faster in few cases. The paper is well written, and the proofs seem correct. An important reason in using second-order information is the hope to achieve a faster convergence rate. My major concern is a lack of theoretical analysis of convergence rate in this paper: 1) An appealing property of SVGD is that the optimal decreasing rate equals to Stein discrepancy D_F(q p), where F is a function set that includes all possible velocity fields.


Reviews: Efficient Stochastic Gradient Hard Thresholding

Neural Information Processing Systems

The article analyses convergence in hard-thresholding algorithms and proposes an accelerated stochastic hybrid hard thresholding method that displays better convergence with respect to the compared methods. The article is dense but relatively fine to follow. Theoretical development seems to be complete and accurate, though I admit I have not throughly followed the full derivation. Experimental section is in accordance with the theoretical claims and is more than sufficient. Just for the sake of reproducibility of the results an exhaustive pseudocode or repository should be made available as a companion to the article to further strength the autor's points.


Reviews: A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

Neural Information Processing Systems

This paper focuses on the optimization problem min f(x) h(x), where f is of a finite sum structure (with n functions in the sum), with nonconvex but smooth components, and h is a convex but possibly nonsmooth function. So, this is a nonconvex finite sum problem with a convex regularizer. Function h is treated using a prox step. The authors propose a small modification to ProxSVRG (called ProxSVRG), and prove that this small modification has surprisingly interesting consequences. The modification consists in replacing the full gradient computation in the outer loop of ProxSVRG by an approximation thereof through subsampling/minibatch (batch size B).