Mathematical & Statistical Methods
Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization
Xu, Pan, Chen, Jinghui, Zou, Difan, Gu, Quanquan
We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the \textit{almost minimizer}\footnote{Following \citet{raginsky2017non}, an almost minimizer is defined to be a point which is within the ball of the global minimizer with radius $O(d\log(\beta+1)/\beta)$, where $d$ is the problem dimension and $\beta$ is the inverse temperature parameter.} within $\tilde O\big(nd/(\lambda\epsilon) \big)$\footnote{$\tilde O(\cdot)$ notation hides polynomials of logarithmic terms and constants.} and $\tilde O\big(d^7/(\lambda^5\epsilon^5) \big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\lambda$ is the spectral gap of the Markov chain generated by GLD. Both results improve upon the best known gradient complexity\footnote{Gradient complexity is defined as the total number of stochastic gradient evaluations of an algorithm, which is the number of stochastic gradients calculated per iteration times the total number of iterations.} results \citep{raginsky2017non}. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (VR-SGLD) to the almost minimizer within $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.
Stochastic Trust Region Inexact Newton Method for Large-scale Machine Learning
Chauhan, Vinod Kumar, Sharma, Anuj, Dahiya, Kalpana
Nowadays stochastic approximation methods are one of the major research direction to deal with the large-scale machine learning problems. From stochastic first order methods, now the focus is shifting to stochastic second order methods due to their faster convergence. In this paper, we have proposed a novel Stochastic Trust RegiOn inexact Newton method, called as STRON, which uses conjugate gradient (CG) to solve trust region subproblem. The method uses progressive subsampling in the calculation of gradient and Hessian values to take the advantage of both stochastic approximation and full batch regimes. We have extended STRON using existing variance reduction techniques to deal with the noisy gradients, and using preconditioned conjugate gradient (PCG) as subproblem solver. We further extend STRON to solve SVM. Finally, the theoretical results prove superlinear convergence for STRON and the empirical results prove the efficacy of the proposed method against existing methods with bench marked datasets.
Inference and Sampling of $K_{33}$-free Ising Models
Likhosherstov, Valerii, Maximov, Yury, Chertkov, Michael
We call an Ising model tractable when it is possible to compute its partition function value (statistical inference) in polynomial time. The tractability also implies an ability to sample configurations of this model in polynomial time. The notion of tractability extends the basic case of planar zero-field Ising models. Our starting point is to describe algorithms for the basic case computing partition function and sampling efficiently. To derive the algorithms, we use an equivalent linear transition to perfect matching counting and sampling on an expanded dual graph. Then, we extend our tractable inference and sampling algorithms to models, whose triconnected components are either planar or graphs of $O(1)$ size. In particular, it results in a polynomial-time inference and sampling algorithms for $K_{33}$ (minor) free topologies of zero-field Ising models - a generalization of planar graphs with a potentially unbounded genus.
Closed-form Inference and Prediction in Gaussian Process State-Space Models
Ialongo, Alessandro Davide, van der Wilk, Mark, Rasmussen, Carl Edward
We examine an analytic variational inference scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic model for system identification and time-series modelling. Our approach performs variational inference over both the system states and the transition function. We exploit Markov structure in the true posterior, as well as an inducing point approximation to achieve linear time complexity in the length of the time series. Contrary to previous approaches, no Monte Carlo sampling is required: inference is cast as a deterministic optimisation problem. In a number of experiments, we demonstrate the ability to model non-linear dynamics in the presence of both process and observation noise as well as to impute missing information (e.g. velocities from raw positions through time), to de-noise, and to estimate the underlying dimensionality of the system. Finally, we also introduce a closed-form method for multi-step prediction, and a novel criterion for assessing the quality of our approximate posterior.
Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling
Luo, Rui, Zhang, Qiang, Liu, Yuanyuan
We propose a new sampler that integrates the protocol of parallel tempering with the Nos\'e-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradient. It potentially facilitates deep Bayesian learning on large datasets where complex multimodal posteriors and mini-batch gradient are encountered.
On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case
Barkhagen, M., Chau, N. H., Moulines, ร., Rรกsonyi, M., Sabanis, S., Zhang, Y.
Stochastic Gradient Langevin Dynamics (SGLD) is a combination of a Robbins-Monro type algorithm with Langevin dynamics in order to perform data-driven stochastic optimization. In this paper, the SGLD method with fixed step size $\lambda$ is considered in order to sample from a logconcave target distribution $\pi$, known up to a normalisation factor. We assume that unbiased estimates of the gradient from possibly dependent observations are available. It is shown that, for all $\varepsilon>0$, the Wasserstein-$2$ distance of the $n$th iterate of the SGLD algorithm from $\pi$ is dominated by $c_1(\varepsilon)[\lambda^{1/2 - \varepsilon}+e^{-a\lambda n}]$ with appropriate constants $c_1(\varepsilon), a>0$.
Efficient random graph matching via degree profiles
Ding, Jian, Ma, Zongming, Wu, Yihong, Xu, Jiaming
Random graph matching refers to recovering the underlying vertex correspondence between two random graphs with correlated edges; a prominent example is when the two random graphs are given by Erd\H{o}s-R\'{e}nyi graphs $G(n,\frac{d}{n})$. This can be viewed as an average-case and noisy version of the graph isomorphism problem. Under this model, the maximum likelihood estimator is equivalent to solving the intractable quadratic assignment problem. This work develops an $\tilde{O}(n d^2+n^2)$-time algorithm which perfectly recovers the true vertex correspondence with high probability, provided that the average degree is at least $d = \Omega(\log^2 n)$ and the two graphs differ by at most $\delta = O( \log^{-2}(n) )$ fraction of edges. For dense graphs and sparse graphs, this can be improved to $\delta = O( \log^{-2/3}(n) )$ and $\delta = O( \log^{-2}(d) )$ respectively, both in polynomial time. The methodology is based on appropriately chosen distance statistics of the degree profiles (empirical distribution of the degrees of neighbors). Before this work, the best known result achieves $\delta=O(1)$ and $n^{o(1)} \leq d \leq n^c$ for some constant $c$ with an $n^{O(\log n)}$-time algorithm \cite{barak2018nearly} and $\delta=\tilde O((d/n)^4)$ and $d = \tilde{\Omega}(n^{4/5})$ with a polynomial-time algorithm \cite{dai2018performance}.
Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations
Li, Qianxiao, Tai, Cheng, E, Weinan
We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as an weak approximation, which leads to a number of precise and useful results on the approximations of stochastic gradient descent (SGD), momentum SGD and stochastic Nesterov's accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting. Keywords: stochastic gradient algorithms, modified equations, stochastic differential equations, momentum, Nesterov's accelerated gradient
Signature moments to characterize laws of stochastic processes
Chevyrev, Ilya, Oberhauser, Harald
The normalized sequence of moments characterizes the law of any finite-dimensional random variable. We prove an analogous result for path-valued random variables, that is stochastic processes, by using the normalized sequence of signature moments. We use this to define a metric for laws of stochastic processes. This metric can be efficiently estimated from finite samples, even if the stochastic processes themselves evolve in high-dimensional state spaces. As an application, we provide a non-parametric two-sample hypothesis test for laws of stochastic processes.
Visions of a generalized probability theory
In this Book we argue that the fruitful interaction of computer vision and belief calculus is capable of stimulating significant advances in both fields. From a methodological point of view, novel theoretical results concerning the geometric and algebraic properties of belief functions as mathematical objects are illustrated and discussed in Part II, with a focus on both a perspective 'geometric approach' to uncertainty and an algebraic solution to the issue of conflicting evidence. In Part III we show how these theoretical developments arise from important computer vision problems (such as articulated object tracking, data association and object pose estimation) to which, in turn, the evidential formalism is able to provide interesting new solutions. Finally, some initial steps towards a generalization of the notion of total probability to belief functions are taken, in the perspective of endowing the theory of evidence with a complete battery of estimation and inference tools to the benefit of all scientists and practitioners.