Goto

Collaborating Authors

 improved convergence rate


Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization

Neural Information Processing Systems

The stochastic gradient Langevin Dynamics is one of the most fundamental algorithms to solve sampling problems and non-convex optimization appearing in several machine learning applications. Especially, its variance reduced versions have nowadays gained particular attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and Log-Sobolev inequality which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to $\sqrt{n}$, the gradient complexity to achieve an $\epsilon$-precision is $\tilde{O}((n+dn^{1/2}\epsilon^{-1})\gamma^2 L^2\alpha^{-2})$, which is an improvement from any previous analyses. We also show some essential applications of our result to non-convex optimization.


Supplementary Material: Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate A Related works Györfi (1981) is the first work that proves the convergence rate O (n

Neural Information Processing Systems

In this section, we describe Nadaraya-Watson (NW) classifier, Local Polynomial (LP) classifier and their convergence rates (Audibert & Tsybakov, 2007). Proof of Corollary 2. Proposition 6 immediately proves the assertion. We basically follow the proof of Chaudhuri & Dasgupta (2014) Theorem 4(b). In Section G.1, we first define symbols In Section G.2, we describe the sketch of the proof and main differences between our proof and that of Section G.3 shows the main body of the Proof, by utilizing several Lemmas listed in A minimum radius whose measure of the ball is larger than t > 0, i.e., r Chaudhuri & Dasgupta (2014) Lemma 21) Then, the assertion is proved. See the following Section G.4 for Lemma 1-7 used in this proof.


Review for NeurIPS paper: Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate

Neural Information Processing Systems

This is previously done by leveraging assumptions over the distribution (its smoothness (\beta-Holder condition, \gamma-neighbour average smoothness) and "margin" (\alpha-margin condition, namely, upper bounding the probability of instances whose expected label expectation is 1/2)) along with using weighted k-NN, with several methods for defining the weights and their resulting convergence rate. The paper proposes a new method, called MS-k-NN, of estimating several unweighted k-NN estimators, for different values of k (\nu(k) for simplicity of notation in the review). For each k, a radius r is associated by taking the distance from the query to its k-closest neighbour. Then, pairs (r(k), \nu(k)) are obtained, and parameters b are obtained by linear regression in order to estimate \nu(k) by a polynomial in r(k). It is proven that this method obtains the optimal convergence rates obtained by more cumbersome methods of weighted k-NN. It is shown that experimentally, over a few datasets, the performance of MS-k-NN is similar to that of the weighted k-NN.


Review for NeurIPS paper: Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate

Neural Information Processing Systems

The paper presents a new nonparametric learning method, which seems to combine certain elements of k-nearest neighbors with elements of local regression estimation. It recovers the optimal rates for classification with smooth regression functions and Tsybakov noise, previously established for a local polynomial regression method, but uses a predictor representation involving far fewer parameters, as in a simple weighted k-NN predictor. The reviewers favor accepting the paper. However, they have some reservations, as they would prefer the paper be presented differently, with more space dedicated to presenting the new techniques, and with more investigation into the strengths of this particular method compared to the well-known standard techniques.


Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate

Neural Information Processing Systems

The weights and the parameter k \in \mathbb{N} regulate its bias-variance trade-off, and the trade-off implicitly affects the convergence rate of the excess risk for the k -NN classifier; several existing studies considered selecting optimal k and weights to obtain faster convergence rate. Whereas k -NN with non-negative weights has been developed widely, it was also proved that negative weights are essential for eradicating the bias terms and attaining optimal convergence rate. In this paper, we propose a novel multiscale k -NN (MS- k -NN), that extrapolates unweighted k -NN estimators from several k \ge 1 values to k 0, thus giving an imaginary 0-NN estimator. Our method implicitly computes optimal real-valued weights that are adaptive to the query and its neighbour points. We theoretically prove that the MS- k -NN attains the improved rate, which coincides with the existing optimal rate under some conditions.


Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization

Neural Information Processing Systems

The stochastic gradient Langevin Dynamics is one of the most fundamental algorithms to solve sampling problems and non-convex optimization appearing in several machine learning applications. Especially, its variance reduced versions have nowadays gained particular attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and Log-Sobolev inequality which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to \sqrt{n}, the gradient complexity to achieve an \epsilon -precision is \tilde{O}((n dn {1/2}\epsilon {-1})\gamma 2 L 2\alpha {-2}), which is an improvement from any previous analyses.