Goto

Collaborating Authors

 Lin, Junhong


Generalization Properties of Doubly Stochastic Learning Algorithms

arXiv.org Machine Learning

Doubly stochastic learning algorithms are scalable kernel methods that perform very well in practice. However, their generalization properties are not well understood and their analysis is challenging since the corresponding learning sequence may not be in the hypothesis space induced by the kernel. In this paper, we provide an in-depth theoretical analysis for different variants of doubly stochastic learning algorithms within the setting of nonparametric regression in a reproducing kernel Hilbert space and considering the square loss. Particularly, we derive convergence results on the generalization error for the studied algorithms either with or without an explicit penalty term. To the best of our knowledge, the derived results for the unregularized variants are the first of this kind, while the results for the regularized variants improve those in the literature. The novelties in our proof are a sample error bound that requires controlling the trace norm of a cumulative operator, and a refined analysis of bounding initial error.


Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral-Regularization Algorithms

arXiv.org Machine Learning

We study generalization properties of distributed algorithms in the setting of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We first investigate distributed stochastic gradient methods (SGM), with mini-batches and multi-passes over the data. We show that optimal generalization error bounds can be retained for distributed SGM provided that the partition level is not too large. We then extend our results to spectral-regularization algorithms (SRA), including kernel ridge regression (KRR), kernel principal component analysis, and gradient methods. Our results are superior to the state-of-the-art theory. Particularly, our results show that distributed SGM has a smaller theoretical computational complexity, compared with distributed KRR and classic SGM. Moreover, even for non-distributed SRA, they provide the first optimal, capacity-dependent convergence rates, considering the case that the regression function may not be in the RKHS.


Optimal Rates for Spectral-regularized Algorithms with Least-Squares Regression over Hilbert Spaces

arXiv.org Machine Learning

In this paper, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral-regularized algorithms, including ridge regression, principal component analysis, and gradient methods. We prove optimal, high-probability convergence results in terms of variants of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.


Optimal Rates for Learning with Nystr\"om Stochastic Gradient Methods

arXiv.org Machine Learning

In the setting of nonparametric regression, we propose and study a combination of stochastic gradient methods with Nystr\"om subsampling, allowing multiple passes over the data and mini-batches. Generalization error bounds for the studied algorithm are provided. Particularly, optimal learning rates are derived considering different possible choices of the step-size, the mini-batch size, the number of iterations/passes, and the subsampling level. In comparison with state-of-the-art algorithms such as the classic stochastic gradient methods and kernel ridge regression with Nystr\"om, the studied algorithm has advantages on the computational complexity, while achieving the same optimal learning rates. Moreover, our results indicate that using mini-batches can reduce the total computational cost while achieving the same optimal statistical results.


Optimal Rates for Multi-pass Stochastic Gradient Methods

arXiv.org Machine Learning

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. We study how regularization properties are controlled by the step-size, the number of passes and the mini-batch size. In particular, we consider the square loss and show that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases. As a byproduct, we derive optimal convergence results for batch gradient methods (even in the non-attainable cases).


Optimal Learning for Multi-pass Stochastic Gradient Methods

Neural Information Processing Systems

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. In particular, we consider the square loss and show that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases.


Generalization Properties and Implicit Regularization for Multiple Passes SGM

arXiv.org Machine Learning

We study the generalization properties of stochastic gradient methods for learning with convex loss functions and linearly parameterized functions. We show that, in the absence of penalizations or constraints, the stability and approximation properties of the algorithm can be controlled by tuning either the step-size or the number of passes over the data. In this view, these parameters can be seen to control a form of implicit regularization. Numerical results complement the theoretical findings.


Iterative Regularization for Learning with Convex Loss Functions

arXiv.org Machine Learning

We consider the problem of supervised learning with convex loss functions and propose a new form of iterative regularization based on the subgradient method. Unlike other regularization approaches, in iterative regularization no constraint or penalization is considered, and generalization is achieved by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of reproducing kernel Hilbert spaces, and prove finite sample bounds on the excess risk under general regularity conditions. Our study provides a new class of efficient regularized learning algorithms and gives insights on the interplay between statistics and optimization in machine learning.