Ying, Yiming
Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent
Lei, Yunwen, Ying, Yiming
Recently there are a considerable amount of work devoted to the study of the algorithmic stability and generalization for stochastic gradient descent (SGD). However, the existing stability analysis requires to impose restrictive assumptions on the boundedness of gradients, strong smoothness and convexity of loss functions. In this paper, we provide a fine-grained analysis of stability and generalization for SGD by substantially relaxing these assumptions. Firstly, we establish stability and generalization for SGD by removing the existing bounded gradient assumptions. The key idea is the introduction of a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates. This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting using stability approach. Secondly, the smoothness assumption is relaxed by considering loss functions with Holder continuous (sub)gradients for which we show that optimal bounds are still achieved by balancing computation and stability. To our best knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions. Finally, we study learning problems with (strongly) convex objectives but non-convex loss functions.
Stochastic AUC Maximization with Deep Neural Networks
Liu, Mingrui, Yuan, Zhuoning, Ying, Yiming, Yang, Tianbao
Stochastic AUC maximization has garnered an increasing interest due to better fit to imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts its predictive power when dealing with extremely complex data. In this paper, we consider stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle point reformulation of a surrogated loss of AUC, the problem can be cast into a {\it non-convex concave} min-max problem. The main contribution made in this paper is to make stochastic AUC maximization more practical for deep neural networks and big data with theoretical insights as well. In particular, we propose to explore Polyak-\L{}ojasiewicz (PL) condition that has been proved and observed in deep learning, which enables us to develop new stochastic algorithms with even faster convergence rate and more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.
Stochastic Proximal AUC Maximization
Lei, Yunwen, Ying, Yiming
In this paper we consider the problem of maximizing the Area under the ROC curve (AUC) which is a widely used performance metric in imbalanced classification and anomaly detection. Due to the pairwise nonlinearity of the objective function, classical SGD algorithms do not apply to the task of AUC maximization. We propose a novel stochastic proximal algorithm for AUC maximization which is scalable to large scale streaming data. Our algorithm can accommodate general penalty terms and is easy to implement with favorable $O(d)$ space and per-iteration time complexities. We establish a high-probability convergence rate $O(1/\sqrt{T})$ for the general convex setting, and improve it to a fast convergence rate $O(1/T)$ for the cases of strongly convex regularizers and no regularization term (without strong convexity). Our proof does not need the uniform boundedness assumption on the loss function or the iterates which is more fidelity to the practice. Finally, we perform extensive experiments over various benchmark data sets from real-world application domains which show the superior performance of our algorithm over the existing AUC maximization algorithms.
Dual Averaging Method for Online Graph-structured Sparsity
Zhou, Baojian, Chen, Feng, Ying, Yiming
Online learning algorithms update models via one sample per iteration, thus efficient to process large-scale datasets and useful to detect malicious events for social benefits, such as disease outbreak and traffic congestion on the fly. However, existing algorithms for graph-structured models focused on the offline setting and the least square loss, incapable for online setting, while methods designed for online setting cannot be directly applied to the problem of complex (usually non-convex) graph-structured sparsity model. To address these limitations, in this paper we propose a new algorithm for graph-structured sparsity constraint problems under online setting, which we call \textsc{GraphDA}. The key part in \textsc{GraphDA} is to project both averaging gradient (in dual space) and primal variables (in primal space) onto lower dimensional subspaces, thus capturing the graph-structured sparsity effectively. Furthermore, the objective functions assumed here are generally convex so as to handle different losses for online learning settings. To the best of our knowledge, \textsc{GraphDA} is the first online learning algorithm for graph-structure constrained optimization problems. To validate our method, we conduct extensive experiments on both benchmark graph and real-world graph datasets. Our experiment results show that, compared to other baseline methods, \textsc{GraphDA} not only improves classification performance, but also successfully captures graph-structured features more effectively, hence stronger interpretability.
Stochastic Iterative Hard Thresholding for Graph-structured Sparsity Optimization
Zhou, Baojian, Chen, Feng, Ying, Yiming
Stochastic optimization algorithms update models with cheap per-iteration costs sequentially, which makes them amenable for large-scale data analysis. Such algorithms have been widely studied for structured sparse models where the sparsity information is very specific, e.g., convex sparsity-inducing norms or $\ell^0$-norm. However, these norms cannot be directly applied to the problem of complex (non-convex) graph-structured sparsity models, which have important application in disease outbreak and social networks, etc. In this paper, we propose a stochastic gradient-based method for solving graph-structured sparsity constraint problems, not restricted to the least square loss. We prove that our algorithm enjoys a linear convergence up to a constant error, which is competitive with the counterparts in the batch learning setting. We conduct extensive experiments to show the efficiency and effectiveness of the proposed algorithms.
Stability and Optimization Error of Stochastic Gradient Descent for Pairwise Learning
Shen, Wei, Yang, Zhenhuan, Ying, Yiming, Yuan, Xiaoming
In this paper we study the stability and its trade-off with optimization error for stochastic gradient descent (SGD) algorithms in the pairwise learning setting. Pairwise learning refers to a learning task which involves a loss function depending on pairs of instances among which notable examples are bipartite ranking, metric learning, area under ROC (AUC) maximization and minimum error entropy (MEE) principle. Our contribution is twofold. Firstly, we establish the stability results of SGD for pairwise learning in the convex, strongly convex and non-convex settings, from which generalization bounds can be naturally derived. Secondly, we establish the trade-off between stability and optimization error of SGD algorithms for pairwise learning. This is achieved by lower-bounding the sum of stability and optimization error by the minimax statistical error over a prescribed class of pairwise loss functions. From this fundamental trade-off, we obtain lower bounds for the optimization error of SGD algorithms and the excess expected risk over a class of pairwise losses. In addition, we illustrate our stability results by giving some specific examples of AUC maximization, metric learning and MEE.
Learning with Correntropy-induced Losses for Regression with Mixture of Symmetric Stable Noise
Feng, Yunlong, Ying, Yiming
In recent years, correntropy and its applications in machine learning have been drawing continuous attention owing to its merits in dealing with non-Gaussian noise and outliers. However, theoretical understanding of correntropy, especially in the statistical learning context, is still limited. In this study, within the statistical learning framework, we investigate correntropy based regression in the presence of non-Gaussian noise or outliers. To this purpose, we first introduce mixture of symmetric stable noise, which include Gaussian noise, Cauchy noise, and the mixture of Gaussian noise as special cases, to model non-Gaussian noise and outliers. We demonstrate that under the mixture of symmetric stable noise assumption, correntropy based regression can learn the conditional mean function or the conditional median function well without requiring the finite variance assumption of the noise. In particular, we establish learning rates for correntropy based regression estimators that are asymptotically of type $\mathcal{O}(n^{-1})$. We believe that the present study completes our understanding towards correntropy based regression from a statistical learning viewpoint, and may also shed some light on robust statistical learning for regression.
Learning with Average Top-k Loss
Fan, Yanbo, Lyu, Siwei, Ying, Yiming, Hu, Baogang
In this work, we introduce the average top-$k$ (\atk) loss as a new ensemble loss for supervised learning. The \atk loss provides a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss. Furthermore, the \atk loss combines the advantages of them and can alleviate their corresponding drawbacks to better adapt to different data distributions. We show that the \atk loss affords an intuitive interpretation that reduces the penalty of continuous and convex individual losses on correctly classified data. The \atk loss can lead to convex optimization problems that can be solved effectively with conventional sub-gradient based method. We further study the Statistical Learning Theory of \matk by establishing its classification calibration and statistical consistency of \matk which provide useful insights on the practical choice of the parameter $k$. We demonstrate the applicability of \matk learning combined with different individual loss functions for binary and multi-class classification and regression using synthetic and real datasets.
Learning with Average Top-k Loss
Fan, Yanbo, Lyu, Siwei, Ying, Yiming, Hu, Bao-Gang
In this work, we introduce the {\em average top-$k$} (\atk) loss as a new aggregate loss for supervised learning, which is the average over the $k$ largest individual losses over a training dataset. We show that the \atk loss is a natural generalization of the two widely used aggregate losses, namely the average loss and the maximum loss, but can combine their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the \atk loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of \matk learning on the classification calibration of the \atk loss and the error bounds of \atk-SVM. We demonstrate the applicability of minimum average top-$k$ learning for binary classification and regression using synthetic and real datasets.
Stochastic Online AUC Maximization
Ying, Yiming, Wen, Longyin, Lyu, Siwei
Area under ROC (AUC) is a metric which is widely used for measuring the classification performance for imbalanced data. It is of theoretical and practical interest to develop online learning algorithms that maximizes AUC for large-scale data. A specific challenge in developing online AUC maximization algorithm is that the learning objective function is usually defined over a pair of training examples of opposite classes, and existing methods achieves on-line processing with higher space and time complexity. In this work, we propose a new stochastic online algorithm for AUC maximization. In particular, we show that AUC optimization can be equivalently formulated as a convex-concave saddle point problem. From this saddle representation, a stochastic online algorithm (SOLAM) is proposed which has time and space complexity of one datum. We establish theoretical convergence of SOLAM with high probability and demonstrate its effectiveness and efficiency on standard benchmark datasets.