Learning Bounds for Risk-sensitive Learning
In risk-sensitive learning, one aims to find a hypothesis that minimizes a risk-averse (or risk-seeking) measure of loss, instead of the standard expected loss. In this paper, we propose to study the generalization properties of risk-sensitive learning schemes whose optimand is described via optimized certainty equivalents (OCE): our general scheme can handle various known risks, e.g., the entropic risk, mean-variance, and conditional value-at-risk, as special cases. We provide two learning bounds on the performance of the empirical OCE minimizer. The first result gives an OCE guarantee based on the Rademacher average of the hypothesis space, which generalizes and improves existing results on the expected loss and the conditional value-at-risk. The second result, based on a novel variance-based characterization of OCE, gives an expected-loss guarantee with a suppressed dependence on the smoothness of the selected OCE. Finally, we demonstrate the practical implications of the proposed bounds via exploratory experiments on neural networks.
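As a concrete illustration (not from the paper itself), the OCE of a loss distribution can be estimated from a sample by minimizing over the scalar shift variable, and conditional value-at-risk (CVaR) is recovered by a particular choice of the disutility function. The numpy sketch below assumes the standard OCE form OCE(L) = inf over lambda of lambda + E[phi(L - lambda)] and cross-checks the CVaR special case against the sorted-tail average.

```python
import numpy as np

def empirical_oce(losses, phi, candidates):
    # OCE(L) = inf over lambda of { lambda + E[phi(L - lambda)] },
    # estimated here by searching over candidate lambda values.
    return min(lam + np.mean(phi(losses - lam)) for lam in candidates)

# CVaR at level alpha is the OCE with phi(t) = max(t, 0) / (1 - alpha).
alpha = 0.9
phi_cvar = lambda t: np.maximum(t, 0.0) / (1.0 - alpha)

rng = np.random.default_rng(0)
losses = rng.exponential(size=1000)

# For CVaR the minimizing lambda is an empirical quantile, so the sample
# points themselves suffice as candidates.
cvar_oce = empirical_oce(losses, phi_cvar, np.sort(losses))

# Cross-check: empirical CVaR equals the mean of the worst (1 - alpha)
# fraction of the losses when (1 - alpha) * n is an integer.
k = int(round((1 - alpha) * len(losses)))
cvar_sorted = np.sort(losses)[-k:].mean()
```

The agreement of the two estimates reflects the Rockafellar–Uryasev variational characterization of CVaR, which is exactly the OCE representation used above.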
Learning Bound for Parameter Transfer Learning
We consider a transfer-learning problem under the parameter transfer approach, in which a suitable parameter of a feature mapping is learned on one task and then applied to another, objective task. We introduce the notions of local stability of a parametric feature mapping and of parameter transfer learnability, and thereby derive a learning bound for parameter transfer algorithms. As an application of parameter transfer learning, we discuss the performance of sparse coding in self-taught learning. Although self-taught learning algorithms with plentiful unlabeled data often show excellent empirical performance, their theoretical properties have remained unstudied. In this paper, we also provide the first theoretical learning bound for self-taught learning.
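To make the parameter transfer setting concrete, here is a minimal sketch (an illustrative stand-in, not the paper's sparse-coding construction): the feature-map parameter is learned on a plentiful source task, frozen, and plugged into the objective task, where only a linear model on the mapped features is fit. PCA directions stand in for the transferred parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Source task: plentiful (possibly unlabeled) data used to learn the feature map.
X_src = rng.normal(size=(500, 10))
# Objective (target) task: scarce labeled data.
X_tgt = rng.normal(size=(30, 10))
w_true = rng.normal(size=3)

# Learn the feature-map parameter (top principal directions) on the source task.
_, _, Vt = np.linalg.svd(X_src - X_src.mean(0), full_matrices=False)
psi = lambda X, V=Vt[:3]: X @ V.T  # transferred parametric feature mapping

y_tgt = psi(X_tgt) @ w_true + 0.01 * rng.normal(size=30)

# On the objective task, only a linear head on the transferred features is
# learned (ridge regression in closed form).
Phi = psi(X_tgt)
w_hat = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(3), Phi.T @ y_tgt)
```

In the paper's terms, the learning bound controls how errors in the transferred parameter (here, the PCA directions) propagate into the objective-task risk via the local stability of the mapping.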
Learning Bounds for Greedy Approximation with Explicit Feature Maps from Multiple Kernels
Nonlinear kernels can be approximated using finite-dimensional feature maps for efficient risk minimization. Due to the inherent trade-off between the dimension of the (mapped) feature space and the approximation accuracy, the key problem is to identify promising (explicit) features leading to satisfactory out-of-sample performance. In this work, we tackle this problem by efficiently choosing such features from multiple kernels in a greedy fashion. Our method sequentially selects these explicit features from a set of candidate features using a correlation metric. We establish an out-of-sample error bound capturing the trade-off between the error in terms of explicit features (approximation error) and the error due to spectral properties of the best model in the Hilbert space associated with the combined kernel (spectral error). The result verifies that when the (best) underlying data model is sparse enough, i.e., the spectral error is negligible, one can control the test error with a small number of explicit features, which can scale poly-logarithmically with the data. Our empirical results show that given a fixed number of explicit features, the method can achieve a lower test error with a smaller time cost, compared to the state-of-the-art in data-dependent random features.
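A simplified sketch of the greedy scheme described above (illustrative only; candidate features, bandwidth choices, and the refitting step are assumptions, not the paper's exact algorithm): random Fourier features are drawn from several kernels of different bandwidths, and at each round the candidate most correlated with the current residual is added, matching-pursuit style.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, n_feats, T = 200, 5, 100, 10

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Candidate explicit features: random Fourier features drawn from kernels
# with several bandwidths (a simple multiple-kernel candidate pool).
bandwidths = rng.choice([0.5, 1.0, 2.0], size=n_feats)
W = rng.normal(size=(n_feats, d)) * bandwidths[:, None]
b = rng.uniform(0, 2 * np.pi, size=n_feats)
Phi = np.cos(X @ W.T + b)  # (n, n_feats) candidate feature matrix

# Greedy selection: repeatedly add the candidate feature most correlated
# with the residual, then refit least squares on the selected features.
selected, residual = [], y.copy()
for _ in range(T):
    corr = np.abs(Phi.T @ residual)
    corr[selected] = -np.inf  # never re-select a feature
    selected.append(int(np.argmax(corr)))
    S = Phi[:, selected]
    coef, *_ = np.linalg.lstsq(S, y, rcond=None)
    residual = y - S @ coef
```

The bound in the abstract then quantifies how many such greedy rounds T are needed before the approximation error is dominated by the spectral error of the best model in the combined-kernel Hilbert space.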
Review for NeurIPS paper: Learning Bounds for Risk-sensitive Learning
Weaknesses: I have a handful of minor concerns. Exploring inverted OCEs would have been interesting too... (2) ... because while the OCE formulation is convex, at least in the loss, for CVaR (and probably entropic risk), the inverted OCEs look like they lead to a non-convex problem. While machine learning has learned to live with non-convexity in the models, some basic experiments could help assuage any concerns. When using complicated neural networks, my understanding is that these bounds are mostly vacuous because the Rademacher complexities are high, hence the battles over rethinking generalization or the shortcomings of uniform convergence. I don't view these issues as meaning that we shouldn't examine these types of theory problems, but I find the suggestion that the empirical terms will simply vanish and this will solve all our problems to be disingenuous.
Review for NeurIPS paper: Learning Bounds for Risk-sensitive Learning
This is a learning theory paper set in a situation where the usual mean-loss objective function is replaced by a risk-sensitive objective that attributes different weights to data points depending on their loss. This setting is of high importance in robust learning, where only the fraction of the sample with the smallest losses is considered. The paper provides an analysis of this setting via Rademacher bounds, suggests a connection to Sample-Variance-Penalization (SVP), and concludes with some experimental results. The appendix also contains a robustness analysis.
Reviews: Learning Bound for Parameter Transfer Learning
The parameter transfer learning framework described in the paper is very interesting and deserves attention. The approach taken by the authors (described in Section 2) is sound but lacks clarity. The notation is well chosen, but not always properly explained (see my "Specific comments" below). Also, as the transfer learning framework is very similar to domain adaptation, which is studied in many papers (such as Ben-David et al. 2007, cited by the authors), it would be interesting to discuss the connection of Theorem 1 with existing domain adaptation results. Section 3 is difficult to follow for a reader not familiar with sparse coding (like myself).
Reviews: Learning Bounds for Greedy Approximation with Explicit Feature Maps from Multiple Kernels
In particular, [1, Algorithm 3] proposes an approach for minimizing the expected loss of a linear predictor that aims at finding a 'good' sparse solution. The main idea of the algorithm from [1] is to iteratively add features by picking a previously unselected feature that yields the largest reduction in the expected risk. Then, a linear model is trained using the extended feature representation, and afterwards the whole process is repeated. The authors use pretty much the same idea and take a large dictionary of features to represent the data. Following this, they run Algorithm 3 from [1] to pick 'informative' features and generate a sparse feature representation.
Learning Bounds for Domain Adaptation
Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.
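The convex-combination objective above has a simple closed form for squared loss, which the following sketch illustrates (a toy weighted-least-squares instance; the data-generating setup is an assumption for illustration, not from the paper). The mixing weight alpha trades off the large-but-shifted source sample against the small-but-faithful target sample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Large source sample with a slightly shifted labeling function,
# and a small target sample with the true one.
w_src, w_tgt = np.array([1.0, 0.5]), np.array([1.2, 0.4])
X_s = rng.normal(size=(1000, 2))
y_s = X_s @ w_src + 0.1 * rng.normal(size=1000)
X_t = rng.normal(size=(20, 2))
y_t = X_t @ w_tgt + 0.1 * rng.normal(size=20)

def mixed_erm(alpha):
    # Minimize alpha * (target empirical risk) + (1 - alpha) * (source
    # empirical risk) under squared loss: weighted least squares, solved
    # in closed form via the normal equations.
    A = alpha * X_t.T @ X_t / len(X_t) + (1 - alpha) * X_s.T @ X_s / len(X_s)
    b = alpha * X_t.T @ y_t / len(X_t) + (1 - alpha) * X_s.T @ y_s / len(X_s)
    return np.linalg.solve(A, b)

# alpha = 0 is source-only ERM, alpha = 1 is target-only ERM; intermediate
# values interpolate between the two fits.
w_mix = mixed_erm(0.5)
```

The bounds in the abstract identify the optimal alpha as a function of the source-target divergence and the two sample sizes.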