Computational Learning Theory
Globally Optimal Learning for Structured Elliptical Losses
Heavy tailed and contaminated data are common in various applications of machine learning. A standard technique to handle regression tasks that involve such data, is to use robust losses, e.g., the popular Huber's loss. In structured problems, however, where there are multiple labels and structural constraints on the labels are imposed (or learned), robust optimization is challenging, and more often than not the loss used is simply the negative log-likelihood of a Gaussian Markov random field. Heavy tailed and contaminated data are common in various applications of machine learning. A standard technique to handle regression tasks that involve such data, is to use robust losses, e.g., the popular Huber's loss.
Towards Problem-dependent Optimal Learning Rates
We study problem-dependent rates, i.e., generalization errors that scale tightly with the variance or the effective loss at the "best hypothesis." Existing uniform convergence and localization frameworks, the most widely used tools to study this problem, often fail to simultaneously provide parameter localization and optimal dependence on the sample size. As a result, existing problem-dependent rates are often rather weak when the hypothesis class is "rich" and the worst-case bound of the loss is large. In this paper we propose a new framework based on a "uniform localized convergence" principle. We provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general "rich" classes; we also establish improved loss-dependent rate for standard empirical risk minimization.
Minimum Description Length and Generalization Guarantees for Representation Learning
A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While the study of representation learning has spurred much interest, most existing such approaches are heuristic; and very little is known about theoretical generalization guarantees. For example, the information bottleneck method seeks a good generalization by finding a minimal description of the input that is maximally informative about the label variable, where minimality and informativeness are both measured by Shannon's mutual information. In this paper, we establish a compressibility framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm in terms of the Minimum Description Length'' (MDL) of the labels or the latent variables (representations). Rather than the mutual information between the encoder's input and the representation, which is often believed to reflect the algorithm's generalization capability in the related literature but in fact, falls short of doing so, our new bounds involve the "multi-letter" relative entropy between the distribution of the representations (or labels) of the training and test sets and a fixed prior.
Agnostic Multi-Group Active Learning
Inspired by the problem of improving classification accuracy on rare or hard subsets of a population, there has been recent interest in models of learning where the goal is to generalize to a collection of distributions, each representing a group''. We consider a variant of this problem from the perspective of active learning, where the learner is endowed with the power to decide which examples are labeled from each distribution in the collection, and the goal is to minimize the number of label queries while maintaining PAC-learning guarantees. Our main challenge is that standard active learning techniques such as disagreement-based active learning do not directly apply to the multi-group learning objective. We modify existing algorithms to provide a consistent active learning algorithm for an agnostic formulation of multi-group learning, which given a collection of G distributions and a hypothesis class \mathcal{H} with VC-dimension d, outputs an \epsilon -optimal hypothesis using \tilde{O}\left( ( u 2/\epsilon 2) G d \theta_{\mathcal{G}} 2 \log 2(1/\epsilon) G\log(1/\epsilon)/\epsilon 2 \right) label queries, where \theta_{\mathcal{G}} is the worst-case disagreement coefficient over the collection. Roughly speaking, this guarantee improves upon the label complexity of standard multi-group learning in regimes where disagreement-based active learning algorithms may be expected to succeed, and the number of groups is not too large.
Online Epsilon Net and Piercing Set for Geometric Concepts
Bhore, Sujoy, Dey, Devdan, Singh, Satyam
VC-dimension and $\varepsilon$-nets are key concepts in Statistical Learning Theory. Intuitively, VC-dimension is a measure of the size of a class of sets. The famous $\varepsilon$-net theorem, a fundamental result in Discrete Geometry, asserts that if the VC-dimension of a set system is bounded, then a small sample exists that intersects all sufficiently large sets. In online learning scenarios where data arrives sequentially, the VC-dimension helps to bound the complexity of the set system, and $\varepsilon$-nets ensure the selection of a small representative set. This sampling framework is crucial in various domains, including spatial data analysis, motion planning in dynamic environments, optimization of sensor networks, and feature extraction in computer vision, among others. Motivated by these applications, we study the online $\varepsilon$-net problem for geometric concepts with bounded VC-dimension. While the offline version of this problem has been extensively studied, surprisingly, there are no known theoretical results for the online version to date. We present the first deterministic online algorithm with an optimal competitive ratio for intervals in $\mathbb{R}$. Next, we give a randomized online algorithm with a near-optimal competitive ratio for axis-aligned boxes in $\mathbb{R}^d$, for $d\le 3$. Furthermore, we introduce a novel technique to analyze similar-sized objects of constant description complexity in $\mathbb{R}^d$, which may be of independent interest. Next, we focus on the continuous version of this problem, where ranges of the set system are geometric concepts in $\mathbb{R}^d$ arriving in an online manner, but the universe is the entire space, and the objective is to choose a small sample that intersects all the ranges.
Reviews: Tight Bounds for Collaborative PAC Learning via Multiplicative Weights
This paper nearly settles (effectively settles, for a large range of parameters) an obvious question: " what is the overhead (as a function of k, the number of underlying distributions) in the sample complexity of collaborative PAC learning, compared to vanilla PAC learning?" An easy upper bound is a factor of k. Blum et al. established an upper bound of O(log 2 k) on the overhead factor (for k O(d), where d is the VC dimension of the concept class to learn), and a lower bound of Omega(log k) (for the specific case k d). The main contribution of this paper is to provide an O(log k) upper bound (for k O(d), again; the general upper bound is slightly more complicated for general k) on that ratio; they also generalize the lower bound to hold for any k d {O(1)}. The lower bound is aobtained by generalized and "bootstarting" the lower bound construction of Blum et al., with some changes to handle the base case.
Reviews: Revisiting Perceptron: Efficient and Label-Optimal Learning of Halfspaces
It shows that their algorithm learns using a nearly tight number of samples in the random independent noise of bounded rate. Previous work had exponentially worse dependence on the noise rate. In addition, it shows that this algorithm can deal with adversarial noise of sufficiently low rate. The latter result improves polynomially on the sample complexity but requires a stronger condtion on the noise rate. The assumptions in this setting are very strong and as a result are highly unlikely to hold in any realistic problem.
Reviews: PAC-learning in the presence of adversaries
This paper develops a PAC framework for learning binary functions in the presence of evasion adversaries. Let X be the domain, and suppose we have an unknown function f: X - {0,1} to be learned, and we also have a hypothesis class H. Moreover, there is a "closeness" relationship defined on pairs in X, so any pair of points in X are either "close" or "far." We also have an unknown distribution P on X. For a hypothesis h, its loss is defined as follows: first we pick a data point x in X according to P. The true label of this point is f(x).
Reviews: Collaborative PAC Learning
I thank the authors for their feedback and clarification and my positive evaluation of this paper remains unchanged. The authors give theoretical guarantees for collaborative learning with shared information under the PAC regime. They show that the sample complexity for (\epsilon,\delta)-PAC learning k classifiers for a shared problem (or a size k majority-voting ensemble) only needs to grow by a factor of approximately 1 log(k) (or 1 log 2(k)) rather than a factor of k as naive theory might predict. The authors provide pseudocode for two algorithms treated by the theory. These were correct as far as I could see, though I didn't implement either.
Reviews: On Fast Leverage Score Sampling and Optimal Learning
After further thought, however, I have updated my review to a 5. While this works seems to have very high potential, the current draft has several important shortcomings: (1) It is not yet polished enough for broad consumption. Significant revisions would be necessary to increase clarity. Summary: This paper presents a novel algorithm for estimating leverage scores for kernel matrices. The running time of this algorithm improves upon state of the art.