Inductive Learning
Minimax Lower Bounds for Realizable Transductive Classification
Tolstikhin, Ilya, Lopez-Paz, David
Transductive learning considers a training set of m labeled samples and a test set of u unlabeled samples, with the goal of best labeling that particular test set. Conversely, inductive learning considers a training set of m labeled samples drawn iid from P(X,Y), with the goal of best labeling any future samples drawn iid from P(X). This comparison suggests that transduction is a much easier type of inference than induction, but is this really the case? This paper provides a negative answer to this question, by proving the first known minimax lower bounds for transductive, realizable, binary classification. Our lower bounds show that m should be at least Ω(d/ǫ log(1/δ)/ǫ) when ǫ-learning a concept class H of finite VC-dimension d with confidence 1 δ, for all m u. This result draws three important conclusions. First, general transduction is as hard as general induction, since both problems have Ω(d/m) minimax values. Second, the use of unlabeled data does not help general transduction, since supervised learning algorithms such as ERM and (Hanneke, 2015) match our transductive lower bounds while ignoring the unlabeled test set. Third, our transductive lower bounds imply lower bounds for semi-supervised learning, which add to the important discussion about the role of unlabeled data in machine learning.
Loss factorization, weakly supervised learning and label noise robustness
Patrini, Giorgio, Nielsen, Frank, Nock, Richard, Carioni, Marcello
We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator --the focal quantity of this work-- which we characterize as the sufficient statistic for the labels. The result tightens known generalization bounds and sheds new light on their interpretation. Factorization has a direct application on weakly supervised learning. In particular, we demonstrate that algorithms like SGD and proximal methods can be adapted with minimal effort to handle weak supervision, once the mean operator has been estimated. We apply this idea to learning with asymmetric noisy labels, connecting and extending prior work. Furthermore, we show that most losses enjoy a data-dependent (by the mean operator) form of noise robustness, in contrast with known negative results.
Active Information Acquisition
He, He, Mineiro, Paul, Karampatziakis, Nikos
We propose a general framework for sequential and dynamic acquisition of useful information in order to solve a particular task. While our goal could in principle be tackled by general reinforcement learning, our particular setting is constrained enough to allow more efficient algorithms. In this paper, we work under the Learning to Search framework and show how to formulate the goal of finding a dynamic information acquisition policy in that framework. We apply our formulation on two tasks, sentiment analysis and image recognition, and show that the learned policies exhibit good statistical performance. As an emergent byproduct, the learned policies show a tendency to focus on the most prominent parts of each instance and give harder instances more attention without explicitly being trained to do so.
Incremental Spectral Sparsification for Large-Scale Graph-Based Semi-Supervised Learning
Calandriello, Daniele, Lazaric, Alessandro, Valko, Michal, Koutis, Ioannis
While the harmonic function solution performs well in many semi-supervised learning (SSL) tasks, it is known to scale poorly with the number of samples. Recent successful and scalable methods, such as the eigenfunction method focus on efficiently approximating the whole spectrum of the graph Laplacian constructed from the data. This is in contrast to various subsampling and quantization methods proposed in the past, which may fail in preserving the graph spectra. However, the impact of the approximation of the spectrum on the final generalization error is either unknown, or requires strong assumptions on the data. In this paper, we introduce Sparse-HFS, an efficient edge-sparsification algorithm for SSL. By constructing an edge-sparse and spectrally similar graph, we are able to leverage the approximation guarantees of spectral sparsification methods to bound the generalization error of Sparse-HFS. As a result, we obtain a theoretically-grounded approximation scheme for graph-based SSL that also empirically matches the performance of known large-scale methods.
Directional Decision Lists
In this paper we introduce a novel family of decision lists consisting of highly interpretable models which can be learned efficiently in a greedy manner. The defining property is that all rules are oriented in the same direction. Particular examples of this family are decision lists with monotonically decreasing (or increasing) probabilities. On simulated data we empirically confirm that the proposed model family is easier to train than general decision lists. We exemplify the practical usability of our approach by identifying problem symptoms in a manufacturing process.
Nonparametric semi-supervised learning of class proportions
Jain, Shantanu, White, Martha, Trosset, Michael W., Radivojac, Predrag
The problem of developing binary classifiers from positive and unlabeled data is often encountered in machine learning. A common requirement in this setting is to approximate posterior probabilities of positive and negative classes for a previously unseen data point. This problem can be decomposed into two steps: (i) the development of accurate predictors that discriminate between positive and unlabeled data, and (ii) the accurate estimation of the prior probabilities of positive and negative examples. In this work we primarily focus on the latter subproblem. We study nonparametric class prior estimation and formulate this problem as an estimation of mixing proportions in two-component mixture models, given a sample from one of the components and another sample from the mixture itself. We show that estimation of mixing proportions is generally ill-defined and propose a canonical form to obtain identifiability while maintaining the flexibility to model any distribution. We use insights from this theory to elucidate the optimization surface of the class priors and propose an algorithm for estimating them. To address the problems of high-dimensional density estimation, we provide practical transformations to low-dimensional spaces that preserve class priors. Finally, we demonstrate the efficacy of our method on univariate and multivariate data.
Learning Kernels for Structured Prediction using Polynomial Kernel Transformations
Tonde, Chetan, Elgammal, Ahmed
Learning the kernel functions used in kernel methods has been a vastly explored area in machine learning. It is now widely accepted that to obtain 'good' performance, learning a kernel function is the key challenge. In this work we focus on learning kernel representations for structured regression. We propose use of polynomials expansion of kernels, referred to as Schoenberg transforms and Gegenbaur transforms, which arise from the seminal result of Schoenberg (1938). These kernels can be thought of as polynomial combination of input features in a high dimensional reproducing kernel Hilbert space (RKHS). We learn kernels over input and output for structured data, such that, dependency between kernel features is maximized. We use Hilbert-Schmidt Independence Criterion (HSIC) to measure this. We also give an efficient, matrix decomposition-based algorithm to learn these kernel transformations, and demonstrate state-of-the-art results on several real-world datasets.
Efficient Learning by Directed Acyclic Graph For Resource Constrained Prediction
Wang, Joseph, Trapeznikov, Kirill, Saligrama, Venkatesh
We study the problem of reducing test-time acquisition costs in classification systems. Our goal is to learn decision rules that adaptively select sensors for each example as necessary to make a confident prediction. We model our system as a directed acyclic graph (DAG) where internal nodes correspond to sensor subsets and decision functions at each node choose whether to acquire a new sensor or classify using the available measurements. This problem can be naturally posed as an empirical risk minimization over training data. Rather than jointly optimizing such a highly coupled and non-convex problem over all decision nodes, we propose an efficient algorithm motivated by dynamic programming. We learn node policies in the DAG by reducing the global objective to a series of cost sensitive learning problems. Our approach is computationally efficient and has proven guarantees of convergence to the optimal system for a fixed architecture. In addition, we present an extension to map other budgeted learning problems with large number of sensors to our DAG architecture and demonstrate empirical performance exceeding state-of-the-art algorithms for data composed of both few and many sensors.
Cognition as a Service: An Industry Perspective
Spohrer, Jim (IBM Research, Almaden) | Banavar, Guruduth (IBM Research)
Recent advances in cognitive computing componentry combined with other factors are leading to commercially viable cognitive systems. From chips to smart phones to public and private clouds, industrial strength “cognition as a service” is beginning to appear at all scales in business and society. Furthermore, in the age of zettabytes on the way to yottabytes, the designers, engineers, and managers of future smart systems will depend on cognition as a service. Cognition as a service can help unlock the mysteries of big data and ultimately boost the creativity and productivity of professionals and their teams, the productive output of industries and organizations, as well as the GDP (gross domestic product) of regions and nations. In this and the next decade, cognition as a service will allow us to re-image work practices, augmenting and scaling expertise to transform professions, industries, and regions.
Embedding Inference for Structured Multilabel Prediction
Mirzazadeh, Farzaneh, Ravanbakhsh, Siamak, Ding, Nan, Schuurmans, Dale
A key bottleneck in structured output prediction is the need for inference during training and testing, usually requiring some form of dynamic programming. Rather than using approximate inference or tailoring a specialized inference method for a particular structure---standard responses to the scaling challenge---we propose to embed prediction constraints directly into the learned representation. By eliminating the need for explicit inference a more scalable approach to structured output prediction can be achieved, particularly at test time. We demonstrate the idea for multi-label prediction under subsumption and mutual exclusion constraints, where a relationship to maximum margin structured output prediction can be established. Experiments demonstrate that the benefits of structured output training can still be realized even after inference has been eliminated.