Inductive Learning
Structural Risk Minimization for Character Recognition
The method of Structural Risk Minimization refers to tuning the capacity of the classifier to the available amount of training data. This capac(cid:173) ity is influenced by several factors, including: (1) properties of the input space, (2) nature and structure of the classifier, and (3) learning algorithm. Actions based on these three factors are combined here to control the ca(cid:173) pacity of linear classifiers and improve generalization on the problem of handwritten digit recognition. A common way of training a given classifier is to adjust the parameters w in the classification function F( x, w) to minimize the training error Etrain, i.e. the fre(cid:173) quency of errors on a set of p training examples. Etrain estimates the expected risk based on the empirical data provided by the p available examples.
A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization
We address general optimization tasks that require finding a set of constant param(cid:173) eter values Pi that minimize a given error functional (p). For supervised learning, the error functional consists of some quantitative measure of the deviation between a desired state x T and the actual state of a network x, resulting from an input y and the parameters p. In such context the components of p consist of the con(cid:173) nection strengths, thresholds and other adjustable parameters in the network.
Learning Curves for Gaussian Processes
Within the neural networks community, there has in the last few years been a good deal of excitement about the use of Gaussian processes as an alternative to feedforward networks [lJ. The advantages of Gaussian processes are that prior assumptions about the problem to be learned are encoded in a very transparent way, and that inference-at least in the case of regression that I will consider-is relatively straightforward. One crucial question for applications is then how'fast' Gaussian processes learn, i.e., how many training examples are needed to achieve a certain level of generalization performance. The typical (as opposed to worst case) behaviour is captured in the learning curve, which gives the average generalization error as a function of the number of training examples n. Several workers have [2,3, 4J or studied its large n asymptotics. As I will illustrate derived bounds on (n) below, however, the existing bounds are often far from tight; and asymptotic results will not necessarily apply for realistic sample sizes n. My main aim in this paper is therefore to derive approximations to ( n) which get closer to the true learning curves than existing bounds, and apply both for small and large n. In its simplest form, the regression problem that I am considering is this: We are trying to learn a function 0* which maps inputs x (real-valued vectors) to (real(cid:173) valued scalar) outputs O*(x) .
Dynamics of Supervised Learning with Restricted Training Sets and Noisy Teachers
We generalize a recent formalism to describe the dynamics of supervised learning in layered neural networks, in the regime where data recycling is inevitable, to the case of noisy teachers. Our theory generates reliable predictions for the evolution in time of training- and generalization er(cid:173) rors, and extends the class of mathematically solvable learning processes in large neural networks to those situations where overfitting can occur.
A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
We propose a novel information theoretic approach for semi-supervised learning of conditional random fields. Our approach defines a training objective that combines the conditional likelihood on labeled data and the mutual information on unlabeled data. Different from previous minimum conditional entropy semi-supervised discriminative learning methods, our approach can be naturally cast into the rate distortion theory framework in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show that the rate distortion approach outperforms standard l_2 regularization and minimum conditional entropy regularization on both multi-class classification and sequence labeling problems.
Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
We study the behavior of the popular Laplacian Regularization method for Semi-Supervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in \R d, d \geq 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the smoothness assumptions associated with this alternate method.
Zero-shot Learning with Semantic Output Codes
We consider the problem of zero-shot learning, where the goal is to learn a classifier f: X \rightarrow Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words.
Cocktail Party Processing via Structured Prediction
While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility.
Learning Adaptive Value of Information for Structured Prediction
Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to fine-tune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting non-homogeneity of the data.
Multitask Kernel-based Learning with Logic Constraints
Diligenti, Michelangelo, Gori, Marco, Maggini, Marco, Rigutini, Leonardo
This paper presents a general framework to integrate prior knowledge in the form of logic constraints among a set of task functions into kernel machines. The logic propositions provide a partial representation of the environment, in which the learner operates, that is exploited by the learning algorithm together with the information available in the supervised examples. In particular, we consider a multi-task learning scheme, where multiple unary predicates on the feature space are to be learned by kernel machines and a higher level abstract representation consists of logic clauses on these predicates, known to hold for any input. A general approach is presented to convert the logic clauses into a continuous implementation, that processes the outputs computed by the kernel-based predicates. The learning task is formulated as a primal optimization problem of a loss function that combines a term measuring the fitting of the supervised examples, a regularization term, and a penalty term that enforces the constraints on both supervised and unsupervised examples. The proposed semi-supervised learning framework is particularly suited for learning in high dimensionality feature spaces, where the supervised training examples tend to be sparse and generalization difficult. Unlike for standard kernel machines, the cost function to optimize is not generally guaranteed to be convex. However, the experimental results show that it is still possible to find good solutions using a two stage learning schema, in which first the supervised examples are learned until convergence and then the logic constraints are forced. Some promising experimental results on artificial multi-task learning tasks are reported, showing how the classification accuracy can be effectively improved by exploiting the a priori rules and the unsupervised examples.