Goto

Collaborating Authors

 Inductive Learning


Dynamics of Supervised Learning with Restricted Training Sets

Neural Information Processing Systems

We study the dynamics of supervised learning in layered neural net(cid:173) works, in the regime where the size p of the training set is proportional to the number N of inputs. Here the local fields are no longer described by Gaussian distributions. We use dynamical replica theory to predict the evolution of macroscopic observables, including the relevant error measures, incorporating the old formalism in the limit piN --t 00.


Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Neural Information Processing Systems

In this paper, we consider the problem of active learning in trigonomet(cid:173) ric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By ana(cid:173) lyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the compu(cid:173) tational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effec(cid:173) tiveness of the proposed active learning method.


Generalizable Singular Value Decomposition for Ill-posed Datasets

Neural Information Processing Systems

We demonstrate that statistical analysis of ill-posed data sets is subject to a bias, which can be observed when projecting indepen(cid:173) dent test set examples onto a basis defined by the training exam(cid:173) ples. Because the training examples in an ill-posed data set do not fully span the signal space the observed training set variances in each basis vector will be too high compared to the average vari(cid:173) ance of the test set projections onto the same basis vectors. On basis of this understanding we introduce the Generalizable Singu(cid:173) lar Value Decomposition (GenSVD) as a means to reduce this bias by re-estimation of the singular values obtained in a conventional Singular Value Decomposition, allowing for a generalization perfor(cid:173) mance increase of a subsequent statistical model. We demonstrate that the algorithm succesfully corrects bias in a data set from a functional PET activation study of the human brain. An ill-posed data set has more dimensions in each example than there are examples.


A Gradient-Based Boosting Algorithm for Regression Problems

Neural Information Processing Systems

Adaptive boosting methods are simple modular algorithms that operate as follows. Let 9: X -t Y be the function to be learned, where the label set Y is finite, typ(cid:173) ically binary-valued. The algorithm uses a learning procedure, which has access to n training examples, {(Xl, Y1), ..., (xn, Yn)}, drawn randomly from X x Yac(cid:173) cording to distribution D; it outputs a hypothesis I: X -t Y, whose error is the expected value of a loss function on I(x), g(x), where X is chosen according to D. Given f, cl 0 and access to random examples, a strong learning procedure outputs with probability 1 - cl a hypothesis with error at most f, with running time polyno(cid:173) mial in 1/ f, 1/ cl and the number of examples. A weak learning procedure satisfies the same conditions, but where f need only be better than random guessing. Schapire (1990) showed that any weak learning procedure, denoted WeakLeam, can be efficiently transformed ("boosted") into a strong learning procedure. The AdaBoost algorithm achieves this by calling WeakLeam multiple times, in a se(cid:173) quence of T stages, each time presenting it with a different distribution over a fixed training set and finally combining all of the hypotheses. The algorithm maintains a weight w: for each training example i at stage i, and a distribution D t is computed by normalizing these weights.


Computing with Finite and Infinite Networks

Neural Information Processing Systems

Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires CJ (d S)-training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with order the number of adjustable weights training examples. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.


Gaussian Process Regression with Mismatched Models

Neural Information Processing Systems

Learning curves for Gaussian process regression are well understood when the'student' model happens to match the'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learn(cid:173) ing with excessively strong smoothness assumptions can be partic(cid:173) ularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher func(cid:173) tion only logarithmically slowly. All predictions are confirmed by simulations.


Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

Neural Information Processing Systems

We present a powerful meta-clustering technique called Iterative Dou- ble Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that ex- hibited impressive performance on text categorization tasks [12]. Us- ing synthetically generated data we empirically flnd that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a signiflcantly more accurate classiflcation. IDC is especially advan- tageous when the data exhibits high attribute noise. Our simulation results also show the efiectiveness of IDC in text categorization prob- lems.


Learning Lateral Interactions for Feature Binding and Sensory Segmentation

Neural Information Processing Systems

We present a new approach to the supervised learning of lateral inter- actions for the competitive layer model (CLM) dynamic feature binding architecture. The method is based on consistency conditions, which were recently shown to characterize the attractor states of this linear threshold recurrent network. For a given set of training examples the learning prob- lem is formulated as a convex quadratic optimization problem in the lat- eral interaction weights. An efficient dimension reduction of the learning problem can be achieved by using a linear superposition of basis inter- actions. We show the successful application of the method to a medical image segmentation problem of fluorescence microscope cell images.


Learning with Multiple Labels

Neural Information Processing Systems

In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naIve methods degrade rapidly as more ambiguity is introduced into the labels.


Ranking with Large Margin Principle: Two Approaches

Neural Information Processing Systems

We discuss the problem of ranking k instances with the use of a "large margin" principle. We introduce two main approaches: the first is the "fixed margin" policy in which the margin of the closest neighboring classes is being maximized - which turns out to be a direct generaliza(cid:173) tion of SVM to ranking learning. The second approach allows for k - 1 different margins where the sum of margins is maximized. This approach is shown to reduce to lI-SVM when the number of classes k 2. Both approaches are optimal in size of 21 where I is the total number of training examples. Experiments performed on visual classification and "collab(cid:173) orative filtering" show that both approaches outperform existing ordinal regression algorithms applied for ranking and multi-class SVM applied to general multi-class classification.