Goto

Collaborating Authors

 Inductive Learning


Discontinuous Generalization in Large Committee Machines

Neural Information Processing Systems

The problem of learning from examples in multilayer networks is studied within the framework of statistical mechanics. Using the replica formalism we calculate the average generalization error of a fully connected committee machine in the limit of a large number of hidden units. If the number of training examples is proportional to the number of inputs in the network, the generalization error as a function of the training set size approaches a finite value. If the number of training examples is proportional to the number of weights in the network we find first-order phase transitions with a discontinuous drop in the generalization error for both binary and continuous weights.


A Comparison of Dynamic Reposing and Tangent Distance for Drug Activity Prediction

Neural Information Processing Systems

In drug activity prediction (as in handwritten character recogni(cid:173) tion), the features extracted to describe a training example depend on the pose (location, orientation, etc.) of the example. In hand(cid:173) written character recognition, one of the best techniques for ad(cid:173) dressing this problem is the tangent distance method of Simard, LeCun and Denker (1993). Jain, et al. (1993a; 1993b) introduce a new technique-dynamic reposing-that also addresses this prob(cid:173) lem. Dynamic reposing iteratively learns a neural network and then reposes the examples in an effort to maximize the predicted out(cid:173) put values. New models are trained and new poses computed until models and poses converge.


Cross-Validation Estimates IMSE

Neural Information Processing Systems

Let zN denote a given set of N training examples. Let QN(zN) denote the expected squared error (the expectation taken over all possible examples) of the network after being trained on zN. This measures the quality of fit afforded by training on a given set of N examples. Let IMSEN denote the Integrated Mean Squared Error for training sets of size N. Given reasonable assumptions, it is straightforward to show that IMSEN E[Q N(ZN)] - 0"2, where the expectation is now over all training sets of size N, ZN is a random training set of size N, and 0"2 is the noise variance. Let CN CN(zN) denote the "delete-one cross-validation" squared error measure for a network trained on zN.


Active Learning for Function Approximation

Neural Information Processing Systems

In function approximation, example-based learning can be formulated as synthesiz(cid:173) ing an approximation function for data sampled from an unknown target function (Poggio and Girosi, 1990). Active learning describes a class of example-based learning paradigms that seeks out new training examples from specific regions of the input space, instead of passively accepting examples from some data generating source.


Active Learning with Statistical Models

Neural Information Processing Systems

An active learning problem is one where the learner has the ability or need to influence or select its own training data. Many problems of great practical interest allow active learning, and many even require it. We consider the problem of actively learning a mapping X - Y based on a set of training examples {(Xi,Yi)} l' where Xi E X and Yi E Y. The learner is allowed to iteratively select new inputs x (possibly from a constrained set), observe the resulting output y, and incorporate the new examples (x, y) into its training set. The primary question of active learning is how to choose which x to try next. There are many heuristics for choosing x based on intuition, including choosing places where we don't have data, where we perform poorly [Linden and Weber, 1993], where we have low confidence [Thrun and Moller, 1992], where we expect it


Using Unlabeled Data for Supervised Learning

Neural Information Processing Systems

Many classification problems have the property that the only costly part of obtaining examples is the class label. This paper suggests a simple method for using distribution information contained in unlabeled examples to augment labeled examples in a supervised training framework. Empirical tests show that the technique de(cid:173) scribed in this paper can significantly improve the accuracy of a supervised learner when the learner is well below its asymptotic accuracy level.


Family Discovery

Neural Information Processing Systems

"Family discovery" is the task of learning the dimension and struc(cid:173) ture of a parameterized family of stochastic models. It is espe(cid:173) cially appropriate when the training examples are partitioned into "episodes" of samples drawn from a single parameter value. We present three family discovery algorithms based on surface learn(cid:173) ing and show that they significantly improve performance over two alternatives on a parameterized classification task.


For Valid Generalization the Size of the Weights is More Important than the Size of the Network

Neural Information Processing Systems

This paper shows that if a large neural network is used for a pattern classification problem, and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. More specifi(cid:173) cally, consider an i-layer feed-forward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by A. The misclassification probability con(cid:173) verges to an error estimate (that is closely related to squared error on the training set) at rate O((cA)l(l 1)/2J(log n)jm) ignoring log factors, where m is the number of training patterns, n is the input dimension, and c is a constant. This may explain the gen(cid:173) eralization performance of neural networks, particularly when the number of training examples is considerably smaller than the num(cid:173) ber of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training.


Learning Temporally Persistent Hierarchical Representations

Neural Information Processing Systems

A biologically motivated model of cortical self-organization is pro(cid:173) posed. Context is combined with bottom-up information via a maximum likelihood cost function. Clusters of one or more units are modulated by a common contextual gating Signal; they thereby organize themselves into mutually supportive predictors of abstract contextual features. The model was tested in its ability to discover viewpoint-invariant classes on a set of real image sequences of cen(cid:173) tered, gradually rotating faces. It performed considerably better than supervised back-propagation at generalizing to novel views from a small number of training examples. The importance of context effects l in perception has been demonstrated in many domains.


Generalization in Decision Trees and DNF: Does Size Matter?

Neural Information Processing Systems

Recent theoretical results for pattern classification with thresh(cid:173) olded real-valued functions (such as support vector machines, sig(cid:173) moid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions). For example, we show that with high probabil(cid:173) ity any decision tree of depth no more than d that is consistent with m training examples has misclassification probability no more than o ( ( (Neff VCdim(U) log2 m log d)) 1/2), where U is the class of node decision functions, and Neff::; N can be thought of as the effective number of leaves (it becomes small as the distribution on the leaves induced by the training data gets far from uniform). This bound is qualitatively different from the VC bound and can be considerably smaller. We use the same technique to give similar results for DNF formulae.