Denker, John S.
Learning Curves: Asymptotic Values and Rate of Convergence
Cortes, Corinna, Jackel, L. D., Solla, Sara A., Vapnik, Vladimir, Denker, John S.
Training classifiers on large databases is computationally demanding. It is desirable to develop efficient procedures for a reliable prediction of a classifier's suitability for implementing a given task, so that resources can be assigned to the most promising candidates or freed for exploring new classifier candidates. We propose such a practical and principled predictive method. Practical because it avoids the costly procedure of training poor classifiers on the whole training set, and principled because of its theoretical foundation. The effectiveness of the proposed procedure is demonstrated for both single- and multi-layer networks.
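As an illustration of the kind of extrapolation such a prediction relies on, here is a minimal sketch (not the paper's exact procedure): fit an assumed power-law learning curve to test errors measured on small training subsets and read off the predicted asymptotic error. The parametric form, data values, and initial guesses below are hypothetical.

```python
# Minimal sketch: extrapolate a learning curve from small training subsets.
# Assumed parametric form E(l) = a + b / l**alpha; all numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(l, a, b, alpha):
    """Asymptotic error a, amplitude b, convergence-rate exponent alpha."""
    return a + b / l**alpha

# Hypothetical measurements: training-set sizes and observed test errors.
sizes = np.array([100, 200, 400, 800, 1600], dtype=float)
errors = np.array([0.21, 0.16, 0.13, 0.11, 0.10])

(a, b, alpha), _ = curve_fit(learning_curve, sizes, errors, p0=(0.05, 5.0, 0.5))
print(f"predicted asymptotic test error: {a:.3f} (rate exponent {alpha:.2f})")
```

Candidate classifiers could then be ranked by the fitted asymptote `a` before any of them is trained on the full database.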
Efficient Pattern Recognition Using a New Transformation Distance
Simard, Patrice, LeCun, Yann, Denker, John S.
Memory-based classification algorithms such as radial basis functions or K-nearest neighbors typically rely on simple distances (Euclidean, dot product ...), which are not particularly meaningful on pattern vectors. More complex, better suited distance measures are often expensive and rather ad hoc (elastic matching, deformable templates). We propose a new distance measure which (a) can be made locally invariant to any set of transformations of the input and (b) can be computed efficiently. We tested the method on large handwritten character databases provided by the Post Office and the NIST. Using invariances with respect to translation, rotation, scaling, shearing and line thickness, the method consistently outperformed all other systems tested on the same databases.
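To make the idea of a locally invariant distance concrete, here is a minimal one-sided sketch: the distance from a pattern to the tangent plane of a prototype's transformation manifold, computed by least squares. It assumes the tangent vectors are already available (the paper derives them from small image transformations; here they are arbitrary placeholders).

```python
# Minimal sketch of a one-sided tangent-style distance; tangent vectors
# and patterns below are random placeholders, not real image data.
import numpy as np

def tangent_distance(x, p, tangents):
    """Distance from x to the plane {p + tangents @ a} spanned at prototype p."""
    # Least-squares projection of (x - p) onto the span of the tangent vectors.
    a, *_ = np.linalg.lstsq(tangents, x - p, rcond=None)
    residual = x - p - tangents @ a
    return np.linalg.norm(residual)

rng = np.random.default_rng(0)
x = rng.normal(size=256)        # test pattern (e.g. a 16x16 image, flattened)
p = rng.normal(size=256)        # stored prototype
T = rng.normal(size=(256, 5))   # 5 tangent vectors (e.g. translation, rotation, ...)
print(tangent_distance(x, p, T))
```

Small transformations of the prototype then cost (approximately) nothing, which is what makes the measure locally invariant.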
Multi-Digit Recognition Using a Space Displacement Neural Network
Matan, Ofer, Burges, Christopher J. C., LeCun, Yann, Denker, John S.
We present a feed-forward network architecture for recognizing an unconstrained handwritten multi-digit string. This is an extension of previous work on recognizing isolated digits. In this architecture a single digit recognizer is replicated over the input. The output layer of the network is coupled to a Viterbi alignment module that chooses the best interpretation of the input. Training errors are propagated through the Viterbi module.
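For readers unfamiliar with the alignment step, here is a minimal generic Viterbi sketch (not the paper's exact module): given log-scores from a recognizer replicated at each horizontal position, it picks the best class sequence under a simple transition-score matrix. The lattice sizes and scores are hypothetical.

```python
# Minimal generic Viterbi decoder over a (positions x classes) score lattice.
import numpy as np

def viterbi(emission_scores, transition_scores):
    """emission_scores: (T, C) log-scores; transition_scores: (C, C)."""
    T, C = emission_scores.shape
    delta = np.empty((T, C))
    backptr = np.zeros((T, C), dtype=int)
    delta[0] = emission_scores[0]
    for t in range(1, T):
        # Best previous class for each current class.
        scores = delta[t - 1][:, None] + transition_scores
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emission_scores[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical 4-position, 10-class lattice with uniform transitions.
rng = np.random.default_rng(1)
print(viterbi(rng.normal(size=(4, 10)), np.zeros((10, 10))))
```

In the architecture described above, gradients from the chosen alignment are propagated back through this module into the replicated recognizer.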
Transforming Neural-Net Output Levels to Probability Distributions
Denker, John S., LeCun, Yann
The outputs of a typical multi-output classification network do not satisfy the axioms of probability; probabilities should be positive and sum to one. This problem can be solved by treating the trained network as a preprocessor that produces a feature vector that can be further processed, for instance by classical statistical estimation techniques. It is particularly useful to combine these two ideas: we implement the ideas of section 1 using Parzen windows, where the shape and relative size of each window is computed using the ideas of section 2. This allows us to make contact between important theoretical ideas (e.g. the ensemble formalism) and practical techniques. Our results also shed new light on and generalize the well-known "softmax" scheme.

1 Distribution of Categories in Output Space
In many neural-net applications, it is crucial to produce a set of C numbers that serve as estimates of the probability of C mutually exclusive outcomes. For example, in speech recognition, these numbers represent the probability of C different phonemes; the probabilities of successive segments can be combined using a Hidden Markov Model.
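As background, here is a minimal illustration of the baseline "softmax" normalization that the paper generalizes: raw output-unit activations are mapped to positive numbers that sum to one and can be read as class probabilities. The activation values are hypothetical.

```python
# Minimal softmax illustration: raw outputs -> numbers that satisfy the
# axioms of probability (non-negative, summing to one).
import numpy as np

def softmax(outputs):
    z = outputs - outputs.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

raw = np.array([2.0, -1.0, 0.3])  # hypothetical output-unit activations
probs = softmax(raw)
print(probs, probs.sum())          # positive values, total 1.0
```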
Handwritten Digit Recognition with a Back-Propagation Network
LeCun, Yann, Boser, Bernhard E., Denker, John S., Henderson, Donnie, Howard, R. E., Hubbard, Wayne E., Jackel, Lawrence D.
We present an application of back-propagation networks to handwritten digit recognition. Minimal preprocessing of the data was required, but the architecture of the network was highly constrained and specifically designed for the task. The input of the network consists of normalized images of isolated digits. The method has a 1% error rate and about a 9% reject rate on zipcode digits provided by the U.S. Postal Service.

1 INTRODUCTION
The main point of this paper is to show that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, complex preprocessing stage requiring detailed engineering. Unlike most previous work on the subject (Denker et al., 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of BP networks to deal with large amounts of low-level information. Previous work performed on simple digit images (Le Cun, 1989) showed that the architecture of the network strongly influences the network's generalization ability. Good generalization can only be obtained by designing a network architecture that contains a certain amount of a priori knowledge about the problem. The basic design principle is to minimize the number of free parameters that must be determined by the learning algorithm, without overly reducing the computational power of the network.
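To make the "few free parameters via a priori knowledge" principle concrete, here is a minimal sketch of the weight-sharing idea behind such a constrained layer: one small kernel is replicated across the whole image (a convolution), so the layer has only kernel-size-squared free parameters instead of one weight per (input pixel, output unit) pair. The image size, kernel size, and values are hypothetical, not the paper's actual architecture.

```python
# Minimal weight-sharing sketch: a single 5x5 kernel (25 free parameters)
# replicated over a 16x16 input to produce one feature map.
import numpy as np

def shared_weight_layer(image, kernel):
    """Valid 2-D correlation of a single feature map with a squashing unit."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.tanh(np.sum(image[i:i + k, j:j + k] * kernel))
    return out

image = np.random.default_rng(2).normal(size=(16, 16))   # normalized digit image
kernel = np.random.default_rng(3).normal(size=(5, 5))    # 25 shared weights
print(shared_weight_layer(image, kernel).shape)           # (12, 12) feature map
```

Replicating the same weights everywhere encodes translation tolerance while keeping the number of parameters the learning algorithm must determine small.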
Optimal Brain Damage
LeCun, Yann, Denker, John S., Solla, Sara A.
We have used information-theoretic ideas to derive a class of practical and nearly optimal schemes for adapting the size of a neural network. By removing unimportant weights from a network, several improvements can be expected: better generalization, fewer training examples required, and improved speed of learning and/or classification. The basic idea is to use second-derivative information to make a tradeoff between network complexity and training set error. Experiments confirm the usefulness of the methods on a real-world application.

1 INTRODUCTION
Most successful applications of neural network learning to real-world problems have been achieved using highly structured networks of rather large size [for example (Waibel, 1989; Le Cun et al., 1990a)]. As applications become more complex, the networks will presumably become even larger and more structured.
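Here is a minimal sketch of how the second-derivative idea can be turned into a pruning rule: with a diagonal, quadratic approximation of the error, the saliency of a weight is half its diagonal Hessian entry times its squared value, and the lowest-saliency weights are removed. The weight and Hessian values below are random placeholders.

```python
# Minimal saliency-based pruning sketch under a diagonal-Hessian,
# quadratic-error assumption: s_k = 0.5 * h_kk * w_k**2.
import numpy as np

def prune_by_saliency(weights, hessian_diag, fraction=0.1):
    """Return a boolean mask that keeps all but the lowest-saliency weights."""
    saliency = 0.5 * hessian_diag * weights**2
    n_remove = int(fraction * weights.size)
    cutoff = np.sort(saliency)[n_remove]
    return saliency >= cutoff

rng = np.random.default_rng(4)
w = rng.normal(size=1000)           # trained weights (hypothetical)
h = np.abs(rng.normal(size=1000))   # diagonal second derivatives (hypothetical)
mask = prune_by_saliency(w, h, fraction=0.2)
print(f"kept {mask.sum()} of {w.size} weights")
```

In practice the network would be retrained after pruning, and the prune/retrain cycle repeated while the training-set error stays acceptable.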