Goto

Collaborating Authors

 Bayesian Inference


A practical Bayesian framework for back-propagation networks

Classics

A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible (1) objective comparisons between solutions using alternative network architectures, (2) objective stopping rules for network pruning or growing procedures, (3) objective choice of magnitude and type of weight decay terms or additive regularizers (for penalizing large weights, etc.), (4) a measure of the effective number of well-determined parameters in a model, (5) quantified estimates of the error bars on network parameters and on network output, and (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian "evidence" automatically embodies "Occam's razor," penalizing overflexible and overcomplex models. The Bayesian approach helps detect poor underlying assumptions in learning models. For learning models well matched to a problem, a good correlation between generalization ability and the Bayesian evidence is obtained.


Asymptotic slowing down of the nearest-neighbor classifier

Neural Information Processing Systems

M2/n' for sufficiently large values of M. Here, Poo(error) denotes the probability of error in the infinite sample limit, and is at most twice the error of a Bayes classifier. Although the value of the coefficient a depends upon the underlying probability distributions, the exponent of M is largely distribution free. We thus obtain a concise relation between a classifier's ability to generalize from a finite reference sample and the dimensionality of the feature space, as well as an analytic validation of Bellman's well known "curse of dimensionality." 1 INTRODUCTION One of the primary tasks assigned to neural networks is pattern classification. Common applications include recognition problems dealing with speech, handwritten characters, DNA sequences, military targets, and (in this conference) sexual identity. Two fundamental concepts associated with pattern classification are generalization (how well does a classifier respond to input data it has never encountered before?) and scalability (how are a classifier's processing and training requirements affected by increasing the number of features that describe the input patterns?).



Modeling Time Varying Systems Using Hidden Control Neural Architecture

Neural Information Processing Systems

This paper introduces a generalization of the layered neural network that can implement a time-varying nonlinear mapping between its observable input and output. The variation of the network's mapping is due to an additional, hidden control input, while the network parameters remain unchanged. We proposed an algorithm for finding the network parameters and the hidden control sequence from a training set of examples of observable input and output. This algorithm implements an approximate maximum likelihood estimation of parameters of an equivalent statistical model, when only the dominant control sequence is taken into account. The conceptual difference between the proposed model and the HMM is that in the HMM approach, the observable data in each of the states is modeled as though it was produced by a memoryless source, and a parametric description of this source is obtained during training, while in the proposed model the observations in each state are produced by a nonlinear dynamical system driven by noise, and both the parametric form of the dynamics and the noise are estimated. The perfonnance of the model was illustrated for the tasks of nonlinear time-varying system modeling and continuously spoken digit recognition. The reported results show the potential of this model for providing high performance speech recognition capability. Acknowledgment Special thanks are due to N. Merhav for numerous comments and helpful discussions.


Asymptotic slowing down of the nearest-neighbor classifier

Neural Information Processing Systems

M2/n' for sufficiently large values of M. Here, Poo(error) denotes the probability of error in the infinite sample limit, and is at most twice the error of a Bayes classifier. Although the value of the coefficient a depends upon the underlying probability distributions, the exponent of M is largely distribution free. We thus obtain a concise relation between a classifier's ability to generalize from a finite reference sample and the dimensionality of the feature space, as well as an analytic validation of Bellman's well known "curse of dimensionality." 1 INTRODUCTION One of the primary tasks assigned to neural networks is pattern classification. Common applications include recognition problems dealing with speech, handwritten characters, DNA sequences, military targets, and (in this conference) sexual identity. Two fundamental concepts associated with pattern classification are generalization (how well does a classifier respond to input data it has never encountered before?) and scalability (how are a classifier's processing and training requirements affected by increasing the number of features that describe the input patterns?).



Modeling Time Varying Systems Using Hidden Control Neural Architecture

Neural Information Processing Systems

This paper introduces a generalization of the layered neural network that can implement a time-varying nonlinear mapping between its observable input and output. The variation of the network's mapping is due to an additional, hidden control input, while the network parameters remain unchanged. We proposed an algorithm for finding the network parameters and the hidden control sequence from a training set of examples of observable input and output. This algorithm implements an approximate maximum likelihood estimation of parameters of an equivalent statistical model, when only the dominant control sequence is taken into account. The conceptual difference between the proposed model and the HMM is that in the HMM approach, the observable data in each of the states is modeled as though it was produced by a memoryless source, and a parametric description of this source is obtained during training, while in the proposed model the observations in each state are produced by a nonlinear dynamical system driven by noise, and both the parametric form of the dynamics and the noise are estimated. The perfonnance of the model was illustrated for the tasks of nonlinear time-varying system modeling and continuously spoken digit recognition. The reported results show the potential of this model for providing high performance speech recognition capability. Acknowledgment Special thanks are due to N. Merhav for numerous comments and helpful discussions.


Modeling Time Varying Systems Using Hidden Control Neural Architecture

Neural Information Processing Systems

This paper introduces a generalization of the layered neural network that can implement a time-varying nonlinear mapping between its observable input and output. The variation of the network's mapping is due to an additional, hidden control input, while the network parameters remain unchanged. We proposed an algorithm for finding the network parameters and the hidden control sequence from a training set of examples of observable input and output. This algorithm implements an approximate maximum likelihood estimation of parameters of an equivalent statistical model, when only the dominant control sequence is taken into account. The conceptual difference between the proposed model and the HMM is that in the HMM approach, the observable data in each of the states is modeled as though it was produced by a memoryless source, and a parametric description of this source is obtained during training, while in the proposed model the observations in each state are produced by a nonlinear dynamical system driven by noise, and both the parametric form of the dynamics and the noise are estimated. The perfonnance of the model was illustrated for the tasks of nonlinear time-varying system modeling and continuously spoken digit recognition. The reported results show the potential of this model for providing high performance speech recognition capability. Acknowledgment Specialthanks are due to N. Merhav for numerous comments and helpful discussions.


Asymptotic slowing down of the nearest-neighbor classifier

Neural Information Processing Systems

Santosh S. Venkatesh Electrical Engineering University of Pennsylvania Philadelphia, PA 19104 If patterns are drawn from an n-dimensional feature space according to a probability distribution that obeys a weak smoothness criterion, we show that the probability that a random input pattern is misclassified by a nearest-neighbor classifier using M random reference patterns asymptotically satisfies a PM(error) "" Poo(error) M2/n' for sufficiently large values of M. Here, Poo(error) denotes the probability of error in the infinite sample limit, and is at most twice the error of a Bayes classifier. Although the value of the coefficient a depends upon the underlying probability distributions, the exponent of M is largely distribution free.We thus obtain a concise relation between a classifier's ability to generalize from a finite reference sample and the dimensionality of the feature space, as well as an analytic validation of Bellman's well known "curse of dimensionality." 1 INTRODUCTION One of the primary tasks assigned to neural networks is pattern classification.