The Gamma MLP for Speech Phoneme Recognition
Lawrence, Steve, Tsoi, Ah Chung, Back, Andrew D.
We define a Gamma multi-layer perceptron (MLP) as an MLP with the usual synaptic weights replaced by gamma filters (as proposed by de Vries and Principe (de Vries and Principe, 1992)) and associated gain terms throughout all layers. We derive gradient descent update equations and apply the model to the recognition of speech phonemes. We find that both the inclusion of gamma filters in all layers and the inclusion of synaptic gains improve the performance of the Gamma MLP. We compare the Gamma MLP with TDNN, Back-Tsoi FIR MLP, and Back-Tsoi IIR MLP architectures, and a local approximation scheme. We find that the Gamma MLP results in a substantial reduction in error rates. 1 INTRODUCTION 1.1 THE GAMMA FILTER Infinite Impulse Response (IIR) filters have a significant advantage over Finite Impulse Response (FIR) filters in signal processing: the length of the impulse response is uncoupled from the number of filter parameters.
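As an illustration of the gamma memory this abstract builds on, here is a minimal sketch of a single gamma filter, assuming the standard recursion x_k(t) = (1 - mu) x_k(t-1) + mu x_{k-1}(t-1) from de Vries and Principe (1992); the function name and NumPy usage are illustrative, not the authors' code:

    import numpy as np

    def gamma_filter(u, weights, mu):
        """Single gamma filter: a cascade of identical leaky integrators
        feeding a weighted tap sum.  mu = 1 recovers an FIR delay line, so
        memory depth is decoupled from the number of taps."""
        K = len(weights) - 1
        x = np.zeros(K + 1)               # tap states x_0 .. x_K
        y = np.empty(len(u))
        for t, u_t in enumerate(u):
            for k in range(K, 0, -1):     # update taps from previous states
                x[k] = (1.0 - mu) * x[k] + mu * x[k - 1]
            x[0] = u_t                    # x_0 is the raw input
            y[t] = weights @ x
        return y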
Using Pairs of Data-Points to Define Splits for Decision Trees
Hinton, Geoffrey E., Revow, Michael
Conventional binary classification trees such as CART either split the data using axis-aligned hyperplanes or they perform a computationally expensive search in the continuous space of hyperplanes with unrestricted orientations. We show that the limitations of the former can be overcome without resorting to the latter. For every pair of training data-points, there is one hyperplane that is orthogonal to the line joining the data-points and bisects this line. Such hyperplanes are plausible candidates for splits. In a comparison on a suite of 12 datasets we found that this method of generating candidate splits outperformed the standard methods, particularly when the training sets were small. 1 Introduction Binary decision trees come in many flavours, but they all rely on splitting the set of k-dimensional data-points at each internal node into two disjoint sets.
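The candidate-split construction described above is simple enough to state directly; the following sketch enumerates the hyperplanes (function and variable names are illustrative, not the authors' code):

    import numpy as np
    from itertools import combinations

    def pair_splits(X):
        """Yield one candidate split per pair of training points: the
        hyperplane orthogonal to the line joining the pair, bisecting it."""
        for i, j in combinations(range(len(X)), 2):
            w = X[j] - X[i]                  # normal: direction between the pair
            b = w @ (X[i] + X[j]) / 2.0      # passes through the midpoint
            yield w, b                       # split test: sign(w @ x - b)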
Forward-backward retraining of recurrent neural networks
Senior, Andrew W., Robinson, Anthony J.
This paper describes the training of a recurrent neural network as the letter posterior probability estimator for a hidden Markov model, off-line handwriting recognition system. The network estimates posterior distributions for each of a series of frames representing sections of a handwritten word. The supervised training algorithm, backpropagation through time, requires target outputs to be provided for each frame. Three methods for deriving these targets are presented. A novel method based upon the forward-backward algorithm is found to result in the recognizer with the lowest error rate. 1 Introduction In the field of off-line handwriting recognition, the goal is to read a handwritten document and produce a machine transcription.
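A minimal sketch of how the forward-backward algorithm can supply soft frame-level targets, assuming a uniform initial state prior and per-frame state likelihoods; names and the rescaling scheme are illustrative, not the paper's implementation:

    import numpy as np

    def soft_targets(obs_prob, trans):
        """Frame/state posteriors gamma[t, s] = P(state s | all frames),
        usable as training targets in place of hard frame labels.
        obs_prob: (T, S) per-frame state likelihoods; trans: (S, S)."""
        T, S = obs_prob.shape
        alpha = np.zeros((T, S)); beta = np.zeros((T, S))
        alpha[0] = obs_prob[0] / obs_prob[0].sum()     # uniform prior folded in
        for t in range(1, T):                          # forward pass, rescaled
            a = (alpha[t - 1] @ trans) * obs_prob[t]
            alpha[t] = a / a.sum()
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):                 # backward pass, rescaled
            b = trans @ (obs_prob[t + 1] * beta[t + 1])
            beta[t] = b / b.sum()
        gamma = alpha * beta
        return gamma / gamma.sum(axis=1, keepdims=True)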
Human Face Detection in Visual Scenes
Rowley, Henry A., Baluja, Shumeet, Kanade, Takeo
We present a neural network-based face detection system. A retinally connected neural network examines small windows of an image, and decides whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. We use a bootstrap algorithm for training, which adds false detections into the training set as training progresses. This eliminates the difficult task of manually selecting non-face training examples, which must be chosen to span the entire space of non-face images.
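The bootstrap loop this abstract describes can be sketched compactly; windows are assumed to be pre-extracted feature rows and make_model any classifier factory with fit/predict (all names here are illustrative, not the authors' code):

    import numpy as np

    def bootstrap_negatives(make_model, faces, nonface_windows, rounds=5):
        """Train, scan face-free material, and feed the detector's false
        detections back in as negative examples, as in the abstract."""
        rng = np.random.default_rng(0)
        idx = rng.choice(len(nonface_windows), size=min(100, len(nonface_windows)))
        negatives = nonface_windows[idx]               # small random seed set
        model = None
        for _ in range(rounds):
            X = np.vstack([faces, negatives])
            y = np.r_[np.ones(len(faces)), np.zeros(len(negatives))]
            model = make_model().fit(X, y)
            # every "face" found in face-free windows is a false detection
            false_pos = nonface_windows[model.predict(nonface_windows) == 1]
            if len(false_pos) == 0:
                break
            negatives = np.vstack([negatives, false_pos])
        return model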
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging
Ormoneit, Dirk, Tresp, Volker
We compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posteriori parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's "bagging", which has recently been found to produce impressive results for classification networks.
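The ensemble-averaging variant can be sketched with a modern stand-in for the paper's EM implementation; scikit-learn's GaussianMixture is an assumption of this sketch, not something the paper uses:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def bagged_density(X, n_components=5, n_models=10, seed=0):
        """Breiman-style bagging for density estimation: fit each mixture
        on a bootstrap resample of X, then average the densities."""
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_models):
            Xb = X[rng.integers(0, len(X), size=len(X))]   # bootstrap resample
            models.append(GaussianMixture(n_components=n_components).fit(Xb))
        def density(Xq):
            return np.mean([np.exp(m.score_samples(Xq)) for m in models], axis=0)
        return density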
Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System
Kershaw, Dan J., Robinson, Anthony J., Hochberg, Mike
A method for incorporating context-dependent phone classes in a connectionist-HMM hybrid speech recognition system is introduced. A modular approach is adopted, where single-layer networks discriminate between different context classes given the phone class and the acoustic data. The context networks are combined with a context-independent (CI) network to generate context-dependent (CD) phone probability estimates. Experiments show an average reduction in word error rate of 16% and 13% from the CI system on ARPA 5,000 word and SQALE 20,000 word tasks respectively. Due to improved modelling, decoding with the CD system is more than twice as fast as with the CI system.
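The combination step amounts to a probability factorization; a sketch under the assumption that the context networks output P(context | phone, acoustics) (array shapes and names are illustrative):

    import numpy as np

    def cd_probs(ci_probs, context_probs):
        """P(phone, context | x) = P(phone | x) * P(context | phone, x).
        ci_probs: (P,) CI phone posteriors from the main network;
        context_probs: (P, C) context posteriors given each phone.
        Returns (P, C) context-dependent estimates."""
        return ci_probs[:, None] * context_probs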
Generating Accurate and Diverse Members of a Neural-Network Ensemble
Opitz, David W., Shavlik, Jude W.
Combining separately trained neural networks (commonly referred to as a neural-network ensemble) has been demonstrated to be particularly successful (Alpaydin, 1993; Drucker et al., 1994; Hansen and Salamon, 1990; Hashem et al., 1994; Krogh and Vedelsby, 1995; Maclin and Shavlik, 1995; Perrone, 1992). Both theoretical (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995) and empirical (Hashem et al., 1994; Maclin and Shavlik, 1995) work has shown that a good ensemble is one where the individual networks are both accurate and make their errors on different parts of the input space; however, most previous work has either focussed on combining the output of multiple trained networks or only indirectly addressed how we should generate a good set of networks.
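The accuracy-versus-diversity point cited here has a crisp form for simple-average regression ensembles (Krogh and Vedelsby, 1995): ensemble error equals mean member error minus mean ambiguity. A worked sketch (names are illustrative):

    import numpy as np

    def ensemble_decomposition(preds, y):
        """preds: (M, N) predictions of M member networks on N examples;
        y: (N,) targets.  Verifies the identity for the simple average."""
        f_bar = preds.mean(axis=0)                   # ensemble output
        mean_err = ((preds - y) ** 2).mean()         # average member error
        ambiguity = ((preds - f_bar) ** 2).mean()    # disagreement (diversity)
        ens_err = ((f_bar - y) ** 2).mean()
        assert np.isclose(ens_err, mean_err - ambiguity)
        return ens_err, mean_err, ambiguity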
Is Learning The n-th Thing Any Easier Than Learning The First?
Thrun, Sebastian
This paper investigates learning in a lifelong context. Lifelong learning addresses situations in which a learner faces a whole stream of learning tasks. Such scenarios provide the opportunity to transfer knowledge across multiple learning tasks, in order to generalize more accurately from less training data. In this paper, several different approaches to lifelong learning are described, and applied in an object recognition domain. It is shown that across the board, lifelong learning approaches generalize consistently more accurately from less training data, by their ability to transfer knowledge across learning tasks. 1 Introduction Supervised learning is concerned with approximating an unknown function based on examples. Virtually all current approaches to supervised learning assume that one is given a set of input-output examples, denoted by X, which characterize an unknown function, denoted by f.
Reinforcement Learning by Probability Matching
Sabes, Philip N., Jordan, Michael I.
We present a new algorithm for associative reinforcement learning. The algorithm is based upon the idea of matching a network's output probability with a probability distribution derived from the environment's reward signal. This Probability Matching algorithm is shown to perform faster and be less susceptible to local minima than previously existing algorithms. We use Probability Matching to train mixture of experts networks, an architecture for which other reinforcement learning rules fail to converge reliably on even simple problems. This architecture is particularly well suited for our algorithm as it can compute arbitrarily complex functions yet calculation of the output probability is simple. 1 INTRODUCTION The problem of learning associative networks from scalar reinforcement signals is notoriously difficult.
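To make the matching idea concrete, here is a toy bandit sketch that moves a softmax policy toward a reward-derived target distribution; the exp(reward/temperature) target and the update rule are stand-ins chosen for illustration, not the paper's exact algorithm:

    import numpy as np

    def probability_matching_bandit(pull, n_actions, steps=1000, lr=0.1, temp=0.5):
        """pull(a) returns the environment's scalar reward for action a."""
        logits = np.zeros(n_actions)
        r_hat = np.zeros(n_actions)          # running mean reward per action
        counts = np.zeros(n_actions)
        p = np.full(n_actions, 1.0 / n_actions)
        for _ in range(steps):
            p = np.exp(logits - logits.max()); p /= p.sum()
            a = np.random.choice(n_actions, p=p)
            r = pull(a)
            counts[a] += 1
            r_hat[a] += (r - r_hat[a]) / counts[a]
            target = np.exp(r_hat / temp); target /= target.sum()
            logits += lr * (target - p)      # cross-entropy gradient step
        return p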
Discriminant Adaptive Nearest Neighbor Classification and Regression
Hastie, Trevor, Tibshirani, Robert
Nearest neighbor classification expects the class conditional probabilities to be locally constant, and suffers from bias in high dimensions. We propose a locally adaptive form of nearest neighbor classification to try to finesse this curse of dimensionality. We use a local linear discriminant analysis to estimate an effective metric for computing neighborhoods. We determine the local decision boundaries from centroid information, and then shrink neighborhoods in directions orthogonal to these local decision boundaries, and elongate them parallel to the boundaries. Thereafter, any neighborhood-based classifier can be employed, using the modified neighborhoods. We also propose a method for global dimension reduction that combines local dimension information.
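A condensed sketch of the local metric construction, assuming the form Sigma = W^{-1/2} (W^{-1/2} B W^{-1/2} + eps I) W^{-1/2} built from within-class (W) and between-class (B) scatter in a neighborhood; the eigendecomposition-based inverse square root and the eps handling are simplifications of the paper's treatment:

    import numpy as np

    def dann_metric(Xn, yn, eps=1.0):
        """Local metric from neighborhood points Xn (n, d) and labels yn:
        distances (x - x0) @ Sigma @ (x - x0) shrink neighborhoods in
        directions orthogonal to the local decision boundaries."""
        n, d = Xn.shape
        mu = Xn.mean(axis=0)
        W = np.zeros((d, d)); B = np.zeros((d, d))
        for c in np.unique(yn):
            Xc = Xn[yn == c]
            mc = Xc.mean(axis=0)
            W += (Xc - mc).T @ (Xc - mc) / n                 # within-class scatter
            B += (len(Xc) / n) * np.outer(mc - mu, mc - mu)  # between-class scatter
        evals, evecs = np.linalg.eigh(W)
        W_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-8))) @ evecs.T
        Bw = W_inv_sqrt @ B @ W_inv_sqrt                     # whitened between-class
        return W_inv_sqrt @ (Bw + eps * np.eye(d)) @ W_inv_sqrt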