Undirected Networks
Diffusion of Context and Credit Information in Markovian Models
This paper studies the problem of ergodicity of transition probability matrices in Markovian models, such as hidden Markov models (HMMs), and how it makes very difficult the task of learning to represent long-term context for sequential data. This phenomenon hurts the forward propagation of long-term context information, as well as learning a hidden state representation to represent long-term context, which depends on propagating credit information backwards in time. Using results from Markov chain theory, we show that this problem of diffusion of context and credit is reduced when the transition probabilities approach 0 or 1, i.e., the transition probability matrices are sparse and the model essentially deterministic. The results found in this paper apply to learning approaches based on continuous optimization, such as gradient descent and the Baum-Welch algorithm.
Figure of Merit Training for Detection and Spotting
Chang, Eric I., Lippmann, Richard P.
Spotting tasks require detection of target patterns from a background of richly varied non-target inputs. The performance measure of interest for these tasks, called the figure of merit (FOM), is the detection rate for target patterns when the false alarm rate is in an acceptable range. A new approach to training spotters is presented which computes the FOM gradient for each input pattern and then directly maximizes the FOM using b ackpropagati on. This eliminates the need for thresholds during training. It also uses network resources to model Bayesian a posteriori probability functions accurately only for patterns which have a significant effect on the detection accuracy over the false alarm rate of interest. FOM training increased detection accuracy by 5 percentage points for a hybrid radial basis function (RBF) - hidden Markov model (HMM) wordspotter on the credit-card speech corpus.
Decoding Cursive Scripts
Singer, Yoram, Tishby, Naftali
Online cursive handwriting recognition is currently one of the most intriguing challenges in pattern recognition. This study presents a novel approach to this problem which is composed of two complementary phases. The first is dynamic encoding of the writing trajectory into a compact sequence of discrete motor control symbols. In this compact representation we largely remove the redundancy of the script, while preserving most of its intelligible components. In the second phase these control sequences are used to train adaptive probabilistic acyclic automata (PAA) for the important ingredients of the writing trajectories, e.g.
Learning Temporal Dependencies in Connectionist Speech Recognition
Renals, Steve, Hochberg, Mike, Robinson, Tony
In this paper, we discuss the nature of the time dependence currently employed in our systems using recurrent networks (RNs) and feed-forward multi-layer perceptrons (MLPs). In particular, we introduce local recurrences into a MLP to produce an enhanced input representation. This is in the form of an adaptive gamma filter and incorporates an automatic approach for learning temporal dependencies. We have experimented on a speakerindependent phone recognition task using the TIMIT database. Results using the gamma filtered input representation have shown improvement over the baseline MLP system. Improvements have also been obtained through merging the baseline and gamma filter models.
Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks, and Hidden Markov Models
Bengio, Yoshua, LeCun, Yann, Henderson, Donnie
We introduce a new approach for online recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low resolution "annotated images" where each pixel contains information about trajectory direction and curvature. The recognizer is a convolution network which can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors. 1 Introduction Natural handwriting is often a mixture of different "styles", lower case printed, upper case, and cursive.
Probabilistic Anomaly Detection in Dynamic Systems
This paper describes probabilistic methods for novelty detection when using pattern recognition methods for fault monitoring of dynamic systems. The problem of novelty detection is particularly acute when prior knowledge and training data only allow one to construct an incomplete classification model. Allowance must be made in model design so that the classifier will be robust to data generated by classes not included in the training phase. For diagnosis applications one practical approach is to construct both an input density model and a discriminative class model. Using Bayes' rule and prior estimates of the relative likelihood of data of known and unknown origin the resulting classification equations are straightforward.
Hidden Markov Models for Human Genes
Baldi, Pierre, Brunak, Sรธren, Chauvin, Yves, Engelbrecht, Jacob, Krogh, Anders
Human genes are not continuous but rather consist of short coding regions (exons) interspersed with highly variable non-coding regions (introns). We apply HMMs to the problem of modeling exons, introns and detecting splice sites in the human genome. Our most interesting result so far is the detection of particular oscillatory patterns, with a minimal period ofroughly 10 nucleotides, that seem to be characteristic of exon regions and may have significant biological implications.
Mixtures of Controllers for Jump Linear and Non-Linear Plants
Cacciatore, Timothy W., Nowlan, Steven J.
To control such complex systems it is computationally more efficient to decompose the problem into smaller subtasks, with different control strategies for different operating points. When detailed information about the plant is available, gain scheduling has proven a successful method for designing a global control (Shamma and Athans, 1992). The system is partitioned by choosing several operating points and a linear model for each operating point. A controller is designed for each linear model and a method for interpolating or'scheduling' the gains of the controllers is chosen. The control problem becomes even more challenging when the system to be controlled is non-stationary, and the mode of the system is not explicitly observable.
Monte Carlo Matrix Inversion and Reinforcement Learning
We describe the relationship between certain reinforcement learning (RL) methods based on dynamic programming (DP) and a class of unorthodox Monte Carlo methods for solving systems of linear equations proposed in the 1950's. These methods recast the solution of the linear system as the expected value of a statistic suitably defined over sample paths of a Markov chain. The significance of our observations lies in arguments (Curtiss, 1954) that these Monte Carlo methods scale better with respect to state-space size than do standard, iterative techniques for solving systems of linear equations. This analysis also establishes convergence rate estimates. Because methods used in RL systems for approximating the evaluation function of a fixed control policy also approximate solutions to systems of linear equations, the connection to these Monte Carlo methods establishes that algorithms very similar to TD algorithms (Sutton, 1988) are asymptotically more efficient in a precise sense than other methods for evaluating policies. Further, all DPbased RL methods have some of the properties of these Monte Carlo algorithms, which suggests that although RL is often perceived to be slow, for sufficiently large problems, it may in fact be more efficient than other known classes of methods capable of producing the same results.
The Power of Amnesia
Ron, Dana, Singer, Yoram, Tishby, Naftali
We propose a learning algorithm for a variable memory length Markov process. Human communication, whether given as text, handwriting, or speech, has multi characteristic time scales. On short scales it is characterized mostly by the dynamics that generate the process, whereas on large scales, more syntactic and semantic information is carried. For that reason the conventionally used fixed memory Markov models cannot capture effectively the complexity of such structures. On the other hand using long memory models uniformly is not practical even for as short memory as four.