Technology
Representation and Induction of Finite State Machines using Time-Delay Neural Networks
Clouse, Daniel S., Giles, C. Lee, Horne, Bill G., Cottrell, Garrison W.
This work investigates the representational and inductive capabilities oftime-delay neural networks (TDNNs) in general, and of two subclasses of TDNN, those with delays only on the inputs (IDNN), and those which include delays on hidden units (HDNN). Both architectures arecapable of representing the same class of languages, the definite memory machine (DMM) languages, but the delays on the hidden units in the HDNN helps it outperform the IDNN on problems composed of repeated features over short time windows. 1 Introduction In this paper we consider the representational and inductive capabilities of timedelay neuralnetworks (TDNN) [Waibel et al., 1989] [Lang et al., 1990], also known as NNFIR [Wan, 1993]. A TDNN is a feed-forward network in which the set of inputs to any node i may include the output from previous layers not only in the current time step t, but from d earlier time steps as well. The activation function 404 D.S. Clouse, C. L Giles, B. G. Home and G. W. Cottrell for node i at time t in such a network is given by equation 1: TDNNs have been used in speech recognition [Waibel et al., 1989], and time series prediction [Wan, 1993]. In this paper we concentrate on the language induction problem.
Genetic Algorithms and Explicit Search Statistics
The genetic algorithm (GA) is a heuristic search procedure based on mechanisms abstracted from population genetics. In a previous paper [Baluja & Caruana, 1995], we showed that much simpler algorithms, such as hillcIimbing and Population Based Incremental Learning (PBIL), perform comparably to GAs on an optimization problemcustom designed to benefit from the GA's operators. This paper extends these results in two directions. First, in a large-scale empirical comparison of problems that have been reported in GA literature, we show that on many problems, simpleralgorithms can perform significantly better than GAs. Second, we describe when crossover is useful, and show how it can be incorporated into PBIL. 1 IMPLICIT VS.
The Generalisation Cost of RAMnets
Rohwer, Richard, Morciniec, Michal
Neural Computing Research Group Aston University Aston Triangle, Birmingham B4 7ET, UK. Abstract Given unlimited computational resources, it is best to use a criterion ofminimal expected generalisation error to select a model and determine its parameters. However, it may be worthwhile to sacrifice somegeneralisation performance for higher learning speed. A method for quantifying sub-optimality is set out here, so that this choice can be made intelligently. Furthermore, the method is applicable to a broad class of models, including the ultra-fast memory-based methods such as RAMnets. This brings the added benefit of providing, for the first time, the means to analyse the generalisation properties of such models in a Bayesian framework . 1 Introduction In order to quantitatively predict the performance of methods such as the ultra-fast RAMnet, which are not trained by minimising a cost function, we develop a Bayesian formalism for estimating the generalisation cost of a wide class of algorithms.
Practical Confidence and Prediction Intervals
We propose a new method to compute prediction intervals. Especially forsmall data sets the width of a prediction interval does not only depend on the variance of the target distribution, but also on the accuracy of our estimator of the mean of the target, i.e., on the width of the confidence interval. The confidence interval follows from the variation in an ensemble of neural networks, each of them trained and stopped on bootstrap replicates of the original data set. A second improvement is the use of the residuals on validation patterns insteadof on training patterns for estimation of the variance of the target distribution. As illustrated on a synthetic example, our method is better than existing methods with regard to extrapolation andinterpolation in data regimes with a limited amount of data, and yields prediction intervals which actual confidence levels are closer to the desired confidence levels. 1 STATISTICAL INTERVALS In this paper we will consider feedforward neural networks for regression tasks: estimating an underlying mathematical function between input and output variables based on a finite number of data points possibly corrupted by noise.
For Valid Generalization the Size of the Weights is More Important than the Size of the Network
Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks Generalization and the Size ofthe Weights in Neural Networks 135 that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networksperform well, if the weights are not too big, the size of the weights determines the generalization performance. In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of { -1, I}.
Statistically Efficient Estimations Using Cortical Lateral Connections
Pouget, Alexandre, Zhang, Kechen
Coarse codes are widely used throughout the brain to encode sensory andmotor variables. Methods designed to interpret these codes, such as population vector analysis, are either inefficient, i.e., the variance of the estimate is much larger than the smallest possible variance,or biologically implausible, like maximum likelihood. Moreover, these methods attempt to compute a scalar or vector estimate of the encoded variable. Neurons are faced with a similar estimationproblem. They must read out the responses of the presynaptic neurons, but, by contrast, they typically encode the variable with a further population code rather than as a scalar. We show how a nonlinear recurrent network can be used to perform theseestimation in an optimal way while keeping the estimate in a coarse code format. This work suggests that lateral connections inthe cortex may be involved in cleaning up uncorrelated noise among neurons representing similar variables.
Complex-Cell Responses Derived from Center-Surround Inputs: The Surprising Power of Intradendritic Computation
Mel, Bartlett W., Ruderman, Daniel L., Archie, Kevin A.
Biophysical modeling studies have previously shown that cortical pyramidal cells driven by strong NMDA-type synaptic currents and/or containing dendritic voltage-dependent Ca or Na channels, respondmore strongly when synapses are activated in several spatially clustered groups of optimal size-in comparison to the same number of synapses activated diffusely about the dendritic arbor [8]- The nonlinear intradendritic interactions giving rise to this "cluster sensitivity" property are akin to a layer of virtual nonlinear "hiddenunits" in the dendrites, with implications for the cellular basis of learning and memory [7, 6], and for certain classes of nonlinear sensory processing [8]- In the present study, we show that a single neuron, with access only to excitatory inputs from unoriented ONand OFFcenter cells in the LGN, exhibits the principal nonlinear response properties of a "complex" cell in primary visual cortex, namely orientation tuning coupled with translation invariance andcontrast insensitivity_ We conjecture that this type of intradendritic processing could explain how complex cell responses can persist in the absence of oriented simple cell input [13]- 84 B. W. Mel, D. L. Ruderman and K. A. Archie
Learning Decision Theoretic Utilities through Reinforcement Learning
Stensmo, Magnus, Sejnowski, Terrence J.
Probability models can be used to predict outcomes and compensate for missing data, but even a perfect model cannot be used to make decisions unless the utility of the outcomes, or preferences between them, are also provided. This arises in many real-world problems, such as medical diagnosis, wherethe cost of the test as well as the expected improvement in the outcome must be considered. Relatively little work has been done on learning the utilities of outcomes for optimal decision making. In this paper, we show how temporal-difference reinforcement learning (TO(A» can be used to determine decision theoretic utilities within the context of a mixture model and apply this new approach to a problem in medical diagnosis. TO(A) learning of utilities reduces the number of tests that have to be done to achieve the same level of performance compared with the probability model alone, which results in significant cost savings and increased efficiency.
Unsupervised Learning by Convex and Conic Coding
Lee, Daniel D., Seung, H. Sebastian
Unsupervised learning algorithms based on convex and conic encoders areproposed. The encoders find the closest convex or conic combination of basis vectors to the input. The learning algorithms produce basis vectors that minimize the reconstruction error of the encoders. The convex algorithm develops locally linear models of the input, while the conic algorithm discovers features. Both algorithms areused to model handwritten digits and compared with vector quantization and principal component analysis.
The Effect of Correlated Input Data on the Dynamics of Learning
The convergence properties of the gradient descent algorithm in the case of the linear perceptron may be obtained from the response function. We derive a general expression for the response function and apply it to the case of data with simple input correlations. It is found that correlations severely may slow down learning. This explains the success of PCA as a method for reducing training time. Motivated by this finding we furthermore propose to transform the input data by removing the mean across input variables as well as examples to decrease correlations. Numerical findings for a medical classification problem are in fine agreement with the theoretical results. 1 INTRODUCTION Learning and generalization are important areas of research within the field of neural networks.Although good generalization is the ultimate goal in feed-forward networks (perceptrons), it is of practical importance to understand the mechanism which control the amount of time required for learning, i. e. the dynamics of learning. Thisis of course particularly important in the case of a large data set. An exact analysis of this mechanism is possible for the linear perceptron and as usual it is hoped that the results to some extend may be carried over to explain the behaviour of nonlinear perceptrons.