Technology
Finite State Automata that Recurrent Cascade-Correlation Cannot Represent
This paper relates the computational power of Fahlman' s Recurrent Cascade Correlation (RCC) architecture to that of fInite state automata (FSA). While some recurrent networks are FSA equivalent, RCC is not. The paper presents a theoretical analysis of the RCC architecture in the form of a proof describing a large class of FSA which cannot be realized by RCC. 1 INTRODUCTION Recurrent networks can be considered to be defmed by two components: a network architecture, and a learning rule. The former describes how a network with a given set of weights and topology computes its output values, while the latter describes how the weights (and possibly topology) of the network are updated to fIt a specifIc problem. It is possible to evaluate the computational power of a network architecture by analyzing the types of computations a network could perform assuming appropriate connection weights (and topology).
From Isolation to Cooperation: An Alternative View of a System of Experts
Schaal, Stefan, Atkeson, Christopher G.
We introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Each expert is trained by minimizing a penalized local cross validation error using second order methods. In this way, an expert is able to find a local distance metric by adjusting the size and shape of the receptive field in which its predictions are valid, and also to detect relevant input features by adjusting its bias on the importance of individual input dimensions. We derive asymptotic results for our method. In a variety of simulations the properties of the algorithm are demonstrated with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning.
An Information-theoretic Learning Algorithm for Neural Network Classification
Miller, David J., Rao, Ajit V., Rose, Kenneth, Gersho, Allen
A new learning algorithm is developed for the design of statistical classifiers minimizing the rate of misclassification. The method, which is based on ideas from information theory and analogies to statistical physics, assigns data to classes in probability. The distributions are chosen to minimize the expected classification error while simultaneously enforcing the classifier's structure and a level of "randomness" measured by Shannon's entropy. Achievement of the classifier structure is quantified by an associated cost. The constrained optimization problem is equivalent to the minimization of a Helmholtz free energy, and the resulting optimization method is a basic extension of the deterministic annealing algorithm that explicitly enforces structural constraints on assignments while reducing the entropy and expected cost with temperature. In the limit of low temperature, the error rate is minimized directly and a hard classifier with the requisite structure is obtained. This learning algorithm can be used to design a variety of classifier structures. The approach is compared with standard methods for radial basis function design and is demonstrated to substantially outperform other design methods on several benchmark examples, while often retaining design complexity comparable to, or only moderately greater than that of strict descent-based methods.
Investment Learning with Hierarchical PSOMs
Walter, Jรถrg A., Ritter, Helge
We propose a hierarchical scheme for rapid learning of context dependent "skills" that is based on the recently introduced "Parameterized Self Organizing Map" ("PSOM"). The underlying idea is to first invest some learning effort to specialize the system into a rapid learner for a more restricted range of contexts. The specialization is carried out by a prior "investment learning stage", during which the system acquires a set of basis mappings or "skills" for a set of prototypical contexts. Adaptation of a "skill" to a new context can then be achieved by interpolating in the space of the basis mappings and thus can be extremely rapid. We demonstrate the potential of this approach for the task of a 3D visuomotor map for a Puma robot and two cameras. This includes the forward and backward robot kinematics in 3D end effector coordinates, the 2D 2D retina coordinates and also the 6D joint angles. After the investment phase the transformation can be learned for a new camera setup with a single observation.
Tempering Backpropagation Networks: Not All Weights are Created Equal
Schraudolph, Nicol N., Sejnowski, Terrence J.
Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for ''tempering'' each weight accordingly. In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that ofthe residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights. This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.
The Capacity of a Bump
Recently, several researchers have reported encouraging experimental results when using Gaussian or bump-like activation functions in multilayer perceptrons. Networks of this type usually require fewer hidden layers and units and often learn much faster than typical sigmoidal networks. To explain these results we consider a hyper-ridge network, which is a simple perceptron with no hidden units and a ridยฅe activation function. If we are interested in partitioningp points in d dimensions into two classes then in the limit as d approaches infinity the capacity of a hyper-ridge and a perceptron is identical.
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging
We compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posterior parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's "bagging", which recently has been found to produce impressive results for classification networks.
Generating Accurate and Diverse Members of a Neural-Network Ensemble
Opitz, David W., Shavlik, Jude W.
In particular, combining separately trained neural networks (commonly referred to as a neural-network ensemble) has been demonstrated to be particularly successful (Alpaydin, 1993; Drucker et al., 1994; Hansen and Salamon, 1990; Hashem et al., 1994; Krogh and Vedelsby, 1995; Maclin and Shavlik, 1995; Perrone, 1992). Both theoretical (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995) and empirical (Hashem et al., 1994; 536 D. W. OPITZ, J. W. SHA VLIK Maclin and Shavlik, 1995) work has shown that a good ensemble is one where the individual networks are both accurate and make their errors on different parts of the input space; however, most previous work has either focussed on combining the output of multiple trained networks or only indirectly addressed how we should generate a good set of networks.