Country
SIMPLIFYING NEURAL NETS BY DISCOVERING FLAT MINIMA
Hochreiter, Sepp, Schmidhuber, Jürgen
We present a new algorithm for finding low complexity networks with high generalization capability. The algorithm searches for large connected regions of so-called ''fiat'' minima of the error function. In the weight-space environment of a "flat" minimum, the error remains approximately constant. Using an MDL-based argument, flat minima can be shown to correspond to low expected overfitting. Although our algorithm requires the computation of second order derivatives, it has backprop's order of complexity. Experiments with feedforward and recurrent nets are described. In an application to stock market prediction, the method outperforms conventional backprop, weight decay, and "optimal brain surgeon".
Capacity and Information Efficiency of a Brain-like Associative Net
Graham, Bruce, Willshaw, David
In this paper we consider the capacity of a binary associative net (Willshaw, Buneman, & Longuet-Higgins, 1969; Willshaw, 1971; Buckingham, 1991) containing these features. While the associative net is a very simple model of associative memory, its behaviour as a storage device is not trivial and yet it is tractable to theoretical analysis.
Extracting Rules from Artificial Neural Networks with Distributed Representations
Although artificial neural networks have been applied in a variety of real-world scenarios with remarkable success, they have often been criticized for exhibiting a low degree of human comprehensibility. Techniques that compile compact sets of symbolic rules out of artificial neural networks offer a promising perspective to overcome this obvious deficiency of neural network representations. This paper presents an approach to the extraction of if-then rules from artificial neural networks. Its key mechanism is validity interval analysis, which is a generic tool for extracting symbolic knowledge by propagating rule-like knowledge through Backpropagation-style neural networks. Empirical studies in a robot arm domain illustrate the appropriateness of the proposed method for extracting rules from networks with real-valued and distributed representations.
Dynamic Cell Structures
Dynamic Cell Structures (DCS) represent a family of artificial neural architectures suited both for unsupervised and supervised learning. They belong to the recently [Martinetz94] introduced class of Topology Representing Networks (TRN) which build perlectly topology preserving feature maps. DCS empI'oy a modified Kohonen learning rule in conjunction with competitive Hebbian learning. The Kohonen type learning rule serves to adjust the synaptic weight vectors while Hebbian learning establishes a dynamic lateral connection structure between the units reflecting the topology of the feature manifold. In case of supervised learning, i.e. function approximation, each neural unit implements a Radial Basis Function, and an additional layer of linear output units adjusts according to a delta-rule. DCS is the first RBF-based approximation scheme attempting to concurrently learn and utilize a perfectly topology preserving map for improved performance. Simulations on a selection of CMU-Benchmarks indicate that the DCS idea applied to the Growing Cell Structure algorithm [Fritzke93] leads to an efficient and elegant algorithm that can beat conventional models on similar tasks.
Learning Local Error Bars for Nonlinear Regression
Nix, David A., Weigend, Andreas S.
We present a new method for obtaining local error bars for nonlinear regression, i.e., estimates of the confidence in predicted values that depend on the input. We approach this problem by applying a maximumlikelihood framework to an assumed distribution of errors. We demonstrate our method first on computer-generated data with locally varying, normally distributed target noise. We then apply it to laser data from the Santa Fe Time Series Competition where the underlying system noise is known quantization error and the error bars give local estimates of model misspecification. In both cases, the method also provides a weightedregression effect that improves generalization performance.
Phase-Space Learning
Tsung, Fu-Sheng, Cottrell, Garrison W.
Existing recurrent net learning algorithms are inadequate. We introduce the conceptual framework of viewing recurrent training as matching vector fields of dynamical systems in phase space. Phasespace reconstruction techniques make the hidden states explicit, reducing temporal learning to a feed-forward problem. In short, we propose viewing iterated prediction [LF88] as the best way of training recurrent networks on deterministic signals. Using this framework, we can train multiple trajectories, insure their stability, and design arbitrary dynamical systems. 1 INTRODUCTION Existing general-purpose recurrent algorithms are capable of rich dynamical behavior. Unfortunately, straightforward applications of these algorithms to training fully-recurrent networks on complex temporal tasks have had much less success than their feedforward counterparts. For example, to train a recurrent network to oscillate like a sine wave (the "hydrogen atom" of recurrent learning), existing techniques such as Real Time Recurrent Learning (RTRL) [WZ89] perform suboptimally. Williams & Zipser trained a two-unit network with RTRL, with one teacher signal. One unit of the resulting network showed a distorted waveform, the other only half the desired amplitude.
A Non-linear Information Maximisation Algorithm that Performs Blind Separation
Bell, Anthony J., Sejnowski, Terrence J.
With the exception of (Becker 1992), there has been little attempt to use non-linearity in networks to achieve something a linear network could not. Nonlinear networks, however, are capable of computing more general statistics than those second-order ones involved in decorrelation, and as a consequence they are capable of dealing with signals (and noises) which have detailed higher-order structure. The success of the'H-J' networks at blind separation (Jutten & Herault 1991) suggests that it should be possible to separate statistically independent components, by using learning rules which make use of moments of all orders. This paper takes a principled approach to this problem, by starting with the question of how to maximise the information passed on in nonlinear feed-forward network. Starting with an analysis of a single unit, the approach is extended to a network mapping N inputs to N outputs. In the process, it will be shown that, under certain fairly weak conditions, the N ---. N network forms a minimally redundant encoding ofthe inputs, and that it therefore performs Independent Component Analysis (ICA). 2 Information maximisation The information that output Y contains about input X is defined as: I(Y, X) H(Y) - H(YIX) (1) where H(Y) is the entropy (information) in the output, while H(YIX) is whatever information the output has which didn't come from the input. In the case that we have no noise (or rather, we don't know what is noise and what is signal in the input), the mapping between X and Y is deterministic and H(YIX) has its lowest possible value of
Multidimensional Scaling and Data Clustering
Hofmann, Thomas, Buhmann, Joachim
Visualizing and structuring pairwise dissimilarity data are difficult combinatorial optimization problems known as multidimensional scaling or pairwise data clustering. Algorithms for embedding dissimilarity data set in a Euclidian space, for clustering these data and for actively selecting data to support the clustering process are discussed in the maximum entropy framework. Active data selection provides a strategy to discover structure in a data set efficiently with partially unknown data. 1 Introduction Grouping experimental data into compact clusters arises as a data analysis problem in psychology, linguistics, genetics and other experimental sciences. The data which are supposed to be clustered are either given by an explicit coordinate representation (central clustering) or, in the non-metric case, they are characterized by dissimilarity values for pairs of data points (pairwise clustering). In this paper we study algorithms (i) for embedding non-metric data in a D-dimensional Euclidian space, (ii) for simultaneous clustering and embedding of non-metric data, and (iii) for active data selection to determine a particular cluster structure with minimal number of data queries. All algorithms are derived from the maximum entropy principle (Hertz et al., 1991) which guarantees robust statistics (Tikochinsky et al., 1984).
Using a Saliency Map for Active Spatial Selective Attention: Implementation & Initial Results
Baluja, Shumeet, Pomerleau, Dean A.
School of Computer Science School of Computer Science Carnegie Mellon University Carnegie Mellon University Pittsburgh, PA 15213 Pittsburgh, PA 15213 Abstract In many vision based tasks, the ability to focus attention on the important portions of a scene is crucial for good performance on the tasks. In this paper we present a simple method of achieving spatial selective attention through the use of a saliency map. The saliency map indicates which regions of the input retina are important for performing the task. The saliency map is created through predictive auto-encoding. The performance of this method is demonstrated on two simple tasks which have multiple very strong distracting features in the input retina. Architectural extensions and application directions for this model are presented. On some tasks this extra input can easily be ignored. Nonetheless, often the similarity between the important input features and the irrelevant features is great enough to interfere with task performance.