Simplifying Neural Nets by Discovering Flat Minima
Hochreiter, Sepp, Schmidhuber, Jürgen
We present a new algorithm for finding low-complexity networks with high generalization capability. The algorithm searches for large connected regions of so-called "flat" minima of the error function. In the weight-space environment of a "flat" minimum, the error remains approximately constant. Using an MDL-based argument, flat minima can be shown to correspond to low expected overfitting. Although our algorithm requires the computation of second-order derivatives, it has backprop's order of complexity.
Extracting Rules from Artificial Neural Networks with Distributed Representations
Although artificial neural networks have been applied in a variety of real-world scenarios with remarkable success, they have often been criticized for exhibiting a low degree of human comprehensibility. Techniques that compile compact sets of symbolic rules out of artificial neural networks offer a promising perspective to overcome this obvious deficiency of neural network representations. This paper presents an approach to the extraction of if-then rules from artificial neural networks. Its key mechanism is validity interval analysis, which is a generic tool for extracting symbolic knowledge by propagating rule-like knowledge through Backpropagation-style neural networks. Empirical studies in a robot arm domain illustrate the appropriateness of the proposed method for extracting rules from networks with real-valued and distributed representations.
Learning Local Error Bars for Nonlinear Regression
Nix, David A., Weigend, Andreas S.
We present a new method for obtaining local error bars for nonlinear regression, i.e., estimates of the confidence in predicted values that depend on the input. We approach this problem by applying a maximum-likelihood framework to an assumed distribution of errors. We demonstrate our method first on computer-generated data with locally varying, normally distributed target noise. We then apply it to laser data from the Santa Fe Time Series Competition, where the underlying system noise is known quantization error and the error bars give local estimates of model misspecification. In both cases, the method also provides a weighted-regression effect that improves generalization performance.
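The maximum-likelihood idea can be sketched with a Gaussian error model whose variance depends on the input. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Gaussian form follows the abstract's "normally distributed target noise", while the function names and the toy numbers are hypothetical.

```python
import math


def gaussian_nll(targets, means, variances):
    """Mean negative log-likelihood of targets under per-point Gaussians."""
    total = 0.0
    for d, mu, var in zip(targets, means, variances):
        total += 0.5 * math.log(2.0 * math.pi * var) + (d - mu) ** 2 / (2.0 * var)
    return total / len(targets)


def mean_gradient(target, mean, variance):
    """Derivative of one point's negative log-likelihood w.r.t. its predicted mean."""
    return (mean - target) / variance


# Toy data (assumed): a quiet region followed by a noisy region, all means predicted as 0.
targets = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]
means = [0.0] * len(targets)

# Local error bars: one variance per region, set to that region's mean squared residual.
local_vars = [0.01] * 4 + [4.0] * 4
# Single global error bar: the mean squared residual over all points.
global_var = sum((d - m) ** 2 for d, m in zip(targets, means)) / len(targets)

nll_local = gaussian_nll(targets, means, local_vars)
nll_global = gaussian_nll(targets, means, [global_var] * len(targets))
print(f"local NLL = {nll_local:.3f}, global NLL = {nll_global:.3f}")
```

In this sketch the gradient with respect to the predicted mean is the residual divided by the local variance, so points in noisy regions pull less on the fit. That is one way to read the weighted-regression effect mentioned in the abstract.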
Plasticity-Mediated Competitive Learning
Schraudolph, Nicol N., Sejnowski, Terrence J.
Differentiation between the nodes of a competitive learning network is conventionally achieved through competition on the basis of neural activity. Simple inhibitory mechanisms are limited to sparse representations, while decorrelation and factorization schemes that support distributed representations are computationally unattractive. By letting neural plasticity mediate the competitive interaction instead, we obtain diffuse, nonadaptive alternatives for fully distributed representations. We use this technique to simplify and improve our binary information gain optimization algorithm for feature extraction (Schraudolph and Sejnowski, 1993); the same approach could be used to improve other learning algorithms. 1 INTRODUCTION Unsupervised neural networks frequently employ sets of nodes or subnetworks with identical architecture and objective function. Some form of competitive interaction is then needed for these nodes to differentiate and efficiently complement each other in their task.
A Non-linear Information Maximisation Algorithm that Performs Blind Separation
Bell, Anthony J., Sejnowski, Terrence J.
With the exception of (Becker 1992), there has been little attempt to use non-linearity in networks to achieve something a linear network could not. Nonlinear networks, however, are capable of computing more general statistics than those second-order ones involved in decorrelation, and as a consequence they are capable of dealing with signals (and noises) which have detailed higher-order structure. The success of the 'H-J' networks at blind separation (Jutten & Herault 1991) suggests that it should be possible to separate statistically independent components by using learning rules which make use of moments of all orders. This paper takes a principled approach to this problem, by starting with the question of how to maximise the information passed on in a nonlinear feed-forward network. Starting with an analysis of a single unit, the approach is extended to a network mapping N inputs to N outputs. In the process, it will be shown that, under certain fairly weak conditions, the N → N network forms a minimally redundant encoding of the inputs, and that it therefore performs Independent Component Analysis (ICA). 2 Information maximisation The information that output $Y$ contains about input $X$ is defined as: $I(Y, X) = H(Y) - H(Y \mid X)$ (1) where $H(Y)$ is the entropy (information) in the output, while $H(Y \mid X)$ is whatever information the output has which didn't come from the input. In the case that we have no noise (or rather, we don't know what is noise and what is signal in the input), the mapping between $X$ and $Y$ is deterministic and $H(Y \mid X)$ has its lowest possible value of
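For a single unit, information maximisation can be sketched as follows. When the mapping is deterministic and invertible, the standard change-of-variables identity gives H(Y) = H(X) + E[ln |dy/dx|], so raising the average of ln |dy/dx| raises the output entropy. The sketch below runs batch gradient ascent on that average for a logistic unit y = g(wx + w0). The logistic nonlinearity, the Gaussian toy input, the learning rate, and all names are assumptions made for illustration; this is not presented as the authors' exact procedure.

```python
import math
import random


def log_abs_slope(w, w0, x):
    """ln |dy/dx| for y = logistic(w * x + w0), computed stably."""
    u = w * x + w0
    # ln y + ln(1 - y) = -|u| - 2 * ln(1 + exp(-|u|))
    return math.log(abs(w)) - abs(u) - 2.0 * math.log1p(math.exp(-abs(u)))


def objective(w, w0, xs):
    """Sample average of ln |dy/dx|; equals H(Y) up to the constant H(X)."""
    return sum(log_abs_slope(w, w0, x) for x in xs) / len(xs)


def infomax_step(w, w0, xs, lr):
    """One batch gradient-ascent step on the objective."""
    # d/dw  ln|dy/dx| = 1/w + x * (1 - 2y),  d/dw0 ln|dy/dx| = 1 - 2y,
    # and 1 - 2y = -tanh(u / 2) for the logistic function.
    grad_w = 1.0 / w
    grad_w0 = 0.0
    for x in xs:
        one_minus_2y = -math.tanh((w * x + w0) / 2.0)
        grad_w += x * one_minus_2y / len(xs)
        grad_w0 += one_minus_2y / len(xs)
    return w + lr * grad_w, w0 + lr * grad_w0


rng = random.Random(0)
xs = [rng.gauss(2.0, 0.5) for _ in range(500)]  # assumed toy input distribution

w, w0 = 1.0, 0.0
before = objective(w, w0, xs)
for _ in range(2000):
    w, w0 = infomax_step(w, w0, xs, lr=0.05)
after = objective(w, w0, xs)
print(f"objective before = {before:.3f}, after = {after:.3f}, w = {w:.3f}, w0 = {w0:.3f}")
```

In this toy setting the bias shifts the steep part of the logistic curve toward the bulk of the input distribution, and the weight scales its slope to the input's spread, which is what spreads the output values more evenly over (0, 1).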
Multidimensional Scaling and Data Clustering
Hofmann, Thomas, Buhmann, Joachim
Visualizing and structuring pairwise dissimilarity data are difficult combinatorial optimization problems known as multidimensional scaling or pairwise data clustering. Algorithms for embedding a dissimilarity data set in a Euclidean space, for clustering these data, and for actively selecting data to support the clustering process are discussed in the maximum entropy framework. Active data selection provides a strategy to discover structure in a data set efficiently with partially unknown data. 1 Introduction Grouping experimental data into compact clusters arises as a data analysis problem in psychology, linguistics, genetics and other experimental sciences. The data which are supposed to be clustered are either given by an explicit coordinate representation (central clustering) or, in the non-metric case, they are characterized by dissimilarity values for pairs of data points (pairwise clustering). In this paper we study algorithms (i) for embedding non-metric data in a D-dimensional Euclidean space, (ii) for simultaneous clustering and embedding of non-metric data, and (iii) for active data selection to determine a particular cluster structure with a minimal number of data queries. All algorithms are derived from the maximum entropy principle (Hertz et al., 1991), which guarantees robust statistics (Tikochinsky et al., 1984).
Using a Saliency Map for Active Spatial Selective Attention: Implementation & Initial Results
Baluja, Shumeet, Pomerleau, Dean A.
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213. In many vision-based tasks, the ability to focus attention on the important portions of a scene is crucial for good performance. In this paper we present a simple method of achieving spatial selective attention through the use of a saliency map. The saliency map indicates which regions of the input retina are important for performing the task. The saliency map is created through predictive auto-encoding. The performance of this method is demonstrated on two simple tasks which have multiple very strong distracting features in the input retina. Architectural extensions and application directions for this model are presented. On some tasks this extra input can easily be ignored. Nonetheless, often the similarity between the important input features and the irrelevant features is great enough to interfere with task performance.
Bayesian Query Construction for Neural Network Models
Paass, Gerhard, Kindermann, Jörg
If data collection is costly, there is much to be gained by actively selecting particularly informative data points in a sequential way. In a Bayesian decision-theoretic framework we develop a query selection criterion which explicitly takes into account the intended use of the model predictions. By Markov Chain Monte Carlo methods the necessary quantities can be approximated to a desired precision. As the number of data points grows, the model complexity is modified by a Bayesian model selection strategy. The properties of two versions of the criterion are demonstrated in numerical experiments.
Boltzmann Chains and Hidden Markov Models
Saul, Lawrence K., Jordan, Michael I.
Statistical models of discrete time series have a wide range of applications, most notably to problems in speech recognition (Juang & Rabiner, 1991) and molecular biology (Baldi, Chauvin, Hunkapiller, & McClure, 1992). A common problem in these fields is to find a probabilistic model, and a set of model parameters, that account for sequences of observed data. Hidden Markov models (HMMs) have been particularly successful at modeling discrete time series. One reason for this is the powerful learning rule (Baum, 1972), a special case of the Expectation-Maximization (EM) procedure for maximum likelihood estimation (Dempster, Laird, & Rubin, 1977).
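To make concrete how an HMM assigns a probability to an observed sequence, the sketch below computes the likelihood of a discrete observation sequence with the forward recursion. This likelihood is the quantity that maximum likelihood estimation seeks to increase. It is a minimal illustration rather than the paper's method: the two-state parameters and the function names are hypothetical, the sequence is assumed to be non-empty, and the brute-force sum over all hidden paths is included only as a check.

```python
from itertools import product


def forward_likelihood(init, trans, emit, observations):
    """P(observations) under an HMM, via the forward recursion."""
    n_states = len(init)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit[j][obs] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


def brute_force_likelihood(init, trans, emit, observations):
    """Same quantity, summing explicitly over every hidden state path."""
    n_states = len(init)
    total = 0.0
    for path in product(range(n_states), repeat=len(observations)):
        p = init[path[0]] * emit[path[0]][observations[0]]
        for t in range(1, len(observations)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][observations[t]]
        total += p
    return total


# Hypothetical two-state, two-symbol model.
init = [0.6, 0.4]                      # initial state probabilities
trans = [[0.7, 0.3], [0.2, 0.8]]       # trans[i][j] = P(next state j | state i)
emit = [[0.9, 0.1], [0.3, 0.7]]        # emit[s][o] = P(symbol o | state s)
obs = [0, 1, 1, 0]

print(forward_likelihood(init, trans, emit, obs))
print(brute_force_likelihood(init, trans, emit, obs))
```

The forward recursion costs time proportional to the sequence length times the square of the number of states, whereas the explicit path sum grows exponentially with the sequence length.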