Country
Node Splitting: A Constructive Algorithm for Feed-Forward Neural Networks
A constructive algorithm is proposed for feed-forward neural networks, which uses node-splitting in the hidden layers to build large networks from smaller ones. The small network forms an approximate model of a set of training data, and the split creates a larger more powerful network which is initialised with the approximate solution already found. The insufficiency of the smaller network in modelling the system which generated the data leads to oscillation in those hidden nodes whose weight vectors cover regions in the input space where more detail is required in the model. These nodes are identified and split in two using principal component analysis, allowing the new nodes t.o cover the two main modes of each oscillating vector. Nodes are selected for splitting using principal component analysis on the oscillating weight vectors, or by examining the Hessian matrix of second derivatives of the network error with respect to the weight.s.
Iterative Construction of Sparse Polynomial Approximations
Sanger, Terence D., Sutton, Richard S., Matheus, Christopher J.
Terence D. Sanger Richard S. Sutton Christopher J. Matheus Massachusetts Institute GTE Laboratories GTE Laboratories of Technology Incorporated Incorporated Room E25-534 40 Sylvan Road 40 Sylvan Road Cambridge, MA 02139 Waltham, MA 02254 Waltham, MA 02254 tds@ai.mit.edu Abstract We present an iterative algorithm for nonlinear regression based on construction of sparse polynomials. Polynomials are built sequentially from lower to higher order. Selection of new terms is accomplished using a novel look-ahead approach that predicts whether a variable contributes to the remaining error. The algorithm is based on the tree-growing heuristic in LMS Trees which we have extended to approximation of arbitrary polynomials of the input features.
Learning in Feedforward Networks with Nonsmooth Functions
Redding, Nicholas J., Downs, T.
This paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth. Examples of such networks are those whose neural transfer functions are piecewise-linear and those whose error function is defined in terms of the 100 norm. Up to now, networks whose neural transfer functions are piecewise-linear have received very little consideration in the literature, but the possibility of using an error function defined in terms of the 100 norm has received some attention. In this paper we draw upon some recent results from the field of nonsmooth optimization (NSO) to present an algorithm for the non smooth case. Our motivation for this work arose out of the fact that we have been able to show that, in backpropagation, an error function based upon the 100 norm overcomes the difficulties which can occur when using the 12 norm. 1 INTRODUCTION This paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth.
Networks with Learned Unit Response Functions
Feedforward networks composed of units which compute a sigmoidal function of a weighted sum of their inputs have been much investigated. We tested the approximation and estimation capabilities of networks using functions more complex than sigmoids. Three classes of functions were tested: polynomials, rational functions, and flexible Fourier series. Unlike sigmoids, these classes can fit nonmonotonic functions. They were compared on three problems: prediction of Boston housing prices, the sunspot count, and robot arm inverse dynamics. The complex units attained clearly superior performance on the robot arm problem, which is a highly nonmonotonic, pure approximation problem. On the noisy and only mildly nonlinear Boston housing and sunspot problems, differences among the complex units were revealed; polynomials did poorly, whereas rationals and flexible Fourier series were comparable to sigmoids. 1 Introduction
Splines, Rational Functions and Neural Networks
Williamson, Robert C., Bartlett, Peter L.
Connections between spline approximation, approximation with rational functions, and feedforward neural networks are studied. The potential improvement in the degree of approximation in going from single to two hidden layer networks is examined. Some results of Birman and Solomjak regarding the degree of approximation achievable when knot positions are chosen on the basis of the probability distribution of examples rather than the function values are extended.
Kernel Regression and Backpropagation Training With Noise
Koistinen, Petri, Holmstrรถm, Lasse
One method proposed for improving the generalization capability of a feedforward network trained with the backpropagation algorithm is to use artificial training vectors which are obtained by adding noise to the original training vectors. We discuss the connection of such backpropagation training with noise to kernel density and kernel regression estimation. We compare by simulated examples (1) backpropagation, (2) backpropagation with noise, and (3) kernel regression in mapping estimation and pattern classification contexts.
Merging Constrained Optimisation with Deterministic Annealing to "Solve" Combinatorially Hard Problems
Several parallel analogue algorithms, based upon mean field theory (MFT) approximations to an underlying statistical mechanics formulation, and requiring an externally prescribed annealing schedule, now exist for finding approximate solutions to difficult combinatorial optimisation problems. They have been applied to the Travelling Salesman Problem (TSP), as well as to various issues in computational vision and cluster analysis. I show here that any given MFT algorithm can be combined in a natural way with notions from the areas of constrained optimisation and adaptive simulated annealing to yield a single homogenous and efficient parallel relaxation technique, for which an externally prescribed annealing schedule is no longer required. The results of numerical simulations on 50-city and 100-city TSP problems are presented, which show that the ensuing algorithms are typically an order of magnitude faster than the MFT algorithms alone, and which also show, on occasion, superior solutions as well. 1 INTRODUCTION Several promising parallel analogue algorithms, which can be loosely described by the term "deterministic annealing", or "mean field theory (MFT) annealing", have *also at Theoretical Division and Center for Nonlinear Studies, MSB213, Los Alamos National Laboratory, Los Alamos, NM 87545.
Competitive Anti-Hebbian Learning of Invariants
Schraudolph, Nicol N., Sejnowski, Terrence J.
Although the detection of invariant structure in a given set of input patterns is vital to many recognition tasks, connectionist learning rules tend to focus on directions of high variance (principal components). The prediction paradigm is often used to reconcile this dichotomy; here we suggest a more direct approach to invariant learning based on an anti-Hebbian learning rule. An unsupervised tWO-layer network implementing this method in a competitive setting learns to extract coherent depth information from random-dot stereograms. 1 INTRODUCTION: LEARNING INVARIANT STRUCTURE Many connectionist learning algorithms share with principal component analysis (Jolliffe, 1986) the strategy of extracting the directions of highest variance from the input. A single Hebbian neuron, for instance, will come to encode the input's first principal component (Oja and Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace - cf. (Sanger, 1989; Kung, 1990; Leen, 1991), and others. The same type of representation also develops in the hidden layer of backpropagation autoassociator networks (Baldi and Hornik, 1989).
Repeat Until Bored: A Pattern Selection Strategy
An alternative to the typical technique of selecting training examples independently from a fixed distribution is fonnulated and analyzed, in which the current example is presented repeatedly until the error for that item is reduced to some criterion value,; then, another item is randomly selected. The convergence time can be dramatically increased or decreased by this heuristic, depending on the task, and is very sensitive to the value of .