Principled Architecture Selection for Neural Networks: Application to Corporate Bond Rating Prediction
The notion of generalization ability can be defined precisely as the prediction risk, the expected performance of an estimator in predicting new observations. In this paper, we propose the prediction risk as a measure of the generalization ability of multi-layer perceptron networks and use it to select an optimal network architecture from a set of possible architectures. We also propose a heuristic search strategy to explore the space of possible architectures. The prediction risk must be estimated from the available data; here we estimate it by v-fold cross-validation and by the asymptotic approximations of generalized cross-validation and Akaike's final prediction error. We apply the technique to the problem of predicting corporate bond ratings. This problem is very attractive as a case study, since it is characterized by the limited availability of data and by the lack of a complete a priori model which could be used to impose a structure on the network architecture.
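Estimating the prediction risk by v-fold cross-validation lends itself to a short illustration. The sketch below assumes scikit-learn and synthetic data standing in for bond-rating features; the candidate hidden-layer sizes and estimator settings are illustrative choices, not the paper's, and it simply selects the architecture with the lowest estimated risk.

```python
# Minimal sketch: selecting an MLP architecture by v-fold cross-validation.
# scikit-learn is assumed for brevity; treat candidate sizes and settings as
# illustrative only, not the original implementation.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

def select_architecture(X, y, candidate_hidden_sizes=(2, 4, 8, 16), v=5):
    """Return the hidden-layer size with the lowest estimated prediction risk."""
    risks = {}
    for h in candidate_hidden_sizes:
        net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
        # v-fold cross-validation: the average squared error on held-out folds
        # estimates the prediction risk of this architecture.
        mse = -cross_val_score(net, X, y, cv=v,
                               scoring="neg_mean_squared_error").mean()
        risks[h] = mse
    best = min(risks, key=risks.get)
    return best, risks

if __name__ == "__main__":
    # Synthetic data standing in for bond-rating features and ratings.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
    best, risks = select_architecture(X, y)
    print("estimated risks:", risks, "selected hidden units:", best)
```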
Fast Learning with Predictive Forward Models
A method for transforming performance evaluation signals that are distal both in space and time into proximal signals usable by supervised learning algorithms, presented in [Jordan & Jacobs 90], is examined. A simple observation concerning differentiation through models trained with redundant inputs (as one of their networks is) explains a weakness in the original architecture and suggests a modification: an internal world model that encodes action-space exploration and, crucially, cancels input redundancy to the forward model is added. Learning time on an example task, cart-pole balancing, is thereby reduced by a factor of roughly 50 to 100. 1 INTRODUCTION In many learning control problems, the evaluation used to modify (and thus improve) control may not be available in terms of the controller's output: instead, it may be in terms of a spatial transformation of the controller's output variables (in which case we shall term it "distal in space"), or it may be available only several time steps into the future (termed "distal in time"). For example, control of a robot arm may be exerted in terms of joint angles, while evaluation may be in terms of the endpoint Cartesian coordinates; furthermore, we may only wish to evaluate the endpoint coordinates reached after a certain period of time: the coordinates reached at the end of some motion, for instance. In such cases, supervised learning methods are not directly applicable, and other techniques must be used. Here we study one such technique (proposed for cases where the evaluation is distal in both space and time by [Jordan & Jacobs 90]), analyse a source of its problems, and propose a simple solution which leads to fast, efficient learning. We first describe two methods, and then combine them into the "predictive forward modeling" technique with which we are concerned.
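The core of predictive forward modeling is differentiating a distal error back through a learned forward model to the controller. A minimal sketch of that idea follows, assuming PyTorch and placeholder dynamics; it does not reproduce the cart-pole task or the paper's redundancy-cancelling world model.

```python
# Schematic of distal learning through a forward model (PyTorch assumed;
# illustration only, with toy dynamics standing in for the real plant).
import torch
import torch.nn as nn

state_dim, action_dim, outcome_dim = 4, 1, 2

forward_model = nn.Sequential(nn.Linear(state_dim + action_dim, 32),
                              nn.Tanh(), nn.Linear(32, outcome_dim))
controller = nn.Sequential(nn.Linear(state_dim, 32),
                           nn.Tanh(), nn.Linear(32, action_dim))

def true_environment(state, action):
    # Placeholder dynamics: the distal outcome observed after acting.
    return torch.cat([state[:, :1] + action, state[:, 1:2] - action], dim=1)

opt_f = torch.optim.Adam(forward_model.parameters(), lr=1e-2)
opt_c = torch.optim.Adam(controller.parameters(), lr=1e-2)
target = torch.zeros(1, outcome_dim)  # desired distal outcome

for step in range(500):
    state = torch.randn(32, state_dim)

    # 1. Fit the forward model on observed (state, action) -> outcome pairs.
    action = controller(state).detach()
    pred = forward_model(torch.cat([state, action], dim=1))
    loss_f = ((pred - true_environment(state, action)) ** 2).mean()
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

    # 2. Improve the controller by backpropagating the distal error through
    #    the forward model down to the controller's output.
    action = controller(state)
    pred = forward_model(torch.cat([state, action], dim=1))
    loss_c = ((pred - target) ** 2).mean()
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
```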
Illumination and View Position in 3D Visual Recognition
It is shown that both changes in viewing position and illumination conditions can be compensated for, prior to recognition, using combinations of images taken from different viewing positions and different illumination conditions. It is also shown that, in agreement with psychophysical findings, the computation requires at least a sign-bit image as input - contours alone are not sufficient. 1 Introduction The task of visual recognition is natural and effortless for biological systems, yet the problem of recognition has proven to be very difficult to analyze from a computational point of view. The fundamental reason is that novel images of familiar objects are often not sufficiently similar to previously seen images of that object. Assuming a rigid and isolated object in the scene, there are two major sources of this variability: geometric and photometric. The geometric source of variability comes from changes of view position. A 3D object can be viewed from a variety of directions, each resulting in a different 2D projection. The difference is significant, even for modest changes in viewing position, and can be demonstrated by superimposing those projections (see Figure 1, first row, second image). Much attention has been given to this problem in the visual recognition literature ([9], and references therein), and recent results show that one can compensate for changes in viewing position by generating novel views from a small number of model views of the object [10, 4, 8].
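For the photometric part, an image of a Lambertian surface under a novel light lies (ignoring attached shadows) in the span of images taken under three independent lights. A toy numpy sketch of this combination-of-images idea, with synthetic normals and albedo standing in for real images, is given below.

```python
# Toy sketch of photometric compensation by image combination (numpy assumed).
# Lambertian imaging without clipping at zero, so the linear combination of
# three basis images reproduces a novel-illumination image exactly; attached
# shadows are ignored in this illustration.
import numpy as np

rng = np.random.default_rng(1)
n_pixels = 1000
normals = rng.normal(size=(n_pixels, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
albedo = rng.uniform(0.2, 1.0, size=n_pixels)

def render(light):
    """Unclipped Lambertian image: albedo * (normal . light)."""
    return albedo * (normals @ light)

# Three model images under known, linearly independent light directions.
L = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.3, 0.3, 1.0]])
basis = np.stack([render(l) for l in L], axis=1)  # shape (n_pixels, 3)

# A novel image under an unseen light: recover the combination coefficients.
novel = render(np.array([0.6, 0.2, 0.8]))
coeffs, *_ = np.linalg.lstsq(basis, novel, rcond=None)
reconstruction = basis @ coeffs
print("max reconstruction error:", np.abs(reconstruction - novel).max())
```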
Kernel Regression and Backpropagation Training With Noise
Koistinen, Petri, Holmström, Lasse
One method proposed for improving the generalization capability of a feedforward network trained with the backpropagation algorithm is to use artificial training vectors obtained by adding noise to the original training vectors. We discuss the connection of such backpropagation training with noise to kernel density and kernel regression estimation. We compare, through simulated examples, (1) backpropagation, (2) backpropagation with noise, and (3) kernel regression in mapping estimation and pattern classification contexts.
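A rough numpy sketch of the two ingredients being compared is given below: artificial training vectors produced by adding noise to the originals (which would then be fed to backpropagation) and a Nadaraya-Watson kernel regression estimate. The noise level and bandwidth are arbitrary choices for illustration.

```python
# Sketch contrasting noise-augmented training data with Nadaraya-Watson kernel
# regression (numpy only; sigma and the bandwidth h are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=50)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=50)

# (1) Artificial training vectors: replicate each sample with additive noise,
#     as in backpropagation training with noise.
sigma, copies = 0.3, 10
x_noisy = np.repeat(x_train, copies) + sigma * rng.normal(size=50 * copies)
y_noisy = np.repeat(y_train, copies)
# (x_noisy, y_noisy) would be the training set for backpropagation with noise.
print("augmented training set size:", x_noisy.shape[0])

# (2) Nadaraya-Watson kernel regression estimate with a Gaussian kernel.
def kernel_regression(x_query, x_data, y_data, h=0.3):
    w = np.exp(-0.5 * ((x_query[:, None] - x_data[None, :]) / h) ** 2)
    return (w * y_data).sum(axis=1) / w.sum(axis=1)

x_grid = np.linspace(-3, 3, 200)
print(kernel_regression(x_grid, x_train, y_train)[:5])
```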
Learning in Feedforward Networks with Nonsmooth Functions
Redding, Nicholas J., Downs, T.
This paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth. Examples of such networks are those whose neural transfer functions are piecewise-linear and those whose error function is defined in terms of the ℓ∞ norm. Up to now, networks whose neural transfer functions are piecewise-linear have received very little consideration in the literature, but the possibility of using an error function defined in terms of the ℓ∞ norm has received some attention. In this paper we draw upon some recent results from the field of nonsmooth optimization (NSO) to present an algorithm for the nonsmooth case. Our motivation for this work arose out of the fact that we have been able to show that, in backpropagation, an error function based upon the ℓ∞ norm overcomes the difficulties which can occur when using the ℓ2 norm. 1 INTRODUCTION This paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth.
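As a much simpler stand-in for the paper's nonsmooth-optimization algorithm, the sketch below runs plain subgradient descent on an ℓ∞ (maximum absolute error) objective for a linear model; it only illustrates how a subgradient of the max-error objective is formed, not the authors' method.

```python
# Minimal subgradient-descent sketch for an l_inf (maximum absolute error)
# objective on a linear model (numpy only; the paper's NSO algorithm for
# nonsmooth networks is more sophisticated than this illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.05 * rng.normal(size=100)

w = np.zeros(5)
lr = 0.05
for t in range(1, 2001):
    residuals = X @ w - y
    i = np.argmax(np.abs(residuals))   # the worst-case sample defines the error
    # A subgradient of max_i |x_i . w - y_i| at the current w:
    g = np.sign(residuals[i]) * X[i]
    w -= (lr / np.sqrt(t)) * g         # diminishing step size
print("final l_inf error:", np.max(np.abs(X @ w - y)))
```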
Temporal Adaptation in a Silicon Auditory Nerve
Many auditory theorists consider the temporal adaptation of the auditory nerve a key aspect of speech coding in the auditory periphery. Experiments with models of auditory localization and pitch perception also suggest temporal adaptation is an important element of practical auditory processing. I have designed, fabricated, and successfully tested an analog integrated circuit that models many aspects of auditory nerve response, including temporal adaptation. 1. INTRODUCTION We are modeling known and proposed auditory structures in the brain using analog VLSI circuits, with the goal of making contributions both to engineering practice and biological understanding. Computational neuroscience involves modeling biology at many levels of abstraction. The first silicon auditory models were constructed at a fairly high level of abstraction (Lyon and Mead, 1988; Lazzaro and Mead, 1989ab; Mead et al., 1991; Lyon, 1991). The functional limitations of these silicon systems have prompted a new generation of auditory neural circuits designed at a lower level of abstraction (Watts et al., 1991; Liu et al., 1991).
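Temporal adaptation itself can be caricatured in software as a rectified output minus a slowly tracking adaptation state; the toy sketch below is purely illustrative and bears no relation to the analog circuit's actual implementation.

```python
# Toy discrete-time sketch of temporal adaptation: a strong response at
# stimulus onset that decays to a lower sustained rate. Illustrative only,
# not the analog VLSI circuit described in the paper.
import numpy as np

def adapted_response(stimulus, tau_adapt=50.0, strength=0.9):
    """Rectified output = stimulus minus a slowly tracking adaptation state."""
    adapt = 0.0
    out = np.zeros_like(stimulus)
    for t, s in enumerate(stimulus):
        out[t] = max(s - strength * adapt, 0.0)  # adapted, rectified output
        adapt += (s - adapt) / tau_adapt         # leaky integrator tracks input
    return out

# A step stimulus: the response is large at onset, then settles down.
stim = np.concatenate([np.zeros(20), np.ones(200)])
resp = adapted_response(stim)
print("onset response:", resp[20], "sustained response:", resp[-1])
```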
Merging Constrained Optimisation with Deterministic Annealing to "Solve" Combinatorially Hard Problems
Several parallel analogue algorithms, based upon mean field theory (MFT) approximations to an underlying statistical mechanics formulation, and requiring an externally prescribed annealing schedule, now exist for finding approximate solutions to difficult combinatorial optimisation problems. They have been applied to the Travelling Salesman Problem (TSP), as well as to various issues in computational vision and cluster analysis. I show here that any given MFT algorithm can be combined in a natural way with notions from the areas of constrained optimisation and adaptive simulated annealing to yield a single homogeneous and efficient parallel relaxation technique, for which an externally prescribed annealing schedule is no longer required. The results of numerical simulations on 50-city and 100-city TSP problems are presented, which show that the ensuing algorithms are typically an order of magnitude faster than the MFT algorithms alone and, on occasion, yield superior solutions as well. 1 INTRODUCTION Several promising parallel analogue algorithms, which can be loosely described by the term "deterministic annealing" or "mean field theory (MFT) annealing", have recently been proposed for finding approximate solutions to difficult combinatorial optimisation problems.
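A compact numpy sketch of plain Potts mean-field annealing on a small TSP instance is given below; the penalty weight, temperatures, and geometric cooling schedule are illustrative choices, and the paper's constrained-optimisation extension that removes the prescribed schedule is not reproduced here.

```python
# Compact Potts mean-field annealing sketch for a small TSP instance
# (numpy only; penalty weight and cooling schedule are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n = 10
cities = rng.uniform(size=(n, 2))
d = np.linalg.norm(cities[:, None] - cities[None, :], axis=2)

# V[i, a] = soft assignment of city i to tour position a.
V = np.full((n, n), 1.0 / n) + 1e-3 * rng.normal(size=(n, n))
alpha = 2.0   # soft penalty pushing each position toward exactly one city

T = 1.0
while T > 0.01:
    for _ in range(50):
        # Local field: tour-length coupling to neighbouring positions plus
        # the penalty on each position-column's total occupancy.
        neighbours = np.roll(V, 1, axis=1) + np.roll(V, -1, axis=1)
        u = -(d @ neighbours) - alpha * (V.sum(axis=0, keepdims=True) - 1.0)
        # Potts mean-field update: softmax over positions for each city.
        z = u / T
        expz = np.exp(z - z.max(axis=1, keepdims=True))
        V = expz / expz.sum(axis=1, keepdims=True)
    T *= 0.9  # externally prescribed geometric annealing step

# Decode: the most likely position for each city (may need cleanup to a
# strict permutation in practice).
print("assigned positions:", np.argmax(V, axis=1))
```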
Iterative Construction of Sparse Polynomial Approximations
Sanger, Terence D., Sutton, Richard S., Matheus, Christopher J.
We present an iterative algorithm for nonlinear regression based on construction of sparse polynomials. Polynomials are built sequentially from lower to higher order. Selection of new terms is accomplished using a novel look-ahead approach that predicts whether a variable contributes to the remaining error. The algorithm is based on the tree-growing heuristic in LMS Trees which we have extended to approximation of arbitrary polynomials of the input features. In addition, we provide a new theoretical justification for this heuristic approach.
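A simplified greedy version of sequential term construction is sketched below in numpy: at each step the candidate monomial most correlated with the current residual is added and the coefficients are refit by least squares. This stands in for, but is not, the paper's LMS-tree look-ahead heuristic.

```python
# Simplified greedy sketch of building a sparse polynomial term by term
# (numpy only; a residual-correlation criterion replaces the paper's
# look-ahead heuristic).
import numpy as np
from itertools import combinations_with_replacement

def polynomial_terms(X, max_degree=3):
    """Enumerate monomials of the input features up to max_degree."""
    n, d = X.shape
    terms, names = [np.ones(n)], ["1"]
    for deg in range(1, max_degree + 1):
        for combo in combinations_with_replacement(range(d), deg):
            terms.append(np.prod(X[:, combo], axis=1))
            names.append("*".join(f"x{j}" for j in combo))
    return np.column_stack(terms), names

def build_sparse_polynomial(X, y, n_terms=5, max_degree=3):
    Phi, names = polynomial_terms(X, max_degree)
    selected = [0]                                 # always keep the constant term
    for _ in range(n_terms):
        coef, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
        residual = y - Phi[:, selected] @ coef
        # Score each candidate by its normalized correlation with the residual.
        scores = np.abs(Phi.T @ residual) / (np.linalg.norm(Phi, axis=0) + 1e-12)
        scores[selected] = -np.inf                 # do not reselect a term
        selected.append(int(np.argmax(scores)))
    coef, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
    return [names[i] for i in selected], coef

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] * X[:, 1] - X[:, 2] ** 2 + 0.1 * rng.normal(size=300)
print(build_sparse_polynomial(X, y))
```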
Learning Unambiguous Reduced Sequence Descriptions
Do you want your neural net algorithm to learn sequences? Do not limit yourself to conventional gradient descent (or approximations thereof). Instead, use your sequence learning algorithm (any will do) to implement the following method for history compression. No matter what your final goals are, train a network to predict its next input from the previous ones. Since only unpredictable inputs convey new information, ignore all predictable inputs but let all unexpected inputs (plus information about the time step at which they occurred) become inputs to a higher-level network of the same kind (working on a slower, self-adjusting time scale). Go on building a hierarchy of such networks.
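The principle can be sketched with a trivial online predictor standing in for the predictor networks: only inputs the predictor gets wrong, together with their time steps, are forwarded to the next level.

```python
# Minimal sketch of history compression: an online predictor of the next
# input is updated as the sequence is read, and only the inputs it fails to
# predict (with their time steps) are passed up a level. A frequency-table
# predictor stands in here for the predictor networks.
from collections import defaultdict, Counter

def compress(sequence):
    """Return the (time, symbol) pairs the online predictor got wrong."""
    counts = defaultdict(Counter)
    prev, unexpected = None, []
    for t, sym in enumerate(sequence):
        predicted = counts[prev].most_common(1)[0][0] if counts[prev] else None
        if predicted != sym:
            unexpected.append((t, sym))   # only new information is forwarded
        counts[prev][sym] += 1            # online update of the predictor
        prev = sym
    return unexpected

# Build a two-level hierarchy on a highly regular sequence.
level0 = list("abcabcabcabcabxabcabc")
level1_input = compress(level0)
print("forwarded to level 1:", level1_input)
level2_input = compress([s for _, s in level1_input])
print("forwarded to level 2:", level2_input)
```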