Temporal Dynamics of Generalization in Neural Networks
Wang, Changfeng, Venkatesh, Santosh S.
This paper presents a rigorous characterization of how a general nonlinear learning machine generalizes during the training process when it is trained on a random sample using a gradient descent algorithm based on reduction of training error. It is shown, in particular, that best generalization performance occurs, in general, before the global minimum of the training error is achieved. The different roles played by the complexity of the machine class and the complexity of the specific machine in the class during learning are also precisely demarcated. 1 INTRODUCTION In learning machines such as neural networks, two major factors that affect the 'goodness of fit' of the examples are network size (complexity) and training time. These are also the major factors that affect the generalization performance of the network. Many theoretical studies exploring the relation between generalization performance and machine complexity support the parsimony heuristics suggested by Occam's razor, to wit, that amongst machines with similar training performance one should opt for the machine of least complexity.
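A minimal sketch of the early-stopping phenomenon the abstract describes, under assumptions of my own (an over-parameterized polynomial model, a small noisy sample, full-batch gradient descent), not the paper's analysis: held-out error typically reaches its minimum well before training error does.

```python
# Sketch: track training vs. held-out error over gradient descent epochs.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(3 * x)

# Small, noisy training sample; larger clean test sample.
x_train = rng.uniform(-1, 1, 20)
y_train = target(x_train) + 0.3 * rng.normal(size=20)
x_test = np.linspace(-1, 1, 200)
y_test = target(x_test)

def features(x, degree=9):
    # An over-parameterized polynomial basis stands in for a flexible machine.
    return np.vander(x, degree + 1)

Phi_train, Phi_test = features(x_train), features(x_test)
w = np.zeros(Phi_train.shape[1])
lr = 0.05

best_test, best_epoch = np.inf, 0
for epoch in range(5000):
    grad = Phi_train.T @ (Phi_train @ w - y_train) / len(y_train)
    w -= lr * grad
    test_err = np.mean((Phi_test @ w - y_test) ** 2)
    if test_err < best_test:
        best_test, best_epoch = test_err, epoch

train_err = np.mean((Phi_train @ w - y_train) ** 2)
print(f"final training error: {train_err:.4f}")
print(f"best generalization at epoch {best_epoch} (error {best_test:.4f})")
```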
Reinforcement Learning with Soft State Aggregation
Singh, Satinder P., Jaakkola, Tommi, Jordan, Michael I.
It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately, almost all of the theory of reinforcement learning assumes lookup table representations. In this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator based on a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented.
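A minimal sketch, with a toy chain-world and hand-chosen cluster memberships of my own devising, of Q-learning over a fixed soft state aggregation: each state belongs to every cluster with some probability, values are stored per cluster, and each update is applied to a cluster sampled from the current state's membership distribution.

```python
# Sketch: Q-learning with fixed soft state aggregation on a toy chain.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_clusters = 10, 2, 3
gamma, lr = 0.9, 0.1

# Soft membership: each row is a distribution over clusters for one state.
membership = rng.dirichlet(np.ones(n_clusters), size=n_states)
Q = np.zeros((n_clusters, n_actions))  # values live on clusters, not states

def q_state(s):
    # Q(s, a) is the membership-weighted average of cluster values.
    return membership[s] @ Q

def step(s, a):
    # Toy chain dynamics: action 0 moves left, action 1 moves right;
    # reward 1 for reaching the rightmost state.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

s = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < 0.2 else int(np.argmax(q_state(s)))
    s2, r = step(s, a)
    # Sample a cluster from P(. | s) and update only that cluster's value.
    x = rng.choice(n_clusters, p=membership[s])
    Q[x, a] += lr * (r + gamma * np.max(q_state(s2)) - Q[x, a])
    s = 0 if s2 == n_states - 1 else s2

print("greedy actions per state:", [int(np.argmax(q_state(s))) for s in range(n_states)])
```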
Active Learning for Function Approximation
Sung, Kah Kay, Niyogi, Partha
We develop a principled strategy to sample a function optimally for function approximation tasks within a Bayesian framework. Using ideas from optimal experiment design, we introduce an objective function (incorporating both bias and variance) to measure the degree of approximation, and the potential utility of the data points towards optimizing this objective. We show how the general strategy can be used to derive precise algorithms to select data for two cases: learning unit step functions and polynomial functions. In particular, we investigate whether such active algorithms can learn the target with fewer examples. We obtain theoretical and empirical results to suggest that this is the case. 1 INTRODUCTION AND MOTIVATION Learning from examples is a common supervised learning paradigm that hypothesizes a target concept given a stream of training examples that describes the concept. In function approximation, example-based learning can be formulated as synthesizing an approximation function for data sampled from an unknown target function (Poggio and Girosi, 1990). Active learning describes a class of example-based learning paradigms that seeks out new training examples from specific regions of the input space, instead of passively accepting examples from some data-generating source. By judiciously selecting examples instead of allowing for possible random sampling, active learning techniques can conceivably have faster learning rates and better approximation results than passive learning methods. This paper presents a Bayesian formulation for active learning within the function approximation framework.
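A minimal sketch of variance-driven active sampling in the spirit of optimal experiment design, using Bayesian polynomial regression; the model, candidate grid, and target function are my own illustrative choices, not the paper's algorithm.

```python
# Sketch: query the input where the posterior predictive variance is largest.
import numpy as np

rng = np.random.default_rng(2)
degree, noise_var, prior_var = 3, 0.05, 10.0

def phi(x):
    return np.vander(np.atleast_1d(x), degree + 1)

def target(x):
    return 0.5 * x**3 - x

candidates = np.linspace(-1, 1, 101)
X, y = [], []
A = np.eye(degree + 1) / prior_var  # posterior precision matrix

for _ in range(15):
    # Posterior variance of the mean prediction at each candidate input.
    var = np.einsum('ij,jk,ik->i', phi(candidates), np.linalg.inv(A), phi(candidates))
    x_new = candidates[int(np.argmax(var))]   # query the most uncertain point
    X.append(x_new)
    y.append(target(x_new) + rng.normal(scale=noise_var**0.5))
    A += np.outer(phi(x_new)[0], phi(x_new)[0]) / noise_var

# Posterior mean fit and its test error.
Phi = phi(np.array(X))
w = np.linalg.solve(A, Phi.T @ np.array(y) / noise_var)
test_x = np.linspace(-1, 1, 400)
print("mean squared error:", np.mean((phi(test_x) @ w - target(test_x)) ** 2))
```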
A Silicon Axon
Minch, Bradley A., Hasler, Paul E., Diorio, Chris, Mead, Carver
It is well known that axons are neural processes specialized for transmitting information over relatively long distances in the nervous system. Impulsive electrical disturbances known as action potentials are normally initiated near the cell body of a neuron when the voltage across the cell membrane crosses a threshold. These pulses are then propagated with a fairly stereotypical shape at a more or less constant velocity down the length of the axon. Consequently, axons excel at precisely preserving the relative timing of threshold-crossing events but do not preserve any of the initial signal shape. Information, then, is presumably encoded in the relative timing of action potentials.
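A software caricature (entirely my own construction, not the analog circuit) of the timing-preserving behavior described above: a chain of stages that each regenerate a fixed-shape pulse at every threshold crossing, so the input's waveform is discarded while crossing times propagate intact.

```python
# Sketch: a discrete-time chain of threshold-and-regenerate stages.
import numpy as np

n_steps, n_stages, threshold = 100, 5, 0.5

# An arbitrarily shaped input disturbance; its shape will be discarded.
t = np.arange(n_steps)
signal = np.exp(-((t - 10) / 3.0) ** 2) + 0.8 * np.exp(-((t - 60) / 8.0) ** 2)

def stage(inp):
    # Regenerate a unit-height, 3-step pulse at each upward threshold
    # crossing, delayed by one time step (a stereotyped "action potential").
    out = np.zeros_like(inp)
    above = inp > threshold
    crossings = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    for c in crossings:
        out[c + 1 : c + 4] = 1.0   # fixed shape, fixed one-step delay
    return out

x = signal
for k in range(n_stages):
    x = stage(x)

above_in = signal > threshold
print("input crossings: ", np.flatnonzero(above_in[1:] & ~above_in[:-1]) + 1)
print("output pulses at:", np.flatnonzero(x[1:] > x[:-1]) + 1)
```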
Bias, Variance and the Combination of Least Squares Estimators
We consider the effect of combining several least squares estimators on the expected performance of a regression problem. Computing the exact bias and variance curves as a function of the sample size, we are able to quantitatively compare the effect of the combination on the bias and variance separately, and thus on the expected error, which is the sum of the two. Our exact calculations demonstrate that the combination of estimators is particularly useful in the case where the data set is small and noisy and the function to be learned is unrealizable. For large data sets the single estimator produces superior results. Finally, we show that by splitting the data set into several independent parts and training each estimator on a different subset, the performance can in some cases be significantly improved.
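A minimal sketch of the comparison the abstract makes, with a toy setup of my own: an unrealizable target, small noisy samples, and linear least squares fits averaged across disjoint splits versus a single fit on the full sample.

```python
# Sketch: single least squares fit vs. average of fits on disjoint splits.
import numpy as np

rng = np.random.default_rng(3)
n, n_splits, n_trials, noise = 24, 4, 500, 0.5

def target(x):
    return np.sin(2 * np.pi * x)   # not realizable by a linear fit

def fit_linear(x, y):
    # Least squares fit of y ~ a*x + b.
    return np.polyfit(x, y, 1)

x_test = np.linspace(0, 1, 200)
err_single, err_combined = 0.0, 0.0
for _ in range(n_trials):
    x = rng.uniform(0, 1, n)
    y = target(x) + noise * rng.normal(size=n)
    # Single estimator trained on the full sample.
    single = np.polyval(fit_linear(x, y), x_test)
    # Average of estimators trained on disjoint subsets of the sample.
    parts = [np.polyval(fit_linear(x[i::n_splits], y[i::n_splits]), x_test)
             for i in range(n_splits)]
    combined = np.mean(parts, axis=0)
    err_single += np.mean((single - target(x_test)) ** 2) / n_trials
    err_combined += np.mean((combined - target(x_test)) ** 2) / n_trials

print(f"single: {err_single:.4f}  combined: {err_combined:.4f}")
```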
Optimal Movement Primitives
The theory of Optimal Unsupervised Motor Learning shows how a network can discover a reduced-order controller for an unknown nonlinear system by representing only the most significant modes. Here, I extend the theory to apply to command sequences, so that the most significant components discovered by the network correspond to motion "primitives". Combinations of these primitives can be used to produce a wide variety of different movements. I demonstrate applications to human handwriting decomposition and synthesis, as well as to the analysis of electrophysiological experiments on movements resulting from stimulation of the frog spinal cord. 1 INTRODUCTION There is much debate within the neuroscience community concerning the internal representation of movement, and current neurophysiological investigations are aimed at uncovering these representations. In this paper, I propose a different approach that attempts to define the optimal internal representation in terms of "movement primitives", and I compare this representation with the observed behavior.
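A minimal sketch, not the paper's Optimal Unsupervised Motor Learning algorithm: extracting movement "primitives" as the leading SVD modes of a set of command sequences, then synthesizing each movement as a combination of those few modes.

```python
# Sketch: primitives as leading modes of a command-sequence matrix.
import numpy as np

rng = np.random.default_rng(4)
n_movements, seq_len, n_primitives = 40, 100, 3

# Synthetic command sequences built from a few hidden temporal shapes + noise.
t = np.linspace(0, 1, seq_len)
hidden = np.stack([np.sin(2 * np.pi * t), np.sin(4 * np.pi * t), t * (1 - t)])
weights = rng.normal(size=(n_movements, 3))
data = weights @ hidden + 0.05 * rng.normal(size=(n_movements, seq_len))

# The leading right singular vectors serve as movement primitives.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
primitives = Vt[:n_primitives]

# Reconstruct each movement as a combination of the primitives.
coeffs = centered @ primitives.T
recon = coeffs @ primitives + data.mean(axis=0)
err = np.mean((recon - data) ** 2) / np.mean(data ** 2)
print(f"relative reconstruction error with {n_primitives} primitives: {err:.4f}")
```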
Asymptotics of Gradient-based Neural Network Training Algorithms
Mukherjee, Sayandev, Fine, Terrence L.
We study the asymptotic properties of the sequence of iterates of weight-vector estimates obtained by training a multilayer feedforward neural network with a basic gradient-descent method using a fixed learning constant and no batch-processing. In the one-dimensional case, an exact analysis establishes the existence of a limiting distribution that is not Gaussian in general. For the general case and small learning constant, a linearization approximation permits the application of results from the theory of random matrices to again establish the existence of a limiting distribution. We study the first few moments of this distribution to compare and contrast the results of our analysis with those of techniques of stochastic approximation. 1 INTRODUCTION The wide applicability of neural networks to problems in pattern classification and signal processing has been due to the development of efficient gradient-descent algorithms for the supervised training of multilayer feedforward neural networks with differentiable node functions. A basic version uses a fixed learning constant and updates all weights after each training input is presented (online mode) rather than after the entire training set has been presented (batch mode). The properties of this algorithm as exhibited by the sequence of iterates are not yet well understood. There are at present two major approaches.
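A minimal sketch of the object of study in a one-dimensional toy model of my own choosing, not the paper's analysis: online gradient descent with a fixed learning constant, whose weight iterates settle into a non-degenerate stationary distribution; the simulation estimates its first few moments, including the excess kurtosis that signals a departure from Gaussianity.

```python
# Sketch: moments of the iterate distribution under fixed-step online GD.
import numpy as np

rng = np.random.default_rng(5)
lr, n_steps, burn_in = 0.1, 200000, 50000

# Scalar model y = w*x + noise, squared error, one sample per update.
w_true, w = 1.0, 0.0
iterates = np.empty(n_steps)
for t in range(n_steps):
    x = rng.normal()
    y = w_true * x + 0.3 * rng.normal()
    w -= lr * (w * x - y) * x          # online gradient step, fixed constant
    iterates[t] = w

tail = iterates[burn_in:]              # discard transient, keep steady state
z = (tail - tail.mean()) / tail.std()
print(f"mean {tail.mean():.4f}, std {tail.std():.4f}, "
      f"excess kurtosis {np.mean(z**4) - 3:.3f}")
```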