Asymptotics of Gradient-based Neural Network Training Algorithms
Mukherjee, Sayandev, Fine, Terrence L.
We study the asymptotic properties of the sequence of iterates of weight-vector estimates obtained by training a multilayer feedforward neural network with a basic gradient-descent method using a fixed learning constant and no batch processing. In the one-dimensional case, an exact analysis establishes the existence of a limiting distribution that is not Gaussian in general. For the general case and small learning constant, a linearization approximation permits the application of results from the theory of random matrices to again establish the existence of a limiting distribution. We study the first few moments of this distribution to compare and contrast the results of our analysis with those of techniques of stochastic approximation.
1 INTRODUCTION
The wide applicability of neural networks to problems in pattern classification and signal processing has been due to the development of efficient gradient-descent algorithms for the supervised training of multilayer feedforward neural networks with differentiable node functions. A basic version uses a fixed learning constant and updates all weights after each training input is presented (online mode) rather than after the entire training set has been presented (batch mode). The properties of this algorithm, as exhibited by the sequence of iterates, are not yet well understood. There are at present two major approaches.
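The online mode described above can be sketched for a single weight. This is only an illustration under stated assumptions, not the authors' setup: the linear model y = w_true * x + noise, the squared-error loss, and the names `online_gradient_descent` and `tail_mean_and_variance` are all hypothetical. With a fixed learning constant the iterates do not settle on a point; they keep fluctuating around the target, and the spread of that fluctuation depends on the learning constant.

```python
import random


def online_gradient_descent(eta, n_steps, w_true=2.0, noise_std=0.5, seed=0):
    """Online (per-example) gradient descent on a one-weight linear model.

    Assumed toy problem: y = w_true * x + Gaussian noise, loss (y - w*x)^2 / 2.
    The learning constant eta is fixed and the weight is updated after every
    single example (no batching).
    """
    rng = random.Random(seed)
    w = 0.0
    iterates = []
    for _ in range(n_steps):
        x = rng.gauss(0.0, 1.0)
        y = w_true * x + rng.gauss(0.0, noise_std)
        # Gradient of (y - w*x)^2 / 2 with respect to w is -(y - w*x) * x.
        w += eta * (y - w * x) * x
        iterates.append(w)
    return iterates


def tail_mean_and_variance(iterates, burn_in):
    """Empirical mean and variance of the iterates after a burn-in period."""
    tail = iterates[burn_in:]
    mean = sum(tail) / len(tail)
    var = sum((w - mean) ** 2 for w in tail) / len(tail)
    return mean, var


if __name__ == "__main__":
    for eta in (0.01, 0.1):
        ws = online_gradient_descent(eta=eta, n_steps=20000)
        print(eta, tail_mean_and_variance(ws, burn_in=2000))
```

Comparing the two learning constants shows the trade-off the abstract alludes to: the larger constant forgets the starting point faster but leaves a wider spread of iterates around the target.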
Sample Size Requirements for Feedforward Neural Networks
Turmon, Michael J., Fine, Terrence L.
We estimate the number of training samples required to ensure that the performance of a neural network on its training data matches that obtained when fresh data is applied to the network. Existing estimates are higher by orders of magnitude than practice indicates. This work seeks to narrow the gap between theory and practice by transforming the problem into determining the distribution of the supremum of a random field in the space of weight vectors, which in turn is attacked by application of a recent technique called the Poisson clumping heuristic.
A Rigorous Analysis of Linsker-type Hebbian Learning
Feng, J., Pan, H., Roychowdhury, V. P.
We propose a novel rigorous approach for the analysis of Linsker's unsupervised Hebbian learning network. The behavior of this model is determined by the underlying nonlinear dynamics which are parameterized by a set of parameters originating from the Hebbian rule and the arbor density of the synapses. These parameters determine the presence or absence of a specific receptive field (also referred to as a 'connection pattern') as a saturated fixed point attractor of the model. In this paper, we perform a qualitative analysis of the underlying nonlinear dynamics over the parameter space, determine the effects of the system parameters on the emergence of various receptive fields, and predict precisely within which parameter regime the network will have the potential to develop a specially designated connection pattern. In particular, this approach exposes, for the first time, the crucial role played by the synaptic density functions, and provides a complete precise picture of the parameter space that defines the relationships among the different receptive fields. Our theoretical predictions are confirmed by numerical simulations.
Dynamic Modelling of Chaotic Time Series with Neural Networks
Principe, Jose C., Kuo, Jyh-Ming
In young barn owls raised with optical prisms over their eyes, these auditory maps are shifted to stay in register with the visual map, suggesting that the visual input imposes a frame of reference on the auditory maps. However, the optic tectum, the first site of convergence of visual with auditory information, is not the site of plasticity for the shift of the auditory maps; the plasticity occurs instead in the inferior colliculus, which contains an auditory map and projects into the optic tectum. We explored a model of the owl remapping in which a global reinforcement signal is delivered under the control of visual foveation. A Hebbian learning rule gated by reinforcement learned to adjust the auditory maps appropriately. In addition, reinforcement learning preferentially adjusted the weights in the inferior colliculus, as in the owl brain, even though the weights were allowed to change throughout the auditory system. This observation raises the possibility that the site of learning does not have to be genetically specified, but could be determined by how the learning procedure interacts with the network architecture.
On-line Learning of Dichotomies
Barkai, N., Seung, H. S., Sompolinsky, H.
The performance of online algorithms for learning dichotomies is studied. In online learning, the number of examples P is equivalent to the learning time, since each example is presented only once. The learning curve, or generalization error as a function of P, depends on the schedule at which the learning rate is lowered.
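A learning curve of this kind can be traced with a small simulation. The sketch below is an assumption-laden illustration, not the algorithm analysed in the paper: it uses a perceptron-style mistake-driven update, a noiseless teacher perceptron, Gaussian inputs, and a hypothetical schedule eta0 / (1 + decay * p). Each example is drawn fresh and used once, so the example count P is also the learning time.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def generalization_error(w, teacher):
    """Probability that student and teacher disagree on an isotropic Gaussian
    input: the angle between the two weight vectors divided by pi."""
    norm_w = math.sqrt(dot(w, w))
    norm_t = math.sqrt(dot(teacher, teacher))
    if norm_w == 0.0:
        return 0.5  # an all-zero student is treated as guessing at chance
    cosine = max(-1.0, min(1.0, dot(w, teacher) / (norm_w * norm_t)))
    return math.acos(cosine) / math.pi


def online_perceptron_curve(n_dim, n_examples, eta0=1.0, decay=0.0, seed=0):
    """Learning curve of a mistake-driven online perceptron.

    The learning rate at example p is eta0 / (1 + decay * p); this schedule is
    a hypothetical choice made for illustration only.
    """
    rng = random.Random(seed)
    teacher = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
    w = [0.0] * n_dim
    curve = []
    for p in range(1, n_examples + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]  # fresh example, used once
        label = 1 if dot(teacher, x) >= 0 else -1
        prediction = 1 if dot(w, x) >= 0 else -1
        if prediction != label:
            eta = eta0 / (1.0 + decay * p)
            w = [wi + eta * label * xi for wi, xi in zip(w, x)]
        curve.append(generalization_error(w, teacher))
    return curve


if __name__ == "__main__":
    for decay in (0.0, 0.01):
        curve = online_perceptron_curve(n_dim=10, n_examples=2000, decay=decay)
        print(decay, curve[99], curve[-1])
```

Running it with different `decay` values gives different curves for the same stream of examples, which is the dependence on the schedule that the abstract describes.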
Bias, Variance and the Combination of Least Squares Estimators
We consider the effect of combining several least squares estimators on the expected performance of a regression problem. Computing the exact bias and variance curves as a function of the sample size, we are able to quantitatively compare the effect of the combination on the bias and variance separately, and thus on the expected error, which is the sum of the two. Our exact calculations demonstrate that the combination of estimators is particularly useful in the case where the data set is small and noisy and the function to be learned is unrealizable. For large data sets the single estimator produces superior results. Finally, we show that by splitting the data set into several independent parts and training each estimator on a different subset, the performance can in some cases be significantly improved.
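The splitting scheme in the last sentence can be sketched mechanically. The code below is a minimal illustration, assuming a straight-line model and simple averaging of the fitted coefficients; the function names and the interleaved split are hypothetical, and the sketch shows only how the combination is formed, not when it helps.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_combined(xs, ys, n_parts):
    """Split the data into n_parts disjoint subsets, fit each one separately,
    and average the resulting estimators.

    For a model that is linear in its parameters, averaging the coefficients
    is the same as averaging the predictions.
    """
    fits = [fit_line(xs[k::n_parts], ys[k::n_parts]) for k in range(n_parts)]
    slope = sum(f[0] for f in fits) / n_parts
    intercept = sum(f[1] for f in fits) / n_parts
    return slope, intercept


def predict(fit, x):
    slope, intercept = fit
    return slope * x + intercept


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]
    print(fit_line(xs, ys))
    print(fit_combined(xs, ys, n_parts=2))
```

Each subset needs at least two distinct x values for the line fit to be defined. Whether the combined estimator beats the single one depends on the regime, as the abstract states: a straight line fitted to data that really is linear is the realizable case, not the small, noisy, unrealizable case where the combination is reported to pay off.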
Learning from queries for maximum information gain in imperfectly learnable problems
In supervised learning, learning from queries rather than from random examples can improve generalization performance significantly. We study the performance of query learning for problems where the student cannot learn the teacher perfectly, which occur frequently in practice. As a prototypical scenario of this kind, we consider a linear perceptron student learning a binary perceptron teacher. Two kinds of queries for maximum information gain, i.e., minimum entropy, are investigated: Minimum student space entropy (MSSE) queries, which are appropriate if the teacher space is unknown, and minimum teacher space entropy (MTSE) queries, which can be used if the teacher space is assumed to be known, but a student of a simpler form has deliberately been chosen. We find that for MSSE queries, the structure of the student space determines the efficacy of query learning, whereas MTSE queries lead to a higher generalization error than random examples, due to a lack of feedback about the progress of the student in the way queries are selected.
Learning Stochastic Perceptrons Under k-Blocking Distributions
Marchand, Mario, Hadjifaradji, Saeed
I} when the probability distribution that generates the input examples is a member of a family that we call k-blocking distributions. Such distributions represent an important step beyond the case where each input variable is statistically independent, since the 2k-blocking family contains all the Markov distributions of order k. By stochastic perceptron we mean a perceptron which, upon presentation of input vector x, outputs 1 with probability $f(\sum_i w_i x_i - \theta)$.
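The definition of a stochastic perceptron can be made concrete with a short sketch. The reading of the formula as f applied to the weighted input sum minus a threshold, the logistic choice of f, and the example weights are assumptions made for illustration; the source does not fix f here.

```python
import math
import random


def logistic(u):
    """One possible monotonic activation function (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-u))


def firing_probability(weights, threshold, x, f=logistic):
    """Probability of output 1: f(sum_i w_i * x_i - threshold)."""
    activation = sum(w * xi for w, xi in zip(weights, x)) - threshold
    return f(activation)


def stochastic_perceptron(weights, threshold, x, rng, f=logistic):
    """Draw one binary output: 1 with the firing probability, otherwise 0."""
    return 1 if rng.random() < firing_probability(weights, threshold, x, f) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    weights, threshold = [1, -1, 0], 0.0  # hypothetical example values
    x = [1, 0, 1]
    outputs = [stochastic_perceptron(weights, threshold, x, rng) for _ in range(10)]
    print(firing_probability(weights, threshold, x), outputs)
```

Repeated presentations of the same input give different outputs; only the frequency of 1s is determined by the weights and threshold, which is what a learner has to estimate from examples.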
Stochastic Dynamics of Three-State Neural Networks
We present here an analysis of the stochastic neurodynamics of a neural network composed of three-state neurons described by a master equation. An outer-product representation of the master equation is employed. In this representation, an extension of the analysis from two to three-state neurons is easily performed. We apply this formalism with approximation schemes to a simple three-state network and compare the results with Monte Carlo simulations.
Temporal Dynamics of Generalization in Neural Networks
Wang, Changfeng, Venkatesh, Santosh S.
This paper presents a rigorous characterization of how a general nonlinear learning machine generalizes during the training process when it is trained on a random sample using a gradient descent algorithm based on reduction of training error. It is shown, in particular, that best generalization performance occurs, in general, before the global minimum of the training error is achieved. The different roles played by the complexity of the machine class and the complexity of the specific machine in the class during learning are also precisely demarcated.
1 INTRODUCTION
In learning machines such as neural networks, two major factors that affect the 'goodness of fit' of the examples are network size (complexity) and training time. These are also the major factors that affect the generalization performance of the network. Many theoretical studies exploring the relation between generalization performance and machine complexity support the parsimony heuristics suggested by Occam's razor, to wit that amongst machines with similar training performance one should opt for the machine of least complexity.
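The effect of training time on generalization can be observed by recording a held-out error alongside the training error at every gradient step. The sketch below is a toy illustration only, assuming a degree-9 polynomial model, a sine target with Gaussian noise, and hypothetical names throughout; it is not the machine class or analysis of the paper.

```python
import math
import random


def features(x, degree):
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return [x ** k for k in range(degree + 1)]


def mean_squared_error(w, data, degree):
    total = 0.0
    for x, y in data:
        prediction = sum(wi * p for wi, p in zip(w, features(x, degree)))
        total += (prediction - y) ** 2
    return total / len(data)


def make_data(n, rng, noise_std=0.3):
    """Noisy samples of an assumed target function sin(3x) on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, math.sin(3.0 * x) + rng.gauss(0.0, noise_std)) for x in xs]


def train_with_validation(train, valid, degree=9, eta=0.1, n_steps=2000):
    """Full-batch gradient descent on half the training mean squared error,
    recording (training error, held-out error) after every step."""
    w = [0.0] * (degree + 1)
    history = []
    for _ in range(n_steps):
        grad = [0.0] * (degree + 1)
        for x, y in train:
            phi = features(x, degree)
            err = sum(wi * p for wi, p in zip(w, phi)) - y
            for k, p in enumerate(phi):
                grad[k] += err * p / len(train)
        w = [wi - eta * g for wi, g in zip(w, grad)]
        history.append((mean_squared_error(w, train, degree),
                        mean_squared_error(w, valid, degree)))
    return history


def best_validation_step(history):
    """Index of the step with the lowest held-out error."""
    return min(range(len(history)), key=lambda i: history[i][1])


if __name__ == "__main__":
    rng = random.Random(0)
    history = train_with_validation(make_data(12, rng), make_data(200, rng))
    step = best_validation_step(history)
    print(step, history[step], history[-1])
```

On small noisy samples the step returned by `best_validation_step` may fall well before the last step even though the training error keeps decreasing, but a single run of a toy model is an observation, not a demonstration of the paper's result.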