Gradient Descent
Leaning by Combining Memorization and Gradient Descent
We have created a radial basis function network that allocates a new computational unit whenever an unusual pattern is presented to the network. The network learns by allocating new units and adjusting the parameters of existing units. If the network performs poorly on a presented pattern, then a new unit is allocated which memorizes the response to the presented pattern. If the network performs well on a presented pattern, then the network parameters are updated using standard LMS gradient descent. For predicting the Mackey Glass chaotic time series, our network learns much faster than do those using back-propagation and uses a comparable number of synapses.
Gradient Descent: Second Order Momentum and Saturating Error
Batch gradient descent, w(t) -7JdE/dw(t), conver es to a minimum of quadratic form with a time constant no better than '4Amax/ Amin where Amin and Amax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term w(t) -7JdE/dw(t) Q' w(t - 1) improves this to VAmax/ Amin, although only in the batch case. Here we show that second(cid:173) order momentum, w(t) -7JdE/dw(t) Q' w(t -1) (3 w(t - 2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a non quadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
Towards Faster Stochastic Gradient Search
Stochastic gradient descent is a general algorithm which includes LMS, on-line backpropagation, and adaptive k-means clustering as special cases. The standard choices of the learning rate 1] (both adaptive and fixed func(cid:173) tions of time) often perform quite poorly. In contrast, our recently pro(cid:173) posed class of "search then converge" learning rate schedules (Darken and Moody, 1990) display the theoretically optimal asymptotic convergence rate and a superior ability to escape from poor local minima. However, the user is responsible for setting a key parameter. We propose here a new method(cid:173) ology for creating the first completely automatic adaptive learning rates which achieve the optimal rate of convergence.
Analog VLSI Implementation of Multi-dimensional Gradient Descent
We describe an analog VLSI implementation of a multi-dimensional gradient estimation and descent technique for minimizing an on(cid:173) chip scalar function fO. The implementation uses noise injec(cid:173) tion and multiplicative correlation to estimate derivatives, as in [Anderson, Kerns 92]. One intended application of this technique is setting circuit parameters on-chip automatically, rather than manually [Kirk 91]. Gradient descent optimization may be used to adjust synapse weights for a backpropagation or other on-chip learning implementation. The approach combines the features of continuous multi-dimensional gradient descent and the potential for an annealing style of optimization.
Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistence to Local Minima
In this paper we discuss the asymptotic properties of the most com(cid:173) monly used variant of the backpropagation algorithm in which net(cid:173) work weights are trained by means of a local gradient descent on ex(cid:173) amples drawn randomly from a fixed training set, and the learning rate TJ of the gradient updates is held constant (simple backpropa(cid:173) gation). Using stochastic approximation results, we show that for TJ 0 this training process approaches a batch training and pro(cid:173) vide results on the rate of convergence. Further, we show that for small TJ one can approximate simple back propagation by the sum of a batch training process and a Gaussian diffusion which is the unique solution to a linear stochastic differential equation. Using this approximation we indicate the reasons why simple backprop(cid:173) agation is less likely to get stuck in local minima than the batch training process and demonstrate this empirically on a number of examples.
A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks
Typical methods for gradient descent in neural network learning involve calculation of derivatives based on a detailed knowledge of the network model. This requires extensive, time consuming calculations for each pat(cid:173) tern presentation and high precision that makes it difficult to implement in VLSI. We present here a perturbation technique that measures, not calculates, the gradient. Since the technique uses the actual network as a measuring device, errors in modeling neuron activation and synaptic weights do not cause errors in gradient descent. The method is parallel in nature and easy to implement in VLSI.
A Unified Gradient-Descent/Clustering Architecture for Finite State Machine Induction
Although recurrent neural nets have been moderately successful in learning to emulate finite-state machines (FSMs), the continu(cid:173) ous internal state dynamics of a neural net are not well matched to the discrete behavior of an FSM. We describe an architecture, called DOLCE, that allows discrete states to evolve in a net as learn(cid:173) ing progresses. DOLCE consists of a standard recurrent neural net trained by gradient descent and an adaptive clustering technique that quantizes the state space. DOLCE is based on the assumption that a finite set of discrete internal states is required for the task, and that the actual network state belongs to this set but has been corrupted by noise due to inaccuracy in the weights. DOLCE learns to recover the discrete state with maximum a posteriori probabil(cid:173) ity from the noisy state.
GDS: Gradient Descent Generation of Symbolic Classification Rules
Imagine you have designed a neural network that successfully learns a complex classification task. What are the relevant input features the classifier relies on and how are these features combined to pro(cid:173) duce the classification decisions? There are applications where a deeper insight into the structure of an adaptive system and thus into the underlying classification problem may well be as important as the system's performance characteristics, e.g. in economics or medicine. GDSi is a backpropagation-based training scheme that produces networks transformable into an equivalent and concise set of IF-THEN rules. This is achieved by imposing penalty terms on the network parameters that adapt the network to the expressive power of this class of rules.
A Study of Parallel Perturbative Gradient Descent
We have continued our study of a parallel perturbative learning method [Alspector et al., 1993] and implications for its implemen(cid:173) tation in analog VLSI. Our new results indicate that, in most cases, a single parallel perturbation (per pattern presentation) of the func(cid:173) tion parameters (weights in a neural network) is theoretically the best course. This is not true, however, for certain problems and may not generally be true when faced with issues of implemen(cid:173) tation such as limited precision. In these cases, multiple parallel perturbations may be best as indicated in our previous results.
Predicting the Risk of Complications in Coronary Artery Bypass Operations using Neural Networks
Experiments demonstrated that sigmoid multilayer perceptron (MLP) networks provide slightly better risk prediction than conventional logistic regression when used to predict the risk of death, stroke, and renal failure on 1257 patients who underwent coronary artery bypass operations at the Lahey Clinic. MLP networks with no hidden layer and networks with one hidden layer were trained using stochastic gradient descent with early stopping. MLP networks and logistic regression used the same input features and were evaluated using bootstrap sampling with 50 replications. ROC areas for predicting mortality using preoperative input features were 70.5% for logistic regression and 76.0% for MLP networks. Regularization provided by early stopping was an important component of improved perfonnance.