 Gradient Descent


A Study of Parallel Perturbative Gradient Descent

Neural Information Processing Systems

Motivated by the difficulty of implementing back-propagation [Rumelhart et al., 1986] and related algorithms in analog VLSI, since these algorithms calculate gradients from detailed knowledge of the neural network model, several recent papers have proposed a parallel [Alspector et al., 1993, Cauwenberghs, 1993, Kirk et al., 1993] or semi-parallel [Flower and Jabri, 1993] perturbative technique that measures the gradient with the physical neural network rather than calculating it. This technique is closely related to stochastic approximation methods [Kushner and Clark, 1978], which have recently been investigated by workers in fields other than neural networks.
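The idea of measuring a gradient by perturbation can be illustrated with a minimal Python sketch. It is not taken from any of the cited papers: the quadratic `loss` is a hypothetical stand-in for an error measured on physical hardware, and the names `perturbative_step` and `train`, the random-sign perturbation, the step size and the perturbation amplitude are all assumptions chosen for illustration.

```python
import random


def loss(w):
    # Hypothetical stand-in for a measured network error: a quadratic bowl.
    target = [1.0, -2.0, 0.5]
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def perturbative_step(w, measure, sigma=1e-3, lr=0.05, rng=random):
    # Perturb every weight at once with a random sign (the "parallel" part).
    signs = [rng.choice((-1.0, 1.0)) for _ in w]
    w_plus = [wi + sigma * s for wi, s in zip(w, signs)]
    w_minus = [wi - sigma * s for wi, s in zip(w, signs)]
    # Two error measurements give the slope along the perturbation direction;
    # no model of the network is differentiated.
    slope = (measure(w_plus) - measure(w_minus)) / (2.0 * sigma)
    # Each weight moves against the measured slope, in the direction of its own sign.
    return [wi - lr * slope * s for wi, s in zip(w, signs)]


def train(steps=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        w = perturbative_step(w, loss, rng=rng)
    return w


if __name__ == "__main__":
    w = train()
    print(w, loss(w))
```

Each update needs only two evaluations of the error, regardless of the number of weights. The price is a noisy estimate: a single step follows the gradient only along one random direction, and the steps match ordinary gradient descent only on average.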


A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics

Neural Information Processing Systems

The recurrent network, containing six continuous-time analog neurons and 42 free parameters (connection strengths and thresholds), is trained to generate time-varying outputs approximating given periodic signals presented to the network. The chip implements a stochastic perturbative algorithm, which observes the error gradient along random directions in the parameter space for error-descent learning. In addition to the integrated learning functions and the generation of pseudo-random perturbations, the chip provides for teacher forcing and long-term storage of the volatile parameters. The network learns a 1 kHz circular trajectory in 100 sec. The chip occupies 2 mm x 2 mm in a 2 µm CMOS process, and dissipates 1.2 mW.

1 Introduction

Exact gradient-descent algorithms for supervised learning in dynamic recurrent networks [1-3] are fairly complex and do not provide for a scalable implementation in a standard 2-D VLSI process. We have implemented a fairly simple and scalable

Present address: Johns Hopkins University, ECE Dept., Baltimore MD 21218-2686.


A Unified Gradient-Descent/Clustering Architecture for Finite State Machine Induction

Neural Information Processing Systems

Researchers often try to understand, post hoc, the representations that emerge in the hidden layers of a neural net following training. Interpretation is difficult because these representations are typically highly distributed and continuous. By "continuous," we mean that if one constructed a scatterplot over the hidden unit activity space of patterns obtained in response to various inputs, examination at any scale would reveal the patterns to be broadly distributed over the space.



Credit Assignment through Time: Alternatives to Backpropagation

Neural Information Processing Systems

Learning to recognize or predict sequences using long-term context has many applications. However, practical and theoretical problems are found in training recurrent neural networks to perform tasks in which input/output dependencies span long intervals. Starting from a mathematical analysis of the problem, we consider and compare alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled. Results on the new algorithms show performance qualitatively superior to that obtained with backpropagation.

1 Introduction

Recurrent neural networks have been considered to learn to map input sequences to output sequences. Machines that could efficiently learn such tasks would be useful for many applications involving sequence prediction, recognition or production. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. In fact, we can prove that dynamical systems such as recurrent neural networks will be increasingly difficult to train with gradient descent as the duration of the dependencies to be captured increases. A mathematical analysis of the problem shows that either one of two conditions arises in such systems.
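A small Python sketch can illustrate why the span of a dependency matters for gradient descent. It is not the paper's analysis: the one-unit recurrence `x_t = tanh(w * x_{t-1})`, the function name `state_sensitivity`, and the values of `w` and `x0` are assumptions chosen for illustration. By the chain rule, the sensitivity of a late state to an early one is a product of per-step derivatives, so it can shrink geometrically with the number of steps.

```python
import math


def state_sensitivity(w, steps, x0=0.5):
    """Return d x_T / d x_0 for the scalar recurrence x_t = tanh(w * x_{t-1})."""
    x = x0
    grad = 1.0
    for _ in range(steps):
        x = math.tanh(w * x)
        # Chain rule: each step multiplies the sensitivity by w * (1 - x_t^2).
        grad *= w * (1.0 - x * x)
    return grad


if __name__ == "__main__":
    for span in (1, 10, 50):
        print(span, state_sensitivity(0.9, span))
```

With `w = 0.9` every factor lies between 0 and 0.9, so the sensitivity after `T` steps is bounded by `0.9 ** T`. An error signal at time `T` therefore carries very little information about how to change what happened `T` steps earlier.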


