Goto

Collaborating Authors

 Backpropagation


Geometry of Early Stopping in Linear Networks

Neural Information Processing Systems

A theory of early stopping as applied to linear models is presented. The backpropagation learning algorithm is modeled as gradient descent in continuous time. Given a training set and a validation set, all weight vectors found by early stopping must lie on a certain quadric surface, usually an ellipsoid. Given a training set and a candidate early stopping weight vector, all validation sets have least-squares weights lying on a certain plane. This latter fact can be exploited to estimate the probability of stopping at any given point along the trajectory from the initial weight vector to the leastsquares weights derived from the training set, and to estimate the probability that training goes on indefinitely. The prospects for extending this theory to nonlinear models are discussed.


SPERT-II: A Vector Microprocessor System and its Application to Large Problems in Backpropagation Training

Neural Information Processing Systems

We report on our development of a high-performance system for neural network and other signal processing applications. We have designed and implemented a vector microprocessor and packaged it as an attached processor for a conventional workstation.


Tempering Backpropagation Networks: Not All Weights are Created Equal

Neural Information Processing Systems

Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for ''tempering'' each weight accordingly. In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that ofthe residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights. This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.


Geometry of Early Stopping in Linear Networks

Neural Information Processing Systems

A theory of early stopping as applied to linear models is presented. The backpropagation learning algorithm is modeled as gradient descent in continuous time. Given a training set and a validation set, all weight vectors found by early stopping must lie on a certain quadric surface, usually an ellipsoid. Given a training set and a candidate early stopping weight vector, all validation sets have least-squares weights lying on a certain plane. This latter fact can be exploited to estimate the probability of stopping at any given point along the trajectory from the initial weight vector to the leastsquares weights derived from the training set, and to estimate the probability that training goes on indefinitely. The prospects for extending this theory to nonlinear models are discussed.


Geometry of Early Stopping in Linear Networks

Neural Information Processing Systems

A theory of early stopping as applied to linear models is presented. The backpropagation learning algorithm is modeled as gradient descent in continuous time. Given a training set and a validation set, all weight vectors found by early stopping must lie on a certain quadricsurface, usually an ellipsoid. Given a training set and a candidate early stopping weight vector, all validation sets have least-squares weights lying on a certain plane. This latter fact can be exploited to estimate the probability of stopping at any given point along the trajectory from the initial weight vector to the leastsquares weightsderived from the training set, and to estimate the probability that training goes on indefinitely. The prospects for extending this theory to nonlinear models are discussed.


SPERT-II: A Vector Microprocessor System and its Application to Large Problems in Backpropagation Training

Neural Information Processing Systems

We report on our development of a high-performance system for neural network and other signal processing applications. We have designed and implemented a vector microprocessor and packaged itas an attached processor for a conventional workstation.


Tempering Backpropagation Networks: Not All Weights are Created Equal

Neural Information Processing Systems

Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural informationinstead of discarding it, and introduce a framework for ''tempering'' each weight accordingly. In the tempering model, activation and error signals are treated as approximately independentrandom variables. The characteristic scale of weight changes is then matched to that ofthe residuals, allowing structural properties suchas a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights. This approach yields hitherto unparalleled performance on the family relations benchmark,a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.


A Lagrangian Formulation For Optical Backpropagation Training In Kerr-Type Optical Networks

Neural Information Processing Systems

A training method based on a form of continuous spatially distributed optical error back-propagation is presented for an all optical network composed of nondiscrete neurons and weighted interconnections. The all optical network is feed-forward and is composed of thin layers of a Kerrtype self focusing/defocusing nonlinear optical material. The training method is derived from a Lagrangian formulation of the constrained minimization of the network error at the output. This leads to a formulation that describes training as a calculation of the distributed error of the optical signal at the output which is then reflected back through the device to assign a spatially distributed error to the internal layers. This error is then used to modify the internal weighting values. Results from several computer simulations of the training are presented, and a simple optical table demonstration of the network is discussed.


Learning Many Related Tasks at the Same Time with Backpropagation

Neural Information Processing Systems

Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mustafa's hints work [1] shows that the outputs of a backprop net can be used as inputs through which domainspecific information can be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multi task backprop generalizes better in real domains.