Optimal Brain Surgeon
The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information
The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-versus-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework [LeCun et al., 1989, Hassibi and Stork, 1992, Hassibi et al., 1993], which leverages loss curvature information to make better pruning decisions. Yet these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that curvature information can be leveraged, in OBS-like fashion, within the projection step of classic iterative sparse recovery algorithms such as iterative hard thresholding (IHT). We show for the first time that this leads both to improved convergence bounds in well-behaved settings and to stronger practical convergence.
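To make the connection concrete, here is a minimal, hypothetical NumPy sketch of one IHT step whose projection ranks coordinates by OBS-style curvature saliency instead of raw magnitude; the function name iht_obs_step and the fully corrective re-fit on the kept support are illustrative choices, not necessarily the paper's exact algorithm.

```python
import numpy as np

def iht_obs_step(w, A, b, k):
    """One step of IHT on f(w) = 0.5*||A w - b||^2, where the usual
    magnitude-based hard threshold is replaced by an OBS-style
    projection: score coordinates by second-order saliency, keep the
    k costliest to remove, then re-fit them on the kept support.
    (Hypothetical sketch, not the paper's exact method.)"""
    H = A.T @ A                                    # Hessian of the quadratic loss
    H_inv = np.linalg.inv(H + 1e-6 * np.eye(H.shape[0]))
    grad = A.T @ (A @ w - b)
    z = w - grad / np.linalg.norm(H, 2)            # gradient step with a safe step size
    saliency = z**2 / (2 * np.diag(H_inv))         # OBS saliency per coordinate
    keep = np.argsort(saliency)[-k:]               # k weights most expensive to prune
    w_next = np.zeros_like(w)
    w_next[keep] = np.linalg.lstsq(A[:, keep], b, rcond=None)[0]
    return w_next
```

The least-squares re-fit plays the role of the OBS compensation update: after choosing the support, the surviving weights absorb the effect of the pruned ones.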
Reviews: Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
Summary: This paper adapts the Optimal Brain Surgeon (OBS) method to a layer-local version, modifying the objective function to be the target activation of each layer. Like OBS, it uses an approximation to compute the Hessian inverse by running through the dataset once. Compared to prior methods, it finishes compression with far fewer retraining iterations. A theoretical bound on the total error, based on the local reconstruction error, is provided. Pros: - The paper explores a local version of OBS and shows the effectiveness of the proposed method in terms of reduced time cost for retraining the pruned network.
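The key reason the layer-wise objective is tractable: for a least-squares reconstruction of a layer's outputs, the Hessian with respect to each row of the weight matrix is just the second moment of the layer's inputs, so a single pass over the calibration data suffices to estimate it. A minimal sketch, assuming a linear layer (the damping term and function name are our additions):

```python
import numpy as np

def layerwise_hessian_inverse(inputs, damp=1e-4):
    """For a layer-wise reconstruction loss 0.5*||W x - W_hat x||^2,
    the Hessian w.r.t. each row of W_hat is H = E[x x^T], estimated
    in one pass over the data.  (Sketch; the paper additionally uses
    a recursive inverse, as in OBS.)"""
    d = inputs.shape[1]
    H = np.zeros((d, d))
    for x in inputs:                   # single pass over the dataset
        H += np.outer(x, x)
    H /= len(inputs)
    H += damp * np.eye(d)              # damping keeps the inverse stable
    return np.linalg.inv(H)
```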
Second order derivatives for network pruning: Optimal Brain Surgeon
We investigate the use of information from all second order derivatives of the error function to perform network pruning (i.e., removing unimportant weights from a trained network) in order to improve generalization, simplify networks, reduce hardware or storage requirements, increase the speed of further training, and in some cases enable rule extraction. Our method, Optimal Brain Surgeon (OBS), is significantly better than magnitude-based methods and Optimal Brain Damage [LeCun, Denker and Solla, 1990], which often remove the wrong weights. OBS permits the pruning of more weights than other methods (for the same error on the training set), and thus yields better generalization on test data. Crucial to OBS is a recursion relation for calculating the inverse Hessian matrix H^-1 from training data and structural information of the net. OBS permits a 90%, a 76%, and a 62% reduction in weights over backpropagation with weight decay on three benchmark MONK's problems [Thrun et al., 1991].
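The recursion mentioned above approximates the Hessian by a running sum of per-sample gradient outer products and keeps its inverse updated with the Sherman-Morrison identity, so no explicit matrix inversion is needed. A sketch of that recursion and of the resulting prune-and-compensate step (variable names are ours; the initialization follows the paper's small-alpha convention H_0 = alpha*I):

```python
import numpy as np

def obs_inverse_hessian(grads, alpha=1e-4):
    """Recursive inverse-Hessian estimate: with H_{m+1} = H_m + x x^T / P,
    Sherman-Morrison gives
    H^-1 <- H^-1 - (H^-1 x)(H^-1 x)^T / (P + x^T H^-1 x)."""
    P = len(grads)
    H_inv = np.eye(grads[0].size) / alpha     # inverse of H_0 = alpha*I
    for x in grads:
        Hx = H_inv @ x
        H_inv -= np.outer(Hx, Hx) / (P + x @ Hx)
    return H_inv

def obs_prune_one(w, H_inv):
    """One OBS step: remove the weight with the smallest saliency
    L_q = w_q^2 / (2 [H^-1]_qq) and compensate the remaining weights
    via delta_w = -(w_q / [H^-1]_qq) * H^-1 e_q."""
    sal = w**2 / (2 * np.diag(H_inv))
    q = int(np.argmin(sal))
    w = w - (w[q] / H_inv[q, q]) * H_inv[:, q]
    w[q] = 0.0                                 # exact zero despite rounding
    return w, q
```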
Optimal Brain Surgeon: Extensions and performance comparisons
We extend Optimal Brain Surgeon (OBS) - a method for pruning networks - to allow for general error measures, and explore a reduced computational and storage implementation via a dominant eigenspace decomposition. Simulations on nonlinear, noisy pattern classification problems reveal that OBS does lead to improved generalization, and performs favorably in comparison with Optimal Brain Damage (OBD). We find that the required retraining steps in OBD may lead to inferior generalization, a result that can be interpreted as due to injecting noise back into the system. A common technique is to stop training of a large network at the minimum validation error. We found that the test error could be reduced even further by means of OBS (but not OBD) pruning.
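One way such an eigenspace decomposition can cut storage: keep only the top-k eigenpairs of the Hessian exactly and treat the discarded spectrum as flat, so only a d-by-k basis, k eigenvalues, and one scalar need to be stored. A sketch of the general idea; the paper's exact construction may differ.

```python
import numpy as np

def dominant_eigenspace_inverse(H, k):
    """Approximate H^-1 from the dominant eigenspace: if
    H ~ V diag(lam) V^T + mu (I - V V^T), then
    H^-1 ~ V diag(1/lam) V^T + (I - V V^T) / mu."""
    vals, vecs = np.linalg.eigh(H)            # ascending eigenvalues
    V, lam = vecs[:, -k:], vals[-k:]          # top-k eigenpairs
    mu = vals[:-k].mean()                     # flat value for the rest
    I = np.eye(H.shape[0])
    return V @ np.diag(1.0 / lam) @ V.T + (I - V @ V.T) / mu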
Recurrent Networks: Second Order Properties and Pruning
Second order properties of cost functions for recurrent networks are investigated. We analyze a layered fully recurrent architecture; the virtue of this architecture is that it features the conventional feedforward architecture as a special case. A detailed description of recursive computation of the full Hessian of the network cost function is provided. We discuss the possibility of invoking simplifying approximations of the Hessian and show how weight decay irons the cost function and thereby greatly assists training. We present tentative pruning results, using Hassibi et al.'s Optimal Brain Surgeon, demonstrating that recurrent networks can construct an efficient internal memory.
Early Brain Damage
Tresp, Volker, Neuneier, Ralph, Zimmermann, Hans-Georg
Optimal Brain Damage (OBD) is a method for reducing the number of weights in a neural network. OBD estimates the increase in cost function if weights are pruned, and is a valid approximation if the learning algorithm has converged to a local minimum. On the other hand, it is often desirable to terminate the learning process before a local minimum is reached (early stopping). In this paper we show that OBD estimates the increase in cost function incorrectly if the network is not in a local minimum. We also show how OBD can be extended such that it can be used in connection with early stopping. We call this new approach Early Brain Damage, EBD. EBD also allows reviving already pruned weights. We demonstrate the improvements achieved by EBD using three publicly available data sets.
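The distinction is easy to state: pruning weight w_q sets delta_w_q = -w_q, so to second order the cost change is delta_E ~ g_q * delta_w_q + 0.5 * h_qq * delta_w_q^2. OBD keeps only the quadratic term, which is valid exactly when the gradient g_q is zero. A minimal illustration (the off-minimum form is the generic Taylor estimate, not necessarily EBD's final formula):

```python
def obd_saliency(w_q, h_qq):
    """OBD estimate: assumes the gradient vanishes (network at a
    local minimum), so pruning w_q costs only the quadratic term."""
    return 0.5 * h_qq * w_q**2

def off_minimum_saliency(w_q, g_q, h_qq):
    """With early stopping, g_q is generally nonzero, so the
    first-order term must be kept; dropping it is exactly the error
    the paper identifies in OBD."""
    return -g_q * w_q + 0.5 * h_qq * w_q**2
```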