 path-sgd



Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Neural Information Processing Systems

We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
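The rescaling the abstract refers to can be illustrated with a small sketch. It assumes a bias-free two-layer ReLU network and a squared loss; the weight values and the helper names (`forward`, `rescale_unit`, `sgd_step`) are hypothetical and chosen only for illustration, not taken from the paper. Multiplying the incoming weights of a hidden unit by c > 0 and dividing its outgoing weights by c leaves the network function unchanged, yet one plain SGD step from the two equivalent parameterizations leads to different functions.

```python
import numpy as np

def forward(W1, W2, x):
    """Bias-free two-layer ReLU network: x -> W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def rescale_unit(W1, W2, j, c):
    """Scale incoming weights of hidden unit j by c > 0 and outgoing weights by 1/c."""
    W1r, W2r = W1.copy(), W2.copy()
    W1r[j, :] *= c
    W2r[:, j] /= c
    return W1r, W2r

def sgd_step(W1, W2, x, y, lr):
    """One plain gradient step on the squared loss 0.5 * ||f(x) - y||^2."""
    z = W1 @ x
    h = np.maximum(z, 0.0)
    err = W2 @ h - y
    grad_W1 = np.outer((W2.T @ err) * (z > 0), x)
    grad_W2 = np.outer(err, h)
    return W1 - lr * grad_W1, W2 - lr * grad_W2

# Illustrative values only (assumed, not from the paper).
W1 = np.array([[1.0, 2.0], [0.5, -1.0]])
W2 = np.array([[1.0, -2.0]])
x, y = np.array([1.0, 1.0]), np.array([0.0])
x_test = np.array([2.0, 0.5])

W1r, W2r = rescale_unit(W1, W2, j=0, c=10.0)

# Both parameterizations compute the same function before the update.
same_function = np.allclose(forward(W1, W2, x_test), forward(W1r, W2r, x_test))

# After one SGD step from each starting point, the functions no longer agree.
after_plain = forward(*sgd_step(W1, W2, x, y, lr=0.01), x_test)
after_rescaled = forward(*sgd_step(W1r, W2r, x, y, lr=0.01), x_test)
sgd_gap = float(np.max(np.abs(after_plain - after_rescaled)))
```

Here `same_function` is true while `sgd_gap` is far from zero, which is the sense in which plain SGD is not invariant to this reparameterization.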


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Deep rectified neural networks are over-parameterized in the sense that scaling the weights in one layer can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled and shown to yield faster convergence for a standard 2-layer rectifier network across a variety of datasets (MNIST, CIFAR-10, CIFAR-100, SVHN). As an algorithm, Path-SGD appears effective, simple to implement, and addresses an obvious flaw in first-order updates to ReLU networks.
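A minimal sketch of such a rescaling-invariant update follows, under stated assumptions: a bias-free two-layer ReLU network, a squared loss, and a per-weight scaling equal to the sum, over all input-to-output paths through that weight, of the product of the squared values of the other weights on the path. This is one reading of the path-normalized update for the p = 2 case; the exact form in the paper may differ, and the helper names (`grads`, `path_scaling`, `path_sgd_step`) are hypothetical.

```python
import numpy as np

def forward(W1, W2, x):
    """Bias-free two-layer ReLU network: x -> W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def grads(W1, W2, x, y):
    """Gradients of the squared loss 0.5 * ||f(x) - y||^2 w.r.t. W1 and W2."""
    z = W1 @ x
    h = np.maximum(z, 0.0)
    err = W2 @ h - y
    return np.outer((W2.T @ err) * (z > 0), x), np.outer(err, h)

def path_scaling(W1, W2):
    """Per-weight scaling for a two-layer net (assumed p = 2 form).

    A path through W1[j, i] continues over W2[o, j] for each output o, so its
    scaling is sum_o W2[o, j]^2. A path through W2[o, j] arrives over W1[j, i]
    for each input i, so its scaling is sum_i W1[j, i]^2.
    Assumes no hidden unit has all-zero incoming or outgoing weights.
    """
    out_sq = np.sum(W2 ** 2, axis=0)   # shape (hidden,)
    in_sq = np.sum(W1 ** 2, axis=1)    # shape (hidden,)
    scale_W1 = np.broadcast_to(out_sq[:, None], W1.shape)
    scale_W2 = np.broadcast_to(in_sq[None, :], W2.shape)
    return scale_W1, scale_W2

def path_sgd_step(W1, W2, x, y, lr):
    """Gradient step with each coordinate divided by its path scaling."""
    g1, g2 = grads(W1, W2, x, y)
    s1, s2 = path_scaling(W1, W2)
    return W1 - lr * g1 / s1, W2 - lr * g2 / s2

# Illustrative values only (assumed, not from the paper).
W1 = np.array([[1.0, 2.0], [0.5, -1.0]])
W2 = np.array([[1.0, -2.0]])
x, y = np.array([1.0, 1.0]), np.array([0.0])
x_test = np.array([2.0, 0.5])

# Equivalent parameterization: hidden unit 0 rescaled by c = 10.
c = 10.0
W1r, W2r = W1.copy(), W2.copy()
W1r[0, :] *= c
W2r[:, 0] /= c

A1, A2 = path_sgd_step(W1, W2, x, y, lr=0.01)
B1, B2 = path_sgd_step(W1r, W2r, x, y, lr=0.01)

# The two updated networks are again rescalings of each other,
# so they compute the same function.
still_equivalent = np.allclose(forward(A1, A2, x_test), forward(B1, B2, x_test))
```

In this toy setting the updated weights from the rescaled start are exactly the rescaled version of the updated weights from the original start (`B1[0]` equals `c * A1[0]` and `B2[:, 0]` equals `A2[:, 0] / c`), so the update commutes with the rescaling.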


Reviews: Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

Neural Information Processing Systems

This seems to be a worthwhile goal (since plain RNNs are computationally cheaper and easier to analyze theoretically) and their experiments show some promising results in improving performance over plain RNNs trained with existing optimization methods. However, it is not clear to me how the method that the authors use in practice differs significantly from regular Path-SGD introduced in previous work. The authors do present an adaptation of Path-SGD to networks with shared weights, and show that the new rescaling term applied to the gradients can be divided into two terms k1 and k2. But then, they note that the second term, which accounts for interactions between shared weights along the same path, is expensive to calculate for RNNs and show some empirical evidence that including it does not help performance. In the rest of the experiments, they ignore the second term, which to my understanding is essentially what makes the method introduced here different from regular Path-SGD.


Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

Neural Information Processing Systems

We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of the path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.

