BackPACK: Packing more into backprop
Dangel, Felix, Kunstner, Frederik, Hennig, Philipp
Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-learning software does not support their automatic calculation. Manually implementing them is burdensome, inefficient if done naïvely, and the resulting code is rarely shared. This hampers progress in deep learning, and unnecessarily narrows research to focus on gradient descent and its variants; it also complicates replication studies and comparisons between newly developed methods that require those quantities, to the point of impossibility. BackPACK's capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and by an example application that tests several recent curvature approximations for optimization.
The success of deep learning and the applications it fuels can be traced to the popularization of automatic differentiation frameworks. However, this specialization also has its shortcomings: it assumes the user only wants to compute gradients or, more precisely, the average of gradients across a mini-batch of examples. Other quantities can also be computed with automatic differentiation at a cost comparable to the gradient backpropagation pass, or with minimal overhead; for example, approximate second-order information or the variance of gradients within the batch. These quantities are valuable for understanding the geometry of deep neural networks, for identifying free parameters, and for developing more efficient optimization algorithms.
But researchers who want to investigate their use face a chicken-and-egg problem: automatic differentiation tools required to go beyond standard gradient methods are not available, but there is no incentive for their implementation in existing deep-learning software as long as no large portion of the users need it. Second-order methods for deep learning have been continuously investigated for decades (e.g., Becker & Le Cun, 1989; Amari, 1998; Bordes et al., 2009; Martens & Grosse, 2015).
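The difference between the averaged mini-batch gradient and a per-sample statistic such as the gradient variance can be illustrated with a minimal pure-Python sketch. It does not use BackPACK or any automatic differentiation framework. The toy linear model, the squared loss, the data, and the function names `per_sample_gradients` and `gradient_mean_and_variance` are all assumptions made for illustration. The point is only that once individual gradients are available, their mean and element-wise variance follow from the first and second moments in the same pass.

```python
# Hypothetical toy model (an assumption, not taken from the paper):
# f(x) = w . x with squared loss 0.5 * (f(x) - y)^2.
# The per-sample gradient with respect to w is (w . x - y) * x.

def per_sample_gradients(w, xs, ys):
    """Return one gradient vector per example in the mini-batch."""
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([residual * xi for xi in x])
    return grads


def gradient_mean_and_variance(grads):
    """Element-wise mean and variance of the individual gradients."""
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(dim)]
    second_moment = [sum(g[j] ** 2 for g in grads) / n for j in range(dim)]
    # Var[g] = E[g^2] - (E[g])^2, computed per parameter.
    variance = [second_moment[j] - mean[j] ** 2 for j in range(dim)]
    return mean, variance


# Illustrative mini-batch of four examples with two features each.
w = [1.0, -1.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [0.0, 1.0, 1.0, 0.0]

grads = per_sample_gradients(w, xs, ys)
mean, variance = gradient_mean_and_variance(grads)
print("average gradient:", mean)
print("gradient variance:", variance)
```

A standard framework would return only the quantity printed as the average gradient; the variance requires access to the individual gradients before they are averaged.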
Dec-23-2019