Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory
Goldblum, Micah, Geiping, Jonas, Schwarzschild, Avi, Moeller, Michael, Goldstein, Tom
ABSTRACT

We empirically evaluate common assumptions about neural networks that are widely held by practitioners and theorists alike. We study the prevalence of local minima in loss landscapes, whether small-norm parameter vectors generalize better (and whether this explains the advantages of weight decay), whether wide-network theories (like the neural tangent kernel) describe the behaviors of classifiers, and whether the rank of weight matrices can be linked to generalization and robustness in real-world networks.

In statistical learning, principled kernel methods have vastly improved the performance of SVMs and PCA (Suykens & Vandewalle, 1999; Schölkopf et al., 1997), and boosting theory has enabled weak learners to generate strong classifiers (Schapire, 1990). Optimizers in deep learning are borrowed from the field of convex optimization, where momentum optimizers (Nesterov, 1983) and conjugate gradient methods provably solve ill-conditioned problems with high efficiency (Hestenes & Stiefel, 1952). Deep learning harnesses foundational tools from these mature parent fields.

Despite its rigorous roots, deep learning has driven a wedge between theory and practice. Recent theoretical work has certainly made impressive strides towards understanding optimization and generalization in neural networks. But doing so has required researchers to make strong assumptions and study restricted model classes. In this paper, we seek to understand whether deep learning theories accurately capture the behaviors and network properties that make realistic deep networks work. Following a line of previous work, such as Swirszcz et al. (2016), Zhang et al. (2016), Balduzzi et al. (2017) and Santurkar et al. (2018), we put the assumptions and conclusions of deep learning theory to the test using experiments with both toy networks and realistic ones.
We focus on the following important theoretical issues:

- Local minima: Numerous theoretical works argue that all local minima of neural loss functions are globally optimal or that all local minima are nearly optimal. In practice, we find [...]

Authors contributed equally.
arXiv:1910.00359v1

[...] Yet for neural networks, it is not at all clear which form of ℓ2-regularization is optimal.
Oct-1-2019