A Unifying View on Implicit Bias in Training Linear Neural Networks
Chulhee Yun, Shankar Krishnan, Hossein Mobahi
Overparametrized neural networks admit infinitely many solutions that achieve zero training error, and these global minima can differ widely in generalization performance. Moreover, training a neural network is a high-dimensional nonconvex problem that is typically intractable to solve. Nevertheless, the success of deep learning indicates that first-order methods such as gradient descent and stochastic gradient descent (GD/SGD) not only (a) succeed in finding global minima, but also (b) are biased towards solutions that generalize well; why this happens has largely remained a mystery in the literature. To explain part (a) of the phenomenon, there is a growing literature studying the convergence of GD/SGD on overparametrized neural networks (e.g., Du et al. (2018a,b); Allen-Zhu et al. (2018); Zou et al. (2018); Jacot et al. (2018); Oymak and Soltanolkotabi (2020), and many more). There are also convergence results that focus on linear networks, i.e., networks without nonlinear activations (Bartlett et al., 2018; Arora et al., 2019a; Wu et al., 2019; Du and Hu, 2019; Hu et al., 2020). These results typically focus on the convergence of the loss, and hence do not address which of the many global minima is reached.
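The implicit bias in (b) can be made concrete in the simplest linear setting. Below is a minimal sketch, not taken from the paper, assuming an underdetermined least-squares problem: gradient descent initialized at zero both reaches zero training error (part (a)) and, among the infinitely many interpolating solutions, converges to the one of minimum l2 norm (part (b)).

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's method):
# GD on (1/2n)||Xw - y||^2 with more parameters than samples,
# started at w = 0, stays in the row space of X and therefore
# converges to the minimum-l2-norm interpolating solution.
rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                      # the initialization drives the bias
lr = 1e-2
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y) / n  # gradient of the squared loss

# Closed-form minimum-norm interpolant: w* = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
print(np.linalg.norm(X @ w - y))     # ~0: a global minimum is found
print(np.linalg.norm(w - w_min_norm))  # ~0: and it is the min-norm one
```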
Oct-6-2020