Reviews: Towards Understanding the Importance of Shortcut Connections in Residual Networks
–Neural Information Processing Systems
The paper investigates the outcome of training a one-hidden-layer convolutional residual network with gradient descent when the input is sampled from a standard Gaussian distribution. As a follow-up to a similar analysis by Du et al. (2017) for CNNs, this paper shows that, for ResNets, there exist two fixed points of the teacher-student loss function (the network architecture is the same for both teacher and student). One is a global minimum; the other is a spurious fixed point. The authors then derive *sufficient* conditions on the parameter initialization and learning rates such that training happens in two phases: 1. a first phase in which the hidden layer weights (w) remain away from the spurious fixed point (due to a sufficiently small learning rate) while the last layer weights (a) approach the optimal value and eventually enter the region where the inner product satisfies a'a* 0; 2. a second phase in which both parameters approach the global minimum, where the learning rate for w can be larger, allowing faster convergence. I find this paper very interesting, as it provides novel insights into the optimization process of ResNets, albeit in a very restricted setting.
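To make the setting concrete, here is a minimal NumPy sketch of a teacher-student experiment of the kind described above. It is not the authors' code, and several details are assumptions introduced only for illustration: the model form a' relu(Z'(v + w)) with a fixed shortcut vector v, the patch layout of Z, the squared loss, the problem sizes, and the learning rates. It also omits any normalization or projection step the paper may use. The sketch runs full-batch gradient descent with separate learning rates for w and a, and records the loss together with the inner product a'a* at every step.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def forward(Z, v, w, a):
    # Z has shape (n, p, k): n samples, each made of k patches of dimension p.
    # Assumed model: the shared filter is the shortcut v plus the residual w.
    h = relu(np.einsum("npk,p->nk", Z, v + w))  # hidden activations, (n, k)
    return h @ a                                # predictions, (n,)


def loss_and_grads(Z, y, v, w, a):
    # Squared loss 0.5 * mean((prediction - y)^2) and its gradients in w and a.
    n = Z.shape[0]
    pre = np.einsum("npk,p->nk", Z, v + w)      # pre-activations, (n, k)
    h = relu(pre)
    err = h @ a - y                             # residuals, (n,)
    loss = 0.5 * np.mean(err ** 2)
    grad_a = h.T @ err / n                      # (k,)
    gate = (pre > 0) * a                        # ReLU derivative times a, (n, k)
    grad_w = np.einsum("n,nk,npk->p", err, gate, Z) / n  # (p,)
    return loss, grad_w, grad_a


def train(Z, y, v, w, a, a_star, lr_w, lr_a, steps):
    # Plain gradient descent with separate step sizes for w and a.
    history = []
    for _ in range(steps):
        loss, grad_w, grad_a = loss_and_grads(Z, y, v, w, a)
        history.append((loss, float(a @ a_star)))  # track loss and a'a*
        w = w - lr_w * grad_w
        a = a - lr_a * grad_a
    return w, a, history


# Hypothetical problem sizes and seed, chosen only for this illustration.
rng = np.random.default_rng(0)
n, p, k = 2000, 8, 4
Z = rng.standard_normal((n, p, k))              # standard Gaussian input
v = np.zeros(p)
v[0] = 1.0                                      # fixed shortcut direction (assumed)
w_star = 0.1 * rng.standard_normal(p)           # teacher hidden-layer weights
a_star = rng.standard_normal(k)                 # teacher output weights
y = forward(Z, v, w_star, a_star)               # teacher labels

w0 = np.zeros(p)                                # student starts at w = 0
a0 = 0.1 * rng.standard_normal(k)
w_hat, a_hat, history = train(
    Z, y, v, w0, a0, a_star, lr_w=0.01, lr_a=0.05, steps=200
)
```

Inspecting `history` shows how the loss and a'a* evolve over training. A two-phase schedule in the spirit of the review could be approximated by calling `train` twice, first with a small `lr_w` and then with a larger one, though the specific thresholds for switching would have to come from the paper.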
Jan-24-2025, 21:18:00 GMT