Reviews: Global Convergence of Gradient Descent for Deep Linear Residual Networks

Jan-21-2025, 21:00:32 GMT–Neural Information Processing Systems

Response to authors' feedback: I thank the authors for the rebuttal. My score remains the same. With this initialization, the networks are shown to converge linearly to zero loss, under conditions (for discrete-time GD) that are different from, and perhaps conceptually simpler than, those of previous works. For instance, compared to reference [2] (Arora et al., "A convergence analysis of gradient descent for deep linear neural networks", ICLR 2019), this work completely removes the delta-balanced condition in [2] by showing that this condition actually holds, for most layers, on the GD trajectory (Lemma 4.2 and Eq. ...). While certain elements have already been seen in previous works (e.g. the property in Lemma 4.2 is similar to the delta-balanced condition in [2], and the requirement of zero initialization for the last layer's weight has been seen in the "fixup initialization" of reference [21] in the context of residual networks), I think the proposed initialization and the convergence analysis here deserve credit for novelty.
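To make the two points above concrete (linear convergence from a zero-initialized last layer, and balancedness holding for most layers along the GD trajectory), here is a minimal NumPy sketch. It is not the authors' code. It assumes a square model U (I + W_L) ... (I + W_1), whitened inputs (so the loss reduces to a matrix-fitting problem), zero initialization of U and of every residual block W_k, and hypothetical values for the width, depth, learning rate, and step count. The helper names `train`, `end_to_end`, and `balance_gaps` are illustrative only. The balancedness gap is the quantity from [2], ||W_{j+1}^T W_{j+1} - W_j W_j^T||_F, evaluated here on the factors I + W_k and U.

```python
import numpy as np


def end_to_end(U, Ws):
    """Return the end-to-end matrix U (I + W_L) ... (I + W_1)."""
    d = U.shape[1]
    P = np.eye(d)
    for W in Ws:
        P = (np.eye(d) + W) @ P
    return U @ P


def train(A, depth=5, lr=0.01, steps=2000):
    """Plain GD on 0.5 * ||U (I + W_L) ... (I + W_1) - A||_F^2.

    Assumptions (not taken from the paper): whitened inputs, square layers,
    and zero initialization of both U and all residual blocks.
    """
    d = A.shape[0]
    I = np.eye(d)
    U = np.zeros((d, d))                           # zero-initialized last layer
    Ws = [np.zeros((d, d)) for _ in range(depth)]  # zero residual blocks
    losses = []
    for _ in range(steps):
        # below[k]: product of the factors applied before layer k (below[0] = I).
        below = [I]
        for W in Ws:
            below.append((I + W) @ below[-1])
        # above[k]: U times the factors applied after layer k.
        above = [None] * depth
        acc = U
        for k in reversed(range(depth)):
            above[k] = acc
            acc = acc @ (I + Ws[k])
        E = acc - A                                # residual of the end-to-end map
        losses.append(0.5 * np.sum(E ** 2))
        grad_U = E @ below[depth].T
        grad_Ws = [above[k].T @ E @ below[k].T for k in range(depth)]
        U = U - lr * grad_U
        Ws = [W - lr * g for W, g in zip(Ws, grad_Ws)]
    return U, Ws, losses


def balance_gaps(U, Ws):
    """Frobenius balancedness gaps between consecutive factors (last entry involves U)."""
    d = U.shape[1]
    factors = [np.eye(d) + W for W in Ws] + [U]
    return [
        np.linalg.norm(factors[j + 1].T @ factors[j + 1] - factors[j] @ factors[j].T)
        for j in range(len(factors) - 1)
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4)) / 2            # hypothetical target matrix
    U, Ws, losses = train(A)
    print("initial loss:", losses[0], "final loss:", losses[-1])
    print("balancedness gaps:", balance_gaps(U, Ws))
```

In this toy setting the gaps between consecutive residual factors should stay small, while the gap involving the zero-initialized last layer does not, which is one way to read "holds for most layers". The sketch illustrates the mechanism only and does not reproduce the paper's assumptions or rates.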

deep linear residual network, global convergence, initialization

