Reviews: Global Convergence of Gradient Descent for Deep Linear Residual Networks
–Neural Information Processing Systems
Response to authors' feedback: I thank the authors for the rebuttal. My score remains the same.

With this initialization, the networks are shown to converge linearly to zero loss under conditions (for discrete-time GD) that differ from, and are perhaps conceptually simpler than, those of previous works. For instance, compared to reference [2] (Arora et al., "A convergence analysis of gradient descent for deep linear neural networks", ICLR 2019), this work removes the delta-balanced condition of [2] entirely by showing that the condition actually holds, for most layers, along the GD trajectory (Lemma 4.2 and Eq. While certain elements have already appeared in previous works (e.g. the property in Lemma 4.2 is similar to the delta-balanced condition in [2], and the requirement of zero initialization for the last layer's weight appears in the "fixup initialization" of reference [21] in the context of residual networks), I think the proposed initialization and the convergence analysis here deserve credit for novelty.
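To make the setting discussed above concrete, the following is a minimal scalar sketch, not the paper's construction or proof. It assumes a toy depth-L linear residual model f(x) = a * prod_l (1 + w_l) * x with every residual weight and the last-layer weight initialized to zero, trained by plain discrete-time GD on a single target slope. The function name `train`, the depth, the target, and the step size are all hypothetical choices made for illustration.

```python
# Toy scalar sketch (an assumption for illustration, not the paper's setup):
# f(x) = a * prod_l (1 + w_l) * x, fit to a target slope with plain GD.

def train(depth=5, target=2.0, lr=0.02, steps=500):
    w = [0.0] * depth  # residual weights start at zero, so each layer is the identity
    a = 0.0            # last-layer weight starts at zero
    losses = []
    for _ in range(steps):
        # Forward pass: end-to-end slope of the network.
        prod = 1.0
        for w_l in w:
            prod *= 1.0 + w_l
        residual = a * prod - target
        losses.append(0.5 * residual ** 2)  # loss before this step's update

        # Gradients of 0.5 * residual**2 with respect to a and each w_l.
        grad_a = residual * prod
        grad_w = [residual * a * prod / (1.0 + w_l) for w_l in w]

        # Discrete-time gradient descent update.
        a -= lr * grad_a
        w = [w_l - lr * g for w_l, g in zip(w, grad_w)]

    # Scalar analogue of the balancedness gap between adjacent hidden layers.
    # In this toy it is zero by symmetry, since all hidden layers receive
    # identical updates; the matrix case in the paper is not this trivial.
    gap = max(abs((1.0 + w[j + 1]) ** 2 - (1.0 + w[j]) ** 2)
              for j in range(depth - 1))
    return a, w, losses, gap


if __name__ == "__main__":
    a, w, losses, gap = train()
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
    print("hidden-layer balancedness gap:", gap)
```

In this toy the hidden layers stay balanced with respect to each other while the last layer does not match them, which is only a loose analogue of the "for most layers" statement above; it says nothing about the conditions or rates proved in the paper.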
Jan-21-2025, 21:00:32 GMT