Review for NeurIPS paper: Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Feb-7-2025, 14:23:09 GMT–Neural Information Processing Systems

Weaknesses: * There might be multiple reasons why networks with BN are trainable under extreme conditions, including large learning rates and very large depth. I agree with the point made by this work, that small initialization in the residual branches is one such reason, which in turn makes a vanilla ResNet without normalization trainable. However, it is possible that normalized ResNets are trainable even without small initialization in the residual branches. It is well known that the input/output scale of the weights preceding batch normalization is less meaningful than it is for networks without normalization. For example, Li & Arora (2019) show that a slightly modified ResNet is trainable with an exponentially increasing learning rate and achieves performance as good as with a step decay schedule. The output of the residual blocks could also grow exponentially, but the network is still trainable because the gradients are small.
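To make the scale argument concrete, here is a minimal sketch of my own (not taken from the paper or from Li & Arora). It assumes a single linear unit followed by batch normalization with no epsilon term and no learned affine parameters. The toy data and the helper names `batch_norm`, `bn_linear`, `loss`, and `numerical_grad` are hypothetical. Under these assumptions, multiplying the weights by a constant `c` leaves the normalized output unchanged and shrinks the gradient by a factor of `c`.

```python
import math

def batch_norm(values):
    """Normalize a batch of scalar pre-activations to zero mean and unit variance.
    Assumption: no epsilon and no learned scale/shift, so invariance is exact."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var) for v in values]

def bn_linear(weights, batch):
    """One linear output unit followed by batch normalization."""
    pre = [sum(w * x for w, x in zip(weights, row)) for row in batch]
    return batch_norm(pre)

def loss(weights, batch, targets):
    """Mean squared error on the normalized outputs."""
    out = bn_linear(weights, batch)
    return sum((o - t) ** 2 for o, t in zip(out, targets)) / len(out)

def numerical_grad(weights, batch, targets, h=1e-6):
    """Central finite-difference gradient of the loss w.r.t. the weights."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += h
        down = list(weights)
        down[i] -= h
        grad.append((loss(up, batch, targets) - loss(down, batch, targets)) / (2 * h))
    return grad

# Hypothetical toy data: 4 examples, 2 features.
batch = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
targets = [1.0, -1.0, 0.5, 0.0]
weights = [0.7, -0.4]
c = 10.0
scaled = [c * w for w in weights]

# The normalized outputs agree up to floating-point error.
print(bn_linear(weights, batch))
print(bn_linear(scaled, batch))

# The gradient at the scaled weights is 1/c times the original gradient.
print(numerical_grad(weights, batch, targets))
print([c * g for g in numerical_grad(scaled, batch, targets)])
```

This only illustrates the scale-invariance property for a single layer. It does not reproduce the exponential learning rate experiments, and it says nothing about whether the residual-branch initialization studied in this paper is necessary in practice.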


