Reviews: Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

Neural Information Processing Systems 

The paper gives a theoretical study of the exploding and vanishing gradient problem (EVGP) in deep fully connected ReLU networks. As tractable proxies for whether the EVGP has been avoided, the paper proposes two criteria: the annealed EVGP and the quenched EVGP. It is then shown that both criteria are met if the sum of the reciprocals of the layer widths is small (so the width of every layer should ideally be large). To confirm this empirically, the paper uses an experiment from a concurrent work.

Comments: To motivate a formal study of the EVGP in deep networks, the authors refer to papers which suggest looking at the distribution of singular values of the input-output Jacobian.
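As a rough illustration of the quantities discussed above (this is not the paper's own experiment), the sketch below builds a randomly initialized fully connected ReLU network with He-style weights, computes its input-output Jacobian at a random input, and prints the spread of the Jacobian's singular values next to the sum of reciprocal hidden-layer widths. The particular depths, widths, and initialization are arbitrary choices made here for illustration.

```python
import numpy as np

def relu_net_jacobian(widths, x, rng):
    """Input-output Jacobian of a randomly initialized fully connected
    ReLU network (He-style weights, linear output layer) at input x."""
    J = np.eye(widths[0])
    h = x
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        pre = W @ h
        if i < len(widths) - 2:                 # hidden layer: apply ReLU
            D = np.diag((pre > 0).astype(float))  # derivative of ReLU
            h = np.maximum(pre, 0.0)
            J = D @ W @ J                       # chain rule through this layer
        else:                                   # output layer: linear
            h = pre
            J = W @ J
    return J

rng = np.random.default_rng(0)
d_in, d_out, depth = 10, 10, 20
for name, width in [("narrow", 10), ("wide", 500)]:
    widths = [d_in] + [width] * depth + [d_out]
    recip_sum = sum(1.0 / n for n in widths[1:-1])  # sum of 1/n_j over hidden layers
    x = rng.normal(size=d_in)
    sv = np.linalg.svd(relu_net_jacobian(widths, x, rng), compute_uv=False)
    print(f"{name:6s} sum(1/n_j) = {recip_sum:.2f}  "
          f"singular values in [{sv.min():.2e}, {sv.max():.2e}]")
```

With equal depth, the narrow architecture has a much larger sum of reciprocal widths and its Jacobian singular values are typically far more spread out (or collapse toward zero), which is the qualitative behavior the paper's criterion is meant to capture.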