Review for NeurIPS paper: Quantitative Propagation of Chaos for SGD in Wide Neural Networks

Additional Feedback: I have a few minor comments. Specifically: (1a) Depending on how one views the parameterization, the learning rate in previous papers on infinite-width SGD effectively depends on the number of hidden units N. What I mean is that you explicitly include the 1/N factor as the scale of the last-layer weights. This puts a factor of 1/N into the derivative d Loss / d W, where W is a weight in the first layer, which is akin to putting an extra 1/N into the learning rate. In previous papers (NTK-type analyses of deeper networks), this weight scale is sometimes N^{-1/2} instead.
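To make point (1a) concrete, here is a minimal numerical sketch (my own, not taken from the paper; the one-hidden-layer network f, the tanh nonlinearity, and the name first_layer_grad_scale are illustrative assumptions): under the 1/N output scaling, the gradient of the network output with respect to any single first-layer weight carries the explicit 1/N factor, so its typical magnitude shrinks like 1/N, which is equivalent to shrinking the effective learning rate by 1/N.

```python
import numpy as np

# Hypothetical one-hidden-layer network with mean-field (1/N) output scaling:
#   f(x) = (1/N) * sum_i a_i * tanh(w_i * x)
# The gradient w.r.t. a first-layer weight w_i carries the explicit 1/N factor:
#   df/dw_i = (1/N) * a_i * (1 - tanh(w_i * x)^2) * x
def first_layer_grad_scale(N, rng, x=1.0):
    a = rng.standard_normal(N)   # last-layer weights (before the 1/N scaling)
    w = rng.standard_normal(N)   # first-layer weights
    grad = (1.0 / N) * a * (1.0 - np.tanh(w * x) ** 2) * x
    return np.abs(grad).mean()   # typical per-weight gradient magnitude

rng = np.random.default_rng(0)
for N in (100, 1000, 10000):
    print(N, first_layer_grad_scale(N, rng))
```

Increasing N by a factor of 10 shrinks the typical per-weight gradient by roughly the same factor; with an N^{-1/2} weight scale, as in the NTK parameterization, the corresponding factor would be N^{-1/2} instead.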