Review for NeurIPS paper: Bad Global Minima Exist and SGD Can Reach Them

Neural Information Processing Systems 

Weaknesses:
- The paper claims to show for the first time that models that perfectly fit the training set can have different degrees of generalization depending on the initialization. This has been shown previously, also using a similar technique. See, for example, "Theoretical issues in deep networks" by Poggio et al. (PNAS), which shows (among other things) that, depending on the standard deviation of the distribution used to initialize the weights, the network converges to global minima with different test accuracy (see Fig. 2). Also, "Classical Generalization Bounds Are Surprisingly Tight For Deep Networks" by Liao et al. (CBMM Memo) introduces the training procedure "Random initialization → Training with random labels → Training with true labels", and goes further: they show that the test accuracy after training with the true labels varies with the number of images whose labels were randomized (see Section 2).
- The figures (Fig. 2 and 3) and tables (Table 1) are hard to quickly extract conclusions from.