. We would like to point out that

Neural Information Processing Systems 

We thank the reviewers for their valuable and constructive feedback. AdaReg does not explicitly force the weight matrices to be positively or negatively correlated. Our method is therefore orthogonal to, and not in conflict with, Dropout. Inspired by this result, we explored hyperparameter learning by empirical Bayes. Regarding BatchNorm, we do observe that a smaller batch size leads to better generalization.