2281f5c898351dbc6dace2ba201e7948-AuthorFeedback.pdf

Neural Information Processing Systems 

From an optimization perspective, as the reviewer pointed out, we use a preconditioning matrix to change the curvature and reduce the condition number of the optimization problem. We use the metaphor of statistical strength to refer to the fact that, by taking into account the correlations between data/gradients, we improve the effective sample size. From an optimization viewpoint, reducing the number of hidden features will not help optimization since the condition number can still be very large. To address the raised concerns, we performed additional experiments using only 15 hidden units in the last fully connected layer (the original implementation has 50 hidden units) on MNIST with batch size 256. The accuracies for {Regularizing_Type/Hidden_Dim} with {L2/50}, {L2/15}, {AdaReg/50}, and {AdaReg/15} are 97.53%, [...]

On the MNIST dataset, for most of the methods except AdaReg and BatchNorm, we do observe that a smaller batch size leads to better generalization.
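To make the preconditioning argument concrete, here is a minimal sketch on a toy diagonal quadratic. It is not the AdaReg procedure itself: the objective, the choice of the inverse Hessian diagonal as preconditioner, and all names (`condition_number`, `gradient_descent`, `H`) are assumptions made for illustration only.

```python
def condition_number(diag):
    """Condition number of a positive diagonal matrix: largest entry / smallest entry."""
    return max(diag) / min(diag)


def gradient_descent(hessian_diag, precond_diag, x0, step, tol=1e-8, max_iter=100_000):
    """Minimize f(x) = 0.5 * sum(h_i * x_i**2) with the update x <- x - step * P * grad.

    Returns the number of updates performed before the gradient's largest
    absolute entry drops below `tol`.
    """
    x = list(x0)
    for it in range(max_iter):
        grad = [h * xi for h, xi in zip(hessian_diag, x)]
        if max(abs(g) for g in grad) < tol:
            return it
        x = [xi - step * p * g for xi, p, g in zip(x, precond_diag, grad)]
    return max_iter


# Hypothetical ill-conditioned problem: curvature 1 in one direction, 100 in the other.
H = [1.0, 100.0]
identity = [1.0, 1.0]
inverse_h = [1.0 / h for h in H]  # idealized diagonal preconditioner (assumption)

# Plain gradient descent needs a step of about 1 / max curvature to remain stable.
plain_iters = gradient_descent(H, identity, [1.0, 1.0], step=1.0 / max(H))

# After preconditioning, the effective curvature P * H is the identity.
precond_H = [p * h for p, h in zip(inverse_h, H)]
precond_iters = gradient_descent(H, inverse_h, [1.0, 1.0], step=1.0)

print("condition number before:", condition_number(H))
print("condition number after: ", condition_number(precond_H))
print("iterations (plain, preconditioned):", plain_iters, precond_iters)
```

In this toy setting, shrinking the problem from two coordinates to one would not be what speeds up convergence; what matters is the ratio between the largest and smallest curvature, which the preconditioner changes directly.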
