Reviews: Swapout: Learning an ensemble of deep architectures
Neural Information Processing Systems
Why can't you estimate the test-time statistics empirically on a validation set?

I really appreciate the tidbits on why dropout and swapout interact poorly with batch normalization. It's also useful to know that you don't have to average over very many sampled dropout (swapout) masks; this is a neat additional analysis and rather useful to the community.

Why do the authors train for exactly 196 and then 224 epochs before decaying the learning rate? Normally such specific choices would arouse suspicion, though in this case I expect it makes little difference (e.g. between 196 and a round number like 200).
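The test-time averaging the review refers to can be made concrete with a minimal NumPy sketch (my own illustration, not the authors' code): a single linear layer with inverted-dropout masking, where the test-time prediction is estimated by averaging a handful of stochastic forward passes. The layer sizes, `drop_p`, and sample count are arbitrary assumptions for illustration.

```python
import numpy as np

# Hypothetical toy setup: one linear layer with dropout on its outputs.
rng = np.random.default_rng(0)
d, k = 8, 16                      # input and hidden dimensions (arbitrary)
x = rng.normal(size=d)            # a single input example
w = rng.normal(size=(d, k))       # layer weights

def stochastic_forward(x, w, drop_p=0.5):
    # One stochastic pass: drop hidden units at random, with
    # inverted-dropout scaling so the expectation over masks
    # matches the deterministic (undropped) activation.
    h = x @ w
    mask = rng.random(k) >= drop_p
    return float((h * mask / (1.0 - drop_p)).sum())

def mc_prediction(x, w, n_samples=2000):
    # Test-time estimate: average predictions over sampled masks
    # instead of applying the deterministic weight-scaling rule.
    return sum(stochastic_forward(x, w) for _ in range(n_samples)) / n_samples

deterministic = float((x @ w).sum())   # exact expectation over masks
estimate = mc_prediction(x, w)
```

With enough samples the Monte Carlo estimate concentrates around the deterministic expectation; in practice (as the review notes) a surprisingly small number of samples already gives a usable estimate.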