Interpolation can hurt robust generalization even when there is no noise

Donhauser, Konstantin; Ţifrea, Alexandru; Aerni, Michael; Heckel, Reinhard; Yang, Fanny

arXiv.org Machine Learning 

Conventional statistical wisdom cautions the practitioner who trains a model by minimizing a loss L(θ): if a global minimizer achieves zero or near-zero training loss (i.e., it interpolates), we run the risk of overfitting (i.e., high variance) and thus sub-optimal prediction performance. Instead, regularization is commonly used to reduce the effect of noise and to obtain an estimator with better generalization. Specifically, regularization limits model complexity and induces a worse fit to the training data, for example via an explicit penalty term R(θ). The resulting penalized loss L(θ) + λR(θ) explicitly imposes certain structural properties on the minimizer. This classical rationale, however, seemingly does not apply to overparameterized models: in practice, large neural networks, for example, exhibit good generalization performance on i.i.d. test data.
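
To make the penalized objective concrete, the sketch below (an illustration, not code from the paper) instantiates L(θ) + λR(θ) as ridge regression, with L(θ) = ||Xθ − y||² and R(θ) = ||θ||². In the overparameterized, noiseless regime described above, letting λ → 0 recovers the minimum-norm interpolator, while larger λ trades data fit for a smaller-norm solution. The data, dimensions, and the helper name ridge are all hypothetical.

    import numpy as np

    # Penalized objective L(theta) + lam * R(theta), instantiated as ridge
    # regression: L(theta) = ||X theta - y||^2, R(theta) = ||theta||^2.
    # Data and dimensions are hypothetical, chosen only for illustration.
    rng = np.random.default_rng(0)
    n, d = 20, 100                      # overparameterized: more parameters than samples
    X = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    y = X @ theta_star                  # noiseless labels, matching the paper's setting

    def ridge(X, y, lam):
        """Closed-form minimizer of ||X theta - y||^2 + lam * ||theta||^2."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    for lam in [1e-8, 1e-2, 1.0]:
        theta = ridge(X, y, lam)
        train_resid = np.linalg.norm(X @ theta - y)
        print(f"lambda={lam:g}: train residual={train_resid:.2e}, "
              f"||theta||={np.linalg.norm(theta):.2f}")
    # As lam -> 0 the solution approaches the minimum-norm interpolator
    # (near-zero training residual); increasing lam sacrifices data fit
    # for the structural property (small norm) imposed by the penalty.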