Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?
