Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization