A Additional related work


Soudry et al. [2018] showed that gradient descent on linearly-separable binary classification problems with exponentially-tailed losses converges in direction to the maximum ℓ2-margin predictor, i.e., the hard-margin SVM solution. This analysis was extended to other loss functions, tighter convergence rates, non-separable data, and variants of gradient-based optimization algorithms [e.g., Nacson et al., 2019]. As detailed in Section 2, Lyu and Li [2019] and Ji and Telgarsky [2020] showed that GF on homogeneous neural networks with exponential-type losses converges in direction to a KKT point of the maximum-margin problem in parameter space (stated formally below). The implications of margin maximization in parameter space on the implicit bias in predictor space for linear neural networks were studied in Gunasekar et al. [2018b] (as detailed in Section 2) and also in Jagadeesan et al. [2021] and Ergen and Pilanci [2021a,b]. Moreover, several recent works considered implications of convergence to a KKT point of the maximum-margin problem, without assuming that the KKT point is optimal: Safran et al. [2022] proved a generalization bound for univariate depth-2 ReLU networks, and Vardi et al. [2022] proved a bias towards non-robust solutions in depth-2 ReLU networks.

The implicit bias in predictor space of diagonal and convolutional linear networks was studied in Gunasekar et al. [2018b] and Moroshko et al. [2020]. Lyu et al. [2021] studied the implicit bias in two-layer leaky-ReLU networks trained on linearly separable data. They also gave constructions where a KKT point is not a global max-margin solution. We note that their constructions do not imply any of our results.

Finally, the implicit bias of neural networks in regression tasks w.r.t. the square loss was also studied in several recent works. This setting, however, is less relevant to our work.
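For reference, the maximum-margin problem in parameter space discussed above can be stated as follows. This is a minimal sketch following the standard formulation of Lyu and Li [2019]; the notation Φ(θ; ·) for a homogeneous network and {(x_i, y_i)}_{i=1}^n for the training set is introduced here only for illustration:

\[
\min_{\theta} \;\; \frac{1}{2} \|\theta\|_2^2
\quad \text{s.t.} \quad
y_i \, \Phi(\theta; x_i) \ge 1 \quad \forall i \in [n].
\]

A feasible point θ is a KKT point of this problem if there exist multipliers λ_1, …, λ_n ≥ 0 satisfying stationarity and complementary slackness:

\[
\theta = \sum_{i=1}^{n} \lambda_i \, y_i \, \nabla_\theta \Phi(\theta; x_i),
\qquad
\lambda_i \left( y_i \, \Phi(\theta; x_i) - 1 \right) = 0 \quad \forall i \in [n]
\]

(for non-smooth networks such as ReLU networks, gradients are replaced by Clarke subdifferentials). Convergence "in direction" means that θ(t)/‖θ(t)‖ converges to θ̄/‖θ̄‖ for such a KKT point θ̄.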