Any-stepsize Gradient Descent for Separable Data under Fenchel-Young Losses
Neural Information Processing Systems
Gradient descent (GD) has been one of the most common optimizers in machine learning. In particular, the loss landscape of a neural network is typically sharpened during the initial phase of training, making the training dynamics hover on the edge of stability. This behavior lies beyond the standard understanding of GD convergence in the stable regime, where the stepsize is chosen sufficiently small. Recently, Wu et al. [63] have shown that GD converges with a much larger stepsize under linearly separable logistic regression. Their analysis hinges on the self-bounding property of the logistic loss, which seems to be a cornerstone for establishing a modified descent lemma. Nevertheless, our pilot study shows that other loss functions without the self-bounding property can also make GD attain arbitrarily small loss with a large stepsize.
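To make the setting concrete, the following is a minimal sketch of plain GD on the logistic loss over a small linearly separable dataset. The data points, the stepsizes, and the helper names (`Z`, `softplus_neg`, `sigmoid_neg`, `mean_loss`, `run_gd`) are hypothetical choices for illustration only and do not reproduce the experiments of the paper or of Wu et al. [63]. Each row of `Z` is assumed to already be the label-scaled feature vector y_i x_i, so a positive margin means a correct classification.

```python
import math

# Hypothetical toy data: each row is z_i = y_i * x_i. Every row has a positive
# first coordinate, so w = (1, 0) separates the data.
Z = [(1.0, 0.5), (0.8, -0.6), (0.3, 1.0), (0.6, -0.2)]


def softplus_neg(m):
    """Logistic loss log(1 + exp(-m)), computed in a numerically stable way."""
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def sigmoid_neg(m):
    """sigma(-m) = 1 / (1 + exp(m)), computed without overflow."""
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def mean_loss(w):
    """Average logistic loss over the dataset at parameter w."""
    return sum(softplus_neg(w[0] * z[0] + w[1] * z[1]) for z in Z) / len(Z)


def run_gd(stepsize, steps):
    """Run GD from w = 0 and return the loss at every iterate (steps + 1 values)."""
    w = [0.0, 0.0]
    losses = [mean_loss(w)]
    for _ in range(steps):
        # Gradient of the mean loss is -(1/n) * sum_i sigma(-m_i) * z_i.
        g = [0.0, 0.0]
        for z in Z:
            weight = sigmoid_neg(w[0] * z[0] + w[1] * z[1])
            g[0] -= weight * z[0] / len(Z)
            g[1] -= weight * z[1] / len(Z)
        w = [w[0] - stepsize * g[0], w[1] - stepsize * g[1]]
        losses.append(mean_loss(w))
    return losses


# Compare a stepsize inside the classical stable regime with a much larger one.
for eta in (1.0, 50.0):
    trajectory = run_gd(stepsize=eta, steps=50)
    print(f"stepsize={eta}: initial loss={trajectory[0]:.4f}, "
          f"final loss={trajectory[-1]:.6f}")
```

On this toy data the smoothness constant of the mean loss is at most a quarter of the largest squared norm of a row of `Z`, so the stepsize 1.0 falls in the regime where the classical descent lemma guarantees a monotone decrease. The stepsize 50.0 lies well outside that regime, and its loss trajectory can be inspected to see how GD behaves there.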
Jun-18-2026, 02:49:32 GMT