Don't blame Dataset Shift! Shortcut Learning due to Gradients and Cross Entropy

Jan-20-2025, 00:48:25 GMT–Neural Information Processing Systems

Common explanations for shortcut learning assume that the shortcut improves prediction only under the training distribution. Thus, models trained in the typical way by minimizing log-loss using gradient descent, which we call default-ERM, should utilize the shortcut. However, even when the stable feature determines the label in the training distribution and the shortcut does not provide any additional information, as in perception tasks, default-ERM exhibits shortcut learning. Why are such solutions preferred when the loss can be driven to zero using the stable feature alone? By studying a linear perception task, we show that default-ERM's preference for maximizing the margin, even without overparameterization, leads to models that depend more on the shortcut than the stable feature.
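
To make the term concrete, the following is a minimal sketch of what the abstract calls default-ERM: gradient descent on the log-loss of a linear model. Everything about the data is an assumption made for illustration and is not the paper's construction: the helper names (`make_data`, `default_erm`), the two-feature layout, the agreement rate `rho`, the step size, and the step count are all hypothetical. The stable feature always equals the label, while the shortcut agrees with it only with probability `rho`. This toy setup shows the training procedure only; it is not meant to reproduce the paper's finding about shortcut dependence.

```python
import math
import random


def make_data(n, rho=0.9, seed=0):
    """Hypothetical toy data (an assumption, not the paper's task).

    Label y is in {-1, +1}. The stable feature equals y exactly.
    The shortcut equals y with probability rho, otherwise -y.
    Each example is ((shortcut, stable), y).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1.0, 1.0])
        z = y if rng.random() < rho else -y
        data.append(((z, y), y))
    return data


def log_loss(w, data):
    """Mean log-loss: average of log(1 + exp(-y * <w, x>))."""
    total = 0.0
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1])
        total += math.log1p(math.exp(-margin))
    return total / len(data)


def gradient(w, data):
    """Gradient of the mean log-loss with respect to w."""
    g = [0.0, 0.0]
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin))
        scale = -y / (1.0 + math.exp(margin))
        g[0] += scale * x[0]
        g[1] += scale * x[1]
    return [gi / len(data) for gi in g]


def default_erm(data, steps=500, lr=0.5):
    """Plain gradient descent on log-loss, starting from w = 0."""
    w = [0.0, 0.0]
    losses = []
    for _ in range(steps):
        losses.append(log_loss(w, data))
        g = gradient(w, data)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w, losses


if __name__ == "__main__":
    w, losses = default_erm(make_data(200))
    print("weights (shortcut, stable):", w)
    print("first and last loss:", losses[0], losses[-1])
```

Because the stable feature alone separates the classes here, the loss can be pushed toward zero without the shortcut. Which solution gradient descent actually favors in the paper's linear perception task is the question the paper studies.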

gradient and cross entropy, perception task, shortcut

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.42)