A distributional simplicity bias in the learning dynamics of transformers

Neural Information Processing Systems

The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a "simplicity bias": neural networks prevent overfitting by initially learning simple classifiers before progressing to





Appendix

Neural Information Processing Systems

B.1 Baseline GHN: GHN-1. GHNs were designed for NAS, which typically makes strong assumptions about the choice of operations and their possible dimensions to make search and learning feasible.