Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers
