Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis
