Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
