White-Box Transformers via Sparse Rate Reduction
Neural Information Processing Systems
In Section 2.2 we show, using an idealized model for the token distribution, that if one iteratively
Feb-8-2026, 16:16:29 GMT