Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
