The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Neural Information Processing Systems 

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment.
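
To make the "advantageous deployment characteristics" concrete, here is a minimal, schematic sketch (not the paper's method; all function and variable names are hypothetical) contrasting per-token decoding cost: a Transformer must grow and re-read a KV cache at every step, while a linear RNN folds each token into a fixed-size state.

```python
import numpy as np

def transformer_step(kv_cache, k_t, v_t, q_t):
    # Transformer decoding: append the new key/value to the cache, then attend
    # over all T cached positions -- O(T) time and memory per generated token.
    kv_cache["K"].append(k_t)
    kv_cache["V"].append(v_t)
    K = np.stack(kv_cache["K"])              # (T, d), grows every step
    V = np.stack(kv_cache["V"])              # (T, d)
    scores = K @ q_t / np.sqrt(len(q_t))     # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                       # (d,)

def linear_rnn_step(state, k_t, v_t, q_t, a=0.9):
    # Linear-RNN decoding (schematic; a is a fixed decay, a stand-in for a
    # learned transition): fold the token into a fixed-size state matrix --
    # O(1) time and memory per token, independent of sequence length.
    state = a * state + np.outer(k_t, v_t)   # (d, d), size never grows
    return q_t @ state, state

rng = np.random.default_rng(0)
d = 4
cache = {"K": [], "V": []}
state = np.zeros((d, d))
for _ in range(8):                            # decode 8 tokens
    q, k, v = rng.normal(size=(3, d))
    y_attn = transformer_step(cache, k, v, q)
    y_rnn, state = linear_rnn_step(state, k, v, q)
print("KV cache entries after 8 tokens:", len(cache["K"]))  # grows with T
print("RNN state shape after 8 tokens:", state.shape)       # fixed (d, d)
```

The constant-size state is what makes linear RNNs attractive for serving: memory and per-token latency stay flat as generation length grows, whereas attention's cache scales linearly with context.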