Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Neural Information Processing Systems
Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1× higher throughput at sequence length 4K.
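To make the sub-quadratic, GEMM-based primitive concrete, the sketch below shows a minimal order-2 Monarch matrix-vector multiply: two block-diagonal factors interleaved with permutations, where the permutations are realized as reshape/transpose so the whole operation reduces to two batched GEMMs at O(n^1.5) FLOPs rather than the O(n^2) of a dense matvec. This is an illustrative sketch, not the paper's released implementation; the function name, block layout, and the particular stride permutation are assumptions made for clarity.

```python
import torch

def monarch_matvec(x: torch.Tensor, B1: torch.Tensor, B2: torch.Tensor) -> torch.Tensor:
    """Apply an order-2 Monarch matrix to a length-n vector, with n = b * b.

    B1, B2: (b, b, b) tensors holding b dense (b x b) blocks each, i.e. the two
    block-diagonal factors. The interleaving permutations are realized as
    reshape/transpose, so the multiply is two batched GEMMs: O(n^1.5) FLOPs.
    (Illustrative sketch; layout and names are assumptions, not the paper's code.)
    """
    b = B1.shape[0]
    # Permutation: lay x out on a (b, b) grid and transpose so each block of B1
    # mixes one stride-b slice of the input.
    x = x.view(b, b).t().contiguous()
    x = torch.bmm(B1, x.unsqueeze(-1)).squeeze(-1)   # first block-diagonal GEMM
    # Second permutation, then the second block-diagonal GEMM.
    x = x.t().contiguous()
    x = torch.bmm(B2, x.unsqueeze(-1)).squeeze(-1)
    return x.reshape(-1)

# Toy usage: a 256-dimensional input factored as 16 x 16 blocks.
n, b = 256, 16
x = torch.randn(n)
B1, B2 = torch.randn(b, b, b), torch.randn(b, b, b)
y = monarch_matvec(x, B1, B2)
assert y.shape == (n,)
```

Because every step is either a reshape or a batched matrix multiply, the operation maps directly onto GPU GEMM units, which is the hardware-efficiency property the abstract refers to.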