Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Neural Information Processing Systems
Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1× higher throughput at sequence length 4K.
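To make the sub-quadratic, GEMM-based primitive concrete, the sketch below shows a minimal order-2 Monarch matrix-vector multiply: two block-diagonal factors interleaved with permutations, where the permutations are realized as reshape/transpose so the whole operation reduces to two batched GEMMs at O(n^1.5) FLOPs rather than the O(n^2) of a dense matvec. This is an illustrative sketch, not the paper's released implementation; the function name, block layout, and the particular stride permutation are assumptions made for clarity.

```python
import torch

def monarch_matvec(x: torch.Tensor, B1: torch.Tensor, B2: torch.Tensor) -> torch.Tensor:
    """Apply an order-2 Monarch matrix to a length-n vector, with n = b * b.

    B1, B2: (b, b, b) tensors holding b dense (b x b) blocks each, i.e. the two
    block-diagonal factors. The interleaving permutations are realized as
    reshape/transpose, so the multiply is two batched GEMMs: O(n^1.5) FLOPs.
    (Illustrative sketch; layout and names are assumptions, not the paper's code.)
    """
    b = B1.shape[0]
    # Permutation: lay x out on a (b, b) grid and transpose so each block of B1
    # mixes one stride-b slice of the input.
    x = x.view(b, b).t().contiguous()
    x = torch.bmm(B1, x.unsqueeze(-1)).squeeze(-1)   # first block-diagonal GEMM
    # Second permutation, then the second block-diagonal GEMM.
    x = x.t().contiguous()
    x = torch.bmm(B2, x.unsqueeze(-1)).squeeze(-1)
    return x.reshape(-1)

# Toy usage: a 256-dimensional input factored as 16 x 16 blocks.
n, b = 256, 16
x = torch.randn(n)
B1, B2 = torch.randn(b, b, b), torch.randn(b, b, b)
y = monarch_matvec(x, B1, B2)
assert y.shape == (n,)
```

Because every step is either a reshape or a batched matrix multiply, the operation maps directly onto GPU GEMM units, which is the hardware-efficiency property the abstract refers to.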