Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Neural Information Processing Systems

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance.