M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference

Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova

arXiv.org (Artificial Intelligence)

Residual transformations are central to the representational depth and expressive power of large language models (LLMs). However, applying a static residual transformation to every token during auto-regressive generation forces a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depths, attempt to address this by modulating the residual transformation based on token-level complexity. These approaches, however, consider only the distance a token traverses through the model's layers, neglecting the underlying velocity of residual evolution. In this work, we introduce Mixture of Multi-Rate Residuals (M2R2), a novel framework that dynamically modulates the velocity of residual transformations to optimize early residual alignment, improving inference efficiency by letting intermediate representations align with the final output at earlier stages. We demonstrate the efficacy of our technique in diverse optimization setups, including dynamic computing, speculative decoding, and Mixture-of-Experts (MoE) ahead-of-time (AoT) loading, on challenging reasoning tasks from Koala, Self-Instruct, WizardLM, and MT-Bench. Our approach empirically outperforms state-of-the-art distance-based residual strategies, achieving a better trade-off between generation metrics and speedup in dynamic computing settings. In self-speculative decoding setups, M2R2 achieves up to 2.8X speedup on MT-Bench under lossless conditions, outperforming state-of-the-art approaches such as two-model speculative decoding, Medusa, Lookahead Decoding, and DEED. In MoE architectures, we further accelerate decoding by coupling early residual alignment with ahead-of-time loading of experts into high-bandwidth memory (HBM), enabling concurrent memory access and computation and reducing the latency bottlenecks of expert switching during decoding. Empirically, our method delivers a 2.9X speedup in MoE architectures, making it a highly effective strategy for resource-constrained environments.
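To make the abstract's central idea concrete: modulating the velocity of residual evolution means scaling, per token, how much each layer's sublayer output advances the residual stream, rather than only deciding how many layers (distance) a token passes through. Below is a minimal PyTorch sketch of one such block; the scalar rate head, its (0, 2) range, and the sublayer stand-in are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiRateBlock(nn.Module):
    """Transformer block whose residual update is scaled per token (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        # Stand-in for the block's attention + MLP sublayers.
        self.f = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Hypothetical per-token rate head: predicts how fast the
        # residual stream should evolve at this layer.
        self.rate = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model). r lies in (0, 2): r < 1 slows the
        # residual stream, r > 1 accelerates it, letting "easy" tokens
        # align with the final representation at earlier layers.
        r = 2.0 * torch.sigmoid(self.rate(h))
        return h + r * self.f(h)

block = MultiRateBlock(512)
out = block(torch.randn(2, 16, 512))  # same shape as the input
```

A rate of exactly 1 recovers the standard static residual update, so distance-based methods like early exiting can be seen as the special case where the rate is either 0 or 1 per layer.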
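The MoE speedup the abstract reports comes from overlapping expert weight transfers with computation: if the early-aligned residual state can predict which experts a later layer will route to, those experts can be staged into HBM while earlier layers are still computing. The sketch below illustrates that overlap with CUDA streams; `predict_experts`, the layer and expert containers, and the stream discipline are assumptions for illustration, not the paper's mechanism.

```python
import torch

# Side stream on which expert weights are copied host -> HBM while the
# current layer computes. Assumes a CUDA device and pinned host memory,
# which is required for the copies to be truly asynchronous.
copy_stream = torch.cuda.Stream()

def decode_step_with_aot(layers, cpu_experts, predict_experts, h):
    """One decode step with ahead-of-time expert loading (sketch)."""
    for l, layer in enumerate(layers):
        if l + 1 < len(layers):
            # Hypothetical predictor: uses the (early-aligned) residual
            # state to guess which experts layer l+1 will route to.
            for e in predict_experts(h, l + 1):
                with torch.cuda.stream(copy_stream):
                    # nn.Module.to() moves parameters in place; the copy
                    # proceeds concurrently with compute on the default stream.
                    cpu_experts[l + 1][e].to("cuda", non_blocking=True)
        h = layer(h)  # layer l's compute overlaps the expert transfer
        # Ensure layer l+1's experts have arrived before they are used.
        torch.cuda.current_stream().wait_stream(copy_stream)
    return h
```

The design point is that the transfer latency is hidden rather than eliminated: expert switching stops stalling decoding only to the extent that the predictor is accurate and each layer's compute time covers the copy.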