M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference

Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova

arXiv.org (Artificial Intelligence)

Residual transformations are central to the representational depth and expressive power of large language models (LLMs). However, applying a static residual transformation to every token during auto-regressive generation forces a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depths, attempt to address this by modulating the residual transformation based on token-level complexity. These approaches, however, consider only the distance a token traverses through the model's layers, neglecting the underlying velocity of residual evolution. In this work, we introduce Mixture of Multi-Rate Residuals (M2R2), a novel framework that dynamically modulates the velocity of residual transformations to optimize early residual alignment, improving inference efficiency by letting intermediate representations align with the final output at earlier stages. We demonstrate the efficacy of our technique in diverse optimization setups, including dynamic computing, speculative decoding, and Mixture-of-Experts (MoE) ahead-of-time (AoT) loading, on challenging reasoning tasks from Koala, Self-Instruct, WizardLM, and MT-Bench. Our approach empirically outperforms state-of-the-art distance-based residual strategies, achieving a better trade-off between generation metrics and speedup in dynamic computing settings. In self-speculative decoding setups, M2R2 achieves up to 2.8X speedup on MT-Bench under lossless conditions, outperforming state-of-the-art approaches such as two-model speculative decoding, Medusa, Lookahead Decoding, and DEED. In MoE architectures, we further accelerate decoding by coupling early residual alignment with ahead-of-time loading of experts into high-bandwidth memory (HBM), enabling concurrent memory access and computation and reducing the latency bottlenecks of expert switching during decoding. Empirically, our method delivers a 2.9X speedup in MoE architectures, making it a highly effective strategy for resource-constrained environments.
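To make the abstract's central idea concrete: modulating the velocity of residual evolution means scaling, per token, how much each layer's sublayer output advances the residual stream, rather than only deciding how many layers (distance) a token passes through. Below is a minimal PyTorch sketch of one such block; the scalar rate head, its (0, 2) range, and the sublayer stand-in are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiRateBlock(nn.Module):
    """Transformer block whose residual update is scaled per token (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        # Stand-in for the block's attention + MLP sublayers.
        self.f = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Hypothetical per-token rate head: predicts how fast the
        # residual stream should evolve at this layer.
        self.rate = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model). r lies in (0, 2): r < 1 slows the
        # residual stream, r > 1 accelerates it, letting "easy" tokens
        # align with the final representation at earlier layers.
        r = 2.0 * torch.sigmoid(self.rate(h))
        return h + r * self.f(h)

block = MultiRateBlock(512)
out = block(torch.randn(2, 16, 512))  # same shape as the input
```

A rate of exactly 1 recovers the standard static residual update, so distance-based methods like early exiting can be seen as the special case where the rate is either 0 or 1 per layer.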
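The MoE speedup the abstract reports comes from overlapping expert weight transfers with computation: if the early-aligned residual state can predict which experts a later layer will route to, those experts can be staged into HBM while earlier layers are still computing. The sketch below illustrates that overlap with CUDA streams; `predict_experts`, the layer and expert containers, and the stream discipline are assumptions for illustration, not the paper's mechanism.

```python
import torch

# Side stream on which expert weights are copied host -> HBM while the
# current layer computes. Assumes a CUDA device and pinned host memory,
# which is required for the copies to be truly asynchronous.
copy_stream = torch.cuda.Stream()

def decode_step_with_aot(layers, cpu_experts, predict_experts, h):
    """One decode step with ahead-of-time expert loading (sketch)."""
    for l, layer in enumerate(layers):
        if l + 1 < len(layers):
            # Hypothetical predictor: uses the (early-aligned) residual
            # state to guess which experts layer l+1 will route to.
            for e in predict_experts(h, l + 1):
                with torch.cuda.stream(copy_stream):
                    # nn.Module.to() moves parameters in place; the copy
                    # proceeds concurrently with compute on the default stream.
                    cpu_experts[l + 1][e].to("cuda", non_blocking=True)
        h = layer(h)  # layer l's compute overlaps the expert transfer
        # Ensure layer l+1's experts have arrived before they are used.
        torch.cuda.current_stream().wait_stream(copy_stream)
    return h
```

The design point is that the transfer latency is hidden rather than eliminated: expert switching stops stalling decoding only to the extent that the predictor is accurate and each layer's compute time covers the copy.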