Can Mamba Always Enjoy the "Free Lunch"?

Ruifeng Ren, Zhicong Li, Yong Liu

arXiv.org Artificial Intelligence 

Transformers have been the cornerstone of current Large Language Models (LLMs); however, the linear growth of their inference overhead with respect to sequence length poses challenges for modeling long sequences. In this context, Mamba has gradually attracted attention due to its constant-size state during inference, and existing empirical results have shown that it can perform comparably to Transformers in sequence modeling while offering significant savings. However, one may ask: can Mamba always enjoy the "free lunch"? In this paper, we analyze the expressive ability of Mamba from a theoretical standpoint. First, inspired by the connection between Mamba and linear attention, we investigate potential shortcomings of Mamba when performing the COPY operation. Our results indicate that Mamba with constant size may encounter bottlenecks when handling COPY, whereas it can achieve perfect performance when its size scales linearly with sequence length. Based on this observation, we analyze Mamba's ability to tackle dynamic programming (DP) problems when equipped with Chain of Thought (CoT). Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to that of standard and efficient Transformers; however, as with efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our results contribute to a deeper understanding of Mamba.

Recently, Transformer-based large language models (LLMs) have become the mainstream of modern neural network architectures due to their outstanding performance across a wide range of tasks (Vaswani et al., 2017; Kenton & Toutanova, 2019; Brown et al., 2020; Dosovitskiy et al., 2020; Min et al., 2022). However, the core component of Transformers, the attention layer, while providing excellent performance, also brings notable drawbacks: during training, the computational cost scales quadratically with sequence length, and during inference, the cost scales linearly with sequence length. This limitation becomes increasingly unacceptable when dealing with long-sequence tasks. To address this issue, many works have attempted to improve the attention mechanism to reduce its time and storage costs (Tay et al., 2023; Choromanski et al., 2020; Katharopoulos et al., 2020; Beltagy et al., 2020; Child et al., 2019). However, these improved structures often achieve efficiency in the attention layer at the expense of some performance.
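To make the COPY bottleneck concrete, the following is a minimal, illustrative sketch (not the paper's construction) in the spirit of the Mamba/linear-attention connection mentioned above: T key-value pairs are written into a single d x d state via outer products, and each value is then read back by its key. The dimension d, the orthonormal choice of keys, and the outer-product write rule are all assumptions made for illustration. Recall is exact while T <= d, i.e., while the state size effectively matches the sequence length; once T exceeds d, interference makes perfect COPY impossible with a constant-size state.

import numpy as np

rng = np.random.default_rng(0)
d = 16  # state dimension (illustrative assumption)

def normalize(x):
    return x / np.linalg.norm(x)

def copy_then_recall(T):
    """Write T (key, value) pairs into one d x d state via outer products
    (linear-attention style), then read each value back by its key;
    return the worst-case recall error over all T pairs."""
    Q = np.linalg.qr(rng.normal(size=(d, d)))[0]  # at most d orthonormal keys exist in R^d
    keys = [Q[:, t] if t < d else normalize(rng.normal(size=d)) for t in range(T)]
    vals = rng.normal(size=(T, d))
    S = np.zeros((d, d))  # constant-size memory, independent of T
    for k, v in zip(keys, vals):
        S += np.outer(k, v)  # write: rank-one update of the state
    errs = [np.linalg.norm(k @ S - v) for k, v in zip(keys, vals)]
    return max(errs)

for T in (8, 16, 24, 32):
    print(f"T = {T:2d}  worst recall error = {copy_then_recall(T):.3f}")

Recall is (numerically) exact for T <= d and degrades sharply for T > d: a fixed-size state cannot perform COPY perfectly once the sequence outgrows its capacity, matching the dichotomy stated in the abstract.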