Exploring State-Space-Model based Language Model in Music Generation
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
arXiv.org Artificial Intelligence
ABSTRACT The recent surge in State Space Models (SSMs) [8, 9], particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we aim to explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single codebook can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. Audio examples are available on GitHub.
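The abstract's core representation is Residual Vector Quantization: each vector is quantized by a cascade of codebooks, where every level quantizes the residual left by the previous one, so the first codebook carries the coarsest (most "semantic") information. The following is a minimal NumPy sketch of that idea; the function names, codebook sizes, and random codebooks are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residually quantize x: at each level, pick the nearest code
    vector and subtract it, passing the residual to the next level."""
    indices, residual = [], x.astype(float)
    for cb in codebooks:  # each cb is a (K, D) array of code vectors
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices, residual

def rvq_decode(indices, codebooks, levels=None):
    """Reconstruct from the first `levels` codebooks (all by default)."""
    levels = len(codebooks) if levels is None else levels
    return sum(cb[i] for cb, i in zip(codebooks[:levels], indices[:levels]))

# Illustrative setup: 4 levels, 16 codes per level, 8-dim vectors.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) for _ in range(4)]
x = rng.standard_normal(8)

indices, residual = rvq_encode(x, codebooks)
coarse = rvq_decode(indices, codebooks, levels=1)  # first codebook only
full = rvq_decode(indices, codebooks)              # all levels
# By construction, x == sum of chosen codes + final residual.
```

Modeling only the first-level token sequence, as the paper does, amounts to generating `coarse`-level structure and leaving the finer residual levels to other components.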
Jul-10-2025