Exploring State-Space-Model based Language Model in Music Generation

Lee, Wei-Jaw, Hsieh, Fang-Chih, Chen, Xuanjun, Tsai, Fang-Duo, Yang, Yi-Hsuan

arXiv.org Artificial Intelligence 

ABSTRACT The recent surge in State Space Models (SSMs) [8, 9], particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens from Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single codebook layer can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA converges much faster and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. Audio examples are available on GitHub.
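To make the RVQ representation concrete, the following is a minimal sketch of residual vector quantization, where each stage quantizes the residual left by the previous stage; keeping only the first stage's indices yields the single-codebook representation the abstract refers to. All function names, shapes, and the random codebooks here are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual vector quantization (illustrative, not the paper's tokenizer).

    x: (T, D) array of frame features.
    codebooks: list of (K, D) arrays; stage i quantizes the residual of stage i-1.
    Returns a list of (T,) index arrays, one per codebook.
    """
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # distance from each residual frame to every codeword, via broadcasting
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)          # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]       # pass the residual to the next stage
    return codes  # codes[0] alone is the single-codebook representation

# toy demo: 4 frames of 8-dim features, 3 codebooks of 16 entries each
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
cbs = [rng.normal(size=(16, 8)) for _ in range(3)]
codes = rvq_encode(x, cbs)
```

Under this scheme, a decoder-only sequence model (Transformer or SiMBA) only needs to predict the `codes[0]` stream, rather than all codebook levels jointly.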