Exploring State-Space-Model based Language Model in Music Generation

Lee, Wei-Jaw, Hsieh, Fang-Chih, Chen, Xuanjun, Tsai, Fang-Duo, Yang, Yi-Hsuan

arXiv.org Artificial Intelligence 

ABSTRACT The recent surge in State Space Models (SSMs) [8, 9], particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens from Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single codebook layer can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA converges much faster and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. Audio examples are available on GitHub.
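To make the RVQ representation concrete, the following is a minimal sketch of residual vector quantization, where each stage quantizes the residual left by the previous stage; keeping only the first stage's indices yields the single-codebook representation the abstract refers to. All function names, shapes, and the random codebooks here are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual vector quantization (illustrative, not the paper's tokenizer).

    x: (T, D) array of frame features.
    codebooks: list of (K, D) arrays; stage i quantizes the residual of stage i-1.
    Returns a list of (T,) index arrays, one per codebook.
    """
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # distance from each residual frame to every codeword, via broadcasting
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)          # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]       # pass the residual to the next stage
    return codes  # codes[0] alone is the single-codebook representation

# toy demo: 4 frames of 8-dim features, 3 codebooks of 16 entries each
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
cbs = [rng.normal(size=(16, 8)) for _ in range(3)]
codes = rvq_encode(x, cbs)
```

Under this scheme, a decoder-only sequence model (Transformer or SiMBA) only needs to predict the `codes[0]` stream, rather than all codebook levels jointly.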