
Neural Information Processing Systems 

Large language models (LLMs) have demonstrated strong capabilities in textual understanding and generation, but they cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach that enables frozen LLMs to perform multiple audio tasks in a few-shot manner without any parameter updates. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, which transfers the audio modality into the textual space by representing audio tokens with words or sub-words from the LLM vocabulary, while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the well-trained textual space of the LLM. The audio representation can thus be viewed as a new foreign language, which the LLM can learn from a handful of demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks.
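To make the core idea concrete, below is a minimal sketch of how audio frames might be quantized against a frozen LLM's token-embedding table so that each frame is represented by an existing word or sub-word, which is the mechanism the abstract describes. All names, dimensions, and the toy encoder are illustrative assumptions, not the paper's actual LLM-Codec architecture.

import torch

torch.manual_seed(0)

VOCAB_SIZE, DIM, N_FRAMES = 32000, 512, 8  # assumed sizes, not from the paper

# Stand-in for the frozen LLM's token-embedding matrix (never updated).
llm_embed = torch.randn(VOCAB_SIZE, DIM)

def toy_audio_encoder(waveform: torch.Tensor) -> torch.Tensor:
    """Placeholder audio encoder: waveform -> frame embeddings of shape (T, DIM)."""
    return torch.randn(N_FRAMES, DIM)

def quantize_to_llm_tokens(frames: torch.Tensor) -> torch.Tensor:
    """Map each audio frame to its nearest LLM vocabulary entry (L2 distance).

    The returned ids index real words/sub-words, so the audio clip becomes a
    'sentence' in the LLM's native token space, i.e. a new foreign language.
    """
    dists = torch.cdist(frames, llm_embed)  # (T, VOCAB_SIZE) pairwise distances
    return dists.argmin(dim=-1)             # (T,) vocabulary token ids

# Encode a (dummy) waveform into LLM vocabulary tokens.
audio_token_ids = quantize_to_llm_tokens(toy_audio_encoder(torch.zeros(16000)))

# With audio rendered as ordinary vocabulary tokens, a frozen LLM can be
# prompted few-shot by interleaving (audio tokens, label) demonstrations
# with a query clip; here both are the same placeholder tensor.
prompt_ids = torch.cat([audio_token_ids, audio_token_ids])
print(prompt_ids.shape)  # torch.Size([16])

The appeal of this design, as the abstract argues, is that reusing the LLM's existing embedding space requires no vocabulary expansion and no parameter updates: the frozen model treats the quantized audio exactly like text it already knows how to read.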
