UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping

Oct-15-2025–arXiv.org Artificial Intelligence

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks, an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Figure 1: Humans need understanding, generation, and reasoning to handle complex tasks, like composing music.

Human auditory intelligence is characterized by two fundamental capabilities: perception (understanding) and production (generation). This duality is not merely conceptual; neuroscientific evidence reveals a profound synergy between these functions, where impairment in one often corresponds to a deficit in the other (Liberman et al., 1967; Hickok & Poeppel, 2007; Rizzolatti & Craighero, 2004). Furthermore, resolving complex acoustic challenges requires a sophisticated reasoning process that is inherently multimodal (McGurk & MacDonald, 1976; Leman, 2007; Denes & Pinson, 1993; Liberman & Mattingly, 1985).



Country:
- North America (0.46)

Genre:
- Research Report (1.00)

Industry:
- Media > Music (1.00)
- Leisure & Entertainment (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks (0.93)
