DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

arXiv.org Artificial Intelligence 

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches often suffer from catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

The development of general-purpose artificial intelligence has become a central focus in contemporary AI research, driven by the remarkable performance of large language models (LLMs) across various natural language understanding and generation tasks [1]-[7]. Building on these advancements, a promising direction is to equip LLMs with multi-modal understanding capabilities, leading to the emergence of Large Audio Language Models (LALMs) [8]-[22] and Large Vision Language Models (LVLMs) [23]-[27]. This paper focuses on building a general-purpose LALM, as illustrated in Figure 1. To develop a general-purpose LALM, two core capabilities are essential: auditory perception and instruction-following. Auditory perception refers to the comprehensive processing of auditory information, including speech, non-verbal cues, background sounds, and music.
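Because the abstract only names the self-generated alignment strategy, a minimal sketch may help make it concrete. Under the assumption that each audio clip comes with textual metadata (a transcription, tagged sound events), the pipeline renders that metadata as text, has the backbone LLM answer a prompt about the description, and keeps the LLM's own answer as the training target for the (audio, prompt) pair. Every identifier below (AudioExample, describe_audio, backbone_llm) is a hypothetical placeholder, not the authors' actual code or data format.

    # Minimal sketch of a DeSTA-style self-generated data-construction loop.
    # All names and metadata formats are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class AudioExample:
        audio_path: str   # path to the raw audio clip
        metadata: dict    # e.g. {"transcription": "...", "events": ["dog bark"]}

    @dataclass
    class TrainingSample:
        audio_path: str
        prompt: str
        target: str       # response generated by the backbone LLM itself

    def describe_audio(meta: dict) -> str:
        """Render audio metadata as a textual description the LLM can read."""
        parts = []
        if "transcription" in meta:
            parts.append('Speech: "{}"'.format(meta["transcription"]))
        if "events" in meta:
            parts.append("Sounds: " + ", ".join(meta["events"]))
        return " ".join(parts)

    def build_desta_samples(
        examples: List[AudioExample],
        prompts: List[str],
        backbone_llm: Callable[[str], str],  # text-in, text-out LLM call
    ) -> List[TrainingSample]:
        samples = []
        for ex in examples:
            description = describe_audio(ex.metadata)
            for prompt in prompts:
                # The backbone LLM answers the prompt from the text description;
                # its own answer becomes the training target, so targets stay
                # in-distribution for the LLM's native language style.
                target = backbone_llm(description + "\n\n" + prompt)
                samples.append(TrainingSample(ex.audio_path, prompt, target))
        return samples

The design rationale, as the abstract states it, is that targets written by the backbone LLM itself remain in its own output distribution, so aligning the audio encoder to these targets does not force the LLM away from its native language behavior, which is what the paper identifies as the source of catastrophic forgetting in conventional instruction-tuned pipelines.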
