audio encoder
Boosting
Attention-based encoder decoder models remain a popular choice for state-of-the-art automatic speech recognition (ASR). These models combine a powerful audio encoder that extracts rich acoustic features with a decoder that autoregressively produces the ASR output. The decoder handles two critical tasks: (1) building rich text-only context and (2) merging acoustic information from the encoder to ensure the predictions remain faithful to the audio. We observe a systematic pattern across the attention distributions of decoder layers in prior architectures: the initial layers direct most attention towards building textual context, while the later layers largely focus on merging acoustic and textual information for the final predictions. Leveraging this key insight, we propose BLOCKDECODER, a novel decoder architecture comprising two distinct components: a text encoder that is purely text-based, and a MERGER that combines information from the audio encoder and text encoder to generate output tokens. Unlike traditional decoders, the MERGER autoregressively predicts a sequence of K tokens within a block of size K, while relying on the same precomputed contextual information from both text and audio encoders across the block. This design choice allows for the efficient reuse of encoder representations. The separation of the decoder into the text encoder and the MERGER promotes modularity and more flexible control of parameters via the number of text encoder and MERGER layers. As a result, BLOCKDECODER yields a significant speedup ( 2x) compared to traditional decoders, across diverse datasets, languages, and speech tasks, without any degradation in performance.
Mellow: a small audio language model for reasoning
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTAQwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs).
BlockDecoder: Boosting ASR Decoders with Context and Merger Modules
Attention-based encoder decoder models remain a popular choice for state-of-the-art automatic speech recognition (ASR). These models combine a powerful audio encoder that extracts rich acoustic features with a decoder that autoregressively produces the ASR output. The decoder handles two critical tasks: (1) building rich text-only context and (2) merging acoustic information from the encoder to ensure the predictions remain faithful to the audio. We observe a systematic pattern across the attention distributions of decoder layers in prior architectures: the initial layers direct most attention towards building textual context, while the later layers largely focus on merging acoustic and textual information for the final predictions. Leveraging this key insight, we propose **BlockDecoder**, a novel decoder architecture comprising two distinct components: a text encoder that is purely text-based, and a **Merger** that combines information from the audio encoder and text encoder to generate output tokens. Unlike traditional decoders, the **Merger** autoregressively predicts a sequence of K tokens within a *block* of size K, while relying on the same precomputed contextual information from both text and audio encoders across the block. This design choice allows for the efficient reuse of encoder representations. The separation of the decoder into the text encoder and the **Merger** promotes modularity and more flexible control of parameters via the number of text encoder and **Merger** layers. As a result, **BlockDecoder** yields a significant speedup ($\sim2$x) compared to traditional decoders, across diverse datasets, languages, and speech tasks, without any degradation in performance.
SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality
Monjur, Mahathir, Nirjon, Shahriar
Objective speech quality assessment is central to telephony, V oIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they yield fixed scalar scores, do not support interactive, natural-language queries, and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, SpeechQualityLLM is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of approximately 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, SpeechQualityLLM offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost. W e provide a general pipeline for adapting large language models to specialized audio quality assessment tasks via lightweight mul-timodal alignment. Code, model weights, and experimental results are available at GitHub.
Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding
Wang, Tsai-Ning, Chen, Lin-Lin, Zeghidour, Neil, Saeed, Aaqib
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning
Moreira, Diego A. B., Ferreira, Alef I., Silva, Jhessica, Santos, Gabriel O. dos, Bonil, Gustavo, Gondim, Joรฃo, Santos, Marina dos, Maia, Helena, Hashiguti, Simone, da Silva, Nรกdia, Scarton, Carolina, Pedrini, Helio, Avila, Sandra
As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually materialize a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. Such emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves up to a 14.24 percentage points improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models -- all without the heavy computational cost of retraining across every modality and language.
Towards Audio Token Compression in Large Audio Language Models
Bhati, Saurabhchand, Thomas, Samuel, Kuehne, Hilde, Feris, Rogerio, Glass, James
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation, uniform average pooling, etc., to reduce the number of audio tokens generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation tasks, that are dependent on effectively uncovering the underlying lexical content of the input signal and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance closer to frame-level LALMs while reducing the input audio token count upto three times before the LLM backbone.
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Tseng, Wei-Cheng, Zhou, Xuanru, Huo, Mingyue, Shao, Yiwen, Zhang, Hao, Yu, Dong
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding. Early advances relied on supervised learning, where models trained on labeled corpora were adapted to related downstream tasks or transferred across domains (Kong et al., 2020; Chen et al., 2022a; Snyder et al., 2018; Desplanques et al., 2020).
SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models
Sakshi, S, Lokegaonkar, Vaibhavi, Zhang, Neil, Duraiswami, Ramani, Ghosh, Sreyan, Manocha, Dinesh, Lu, Lie
Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.