AITopics | audio-language model

Collaborating Authors

audio-language model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Segmentwise Pruning in Audio-Language Models

Gibier, Marcel, Duroselle, Raphaël, Serrano, Pierre, Boeffard, Olivier, Bonastre, Jean-François

arXiv.org Artificial IntelligenceNov-19-2025

Recent audio-language models have shown impressive performance across a wide range of audio tasks and are increasingly capable of handling long audio inputs. However, the computing costs in these models heavily depend on sequence length, which can become very large given the nature of audio data. In the vision-language domain, token pruning methods have proven effective in reducing token counts while preserving strong performance on standard benchmarks. In this work, we investigate the relevance and effectiveness of such token selection strategies in the context of audio-language models. We also improve them by proposing a lightweight strategy that takes the time dimension into account. While retaining only a quarter of the initial tokens, our approach results in a relative maximum decrease of 2% in CIDEr on Clotho v2 and a relative maximum decrease of 4% in accuracy on MMAU.

artificial intelligence, natural language, representation, (13 more...)

arXiv.org Artificial Intelligence

2511.14293

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models

Sakshi, S, Lokegaonkar, Vaibhavi, Zhang, Neil, Duraiswami, Ramani, Ghosh, Sreyan, Manocha, Dinesh, Lu, Lie

arXiv.org Artificial IntelligenceNov-17-2025

Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.06606

Country:

Europe (0.68)
North America > United States (0.46)
Asia > Japan (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)

Add feedback

Can large audio language models understand child stuttering speech? speech summarization, and source separation

Okocha, Chibuzor, Bakri, Maya, Grant, Christan

arXiv.org Artificial IntelligenceOct-27-2025

Child speech differs from adult speech in acoustics, prosody, and language development, and disfluencies (repetitions, prolongations, blocks) further challenge Automatic Speech Recognition (ASR) and downstream Natural Language Processing (NLP). Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding; however, their behavior in disfluent child speech remains underexplored. We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child). The tasks are (i) single-channel source separation to isolate the child and (ii) child-only summarization that preserves clinically relevant disfluencies and avoids adult-speech leakage. Evaluation combines Large Language Model (LLM) as a judge, human expert ratings, and BERTScore (F1), and we report agreement between models and between models and humans to assess reliability. Our findings delineate the conditions under which LALMs produce faithful child-only summaries from mixed audio and where they fail, offering practical guidance for clinical and educational deployments. We provide prompts and evaluation scripts to support replication.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.2085

Country: North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.47)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models

Lin, Tsung-En, Lee, Kuan-Yi, Lee, Hung-Yi

arXiv.org Artificial IntelligenceOct-16-2025

Large Audio-Language Models and Multi-Modal Large Language Models have demonstrated strong capabilities in tasks such as Audio Question Answering (AQA), Audio Captioning, and Automatic Speech Recognition (ASR). However, there is growing evidence that these models can hallucinate about the content of the audio. To address this issue, we probe the models' internal states and propose Adaptive Vector Steering (AVS), a method that better grounds generation in audio content. We also identify a strong correlation between output correctness and internal representations. Experiments show consistent performance gains across two models and two benchmarks. On the Audio Hallucination QA dataset, our method boosts the F1-score of Gemma from 0.550 to 0.619 and Qwen from 0.626 to 0.632. Furthermore, our method increases the accuracy of Qwen on MMAU from 0.548 to 0.592, marking an 8% relative increase. To the best of our knowledge, this is the first work to apply vector steering to mitigate hallucination in audio.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.12851

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.95)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.54)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)

Add feedback

Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning

Lee, Kuan-Yi, Lin, Tsung-En, Lee, Hung-Yi

arXiv.org Artificial IntelligenceOct-14-2025

Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro -- a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process. This design allows the model to analyze, transform, and interpret audio signals through specialized tools rather than relying solely on end-to-end inference. Experiments show that Audio-Maestro consistently improves general audio reasoning performance: Gemini-2.5-flash's average accuracy on MMAU-Test rises from 67.4% to 72.1%, DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%. To our knowledge, Audio-Maestro is the first framework to integrate structured tool output into the large audio language model reasoning process.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.11454

Country: Asia > Taiwan (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

Hearing the Order: Investigating Selection Bias in Large Audio-Language Models

Lin, Yu-Xiang, Li, Chen-An, Wei, Sheng-Lun, Chen, Po-Chun, Chen, Hsin-Hsi, Lee, Hung-yi

arXiv.org Artificial IntelligenceOct-2-2025

Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.00628

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Li, Chen-An, Lin, Tzu-Han, Lee, Hung-yi

arXiv.org Artificial IntelligenceOct-2-2025

Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.

artificial intelligence, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.00626

Country: Asia > Taiwan (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.54)

Add feedback

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

Shi, Jiacheng, Du, Hongfei, Hong, Y. Alicia, Gao, Ye

arXiv.org Artificial IntelligenceOct-1-2025

Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.

artificial intelligence, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2509.25458

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Lin, Yi-Cheng, Chen, Yu-Hua, Dong, Jia-Kai, Huang, Yueh-Hsuan, Chen, Szu-Chi, Chen, Yu-Chen, Chen, Chih-Yao, Lin, Yu-Jung, Chen, Yu-Ling, Chen, Zih-Yu, Tsai, I-Ning, Wang, Hsiu-Hsuan, Chung, Ho-Lam, Lu, Ke-Han, Lee, Hung-yi

arXiv.org Artificial IntelligenceOct-1-2025

Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.

benchmark, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2509.26329

Country: Asia > Taiwan (0.26)

Genre: Research Report (0.82)

Industry: Education (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

Add feedback

Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models

Xiong, Zhen, Cai, Yujun, Li, Zhecheng, Yuan, Junsong, Wang, Yiwei

arXiv.org Artificial IntelligenceSep-29-2025

Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q\&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than $50\%$ compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain $24.73\%$ absolute accuracy, with improvements scaling consistently up to $36.61\%$ for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.21749

Country: North America > United States > California (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

Add feedback