AITopics | salmonn

Collaborating Authors

salmonn

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

Guo, Yiwei, Li, Bohan, Wang, Hankun, Li, Zhihan, Wang, Shuai, Chen, Xie, Yu, Kai

arXiv.org Artificial IntelligenceDec-2-2025

Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.

ahamask, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2509.01787

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

Ògúnrèmí, Tolúlopé, Manning, Christopher D., Jurafsky, Dan, Livescu, Karen

arXiv.org Artificial IntelligenceOct-20-2025

Abstract--Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Y et we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don't, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which arises depends on whether the speech encoder is trained only for speech recognition or also for translation.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2510.02569

Country: North America (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)

Add feedback

Investigating Faithfulness in Large Audio Language Models

Jain, Lovenya, Mousavi, Pooneh, Ravanelli, Mirco, Subakan, Cem

arXiv.org Artificial IntelligenceOct-15-2025

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

artificial intelligence, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.22363

Country: North America > Canada (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)

Add feedback

How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations

Lee, Hyunji, Liu, Danni, Sinhamahapatra, Supriti, Niehues, Jan

arXiv.org Artificial IntelligenceNov-26-2024

Multimodal foundation models aim to create a unified representation space that abstracts away from surface features like language syntax or modality differences. To investigate this, we study the internal representations of three recent models, analyzing the model activations from semantically equivalent sentences across languages in the text and speech modalities. Our findings reveal that: 1) Cross-modal representations converge over model layers, except in the initial layers specialized at text and speech processing. 2) Length adaptation is crucial for reducing the cross-modal gap between text and speech, although current approaches' effectiveness is primarily limited to high-resource languages. 3) Speech exhibits larger cross-lingual differences than text. 4) For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.

computational linguistic, representation, similarity, (15 more...)

arXiv.org Artificial Intelligence

2411.17666

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(13 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
(2 more...)

Add feedback

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Wang, Siyin, Yu, Wenyi, Yang, Yudong, Tang, Changli, Li, Yixuan, Zhuang, Jimin, Chen, Xianzhao, Tian, Xiaohai, Zhang, Jun, Sun, Guangzhi, Lu, Lu, Zhang, Chao

arXiv.org Artificial IntelligenceSep-25-2024

Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM) etc., which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects like noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language descriptions task, a commercial model Google Gemini 1.5 Pro is also evaluated. The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints will be released upon acceptance.

auditory llm, dataset, salmonn, (11 more...)

arXiv.org Artificial Intelligence

2409.16644

Country:

Asia > South Korea > Seoul > Seoul (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
Asia > South Korea > Incheon > Incheon (0.05)
(11 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Tang, Changli, Yu, Wenyi, Sun, Guangzhi, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao

arXiv.org Artificial IntelligenceOct-20-2023

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning \textit{etc.} SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning \textit{etc}. The presence of the cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities of SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at \texttt{\url{https://github.com/bytedance/SALMONN}}, and the training code and model checkpoints will be released upon acceptance.

instruction, salmonn, speech recognition, (14 more...)

arXiv.org Artificial Intelligence

2310.13289

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(9 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (0.67)
Media > Music (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback