Chen, Xianzhao
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
Yu, Wenyi, Wang, Siyin, Yang, Xiaoyu, Chen, Xianzhao, Tian, Xiaohai, Zhang, Jun, Sun, Guangzhi, Lu, Lu, Wang, Yuxuan, Zhang, Chao
Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a ``thinking'' mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.
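As a rough illustration of the full-duplex, codec-free behaviour described in the abstract, the sketch below shows a toy loop in which the model ingests a block of incoming audio embeddings at every step (which would include its own playback and background sound) and either stays in a silent "thinking" state or emits a continuous speech embedding. All class names, dimensions, and the binary think/speak decision are hypothetical simplifications for illustration, not the SALMONN-omni architecture.

```python
# Minimal sketch (not the released SALMONN-omni implementation) of a codec-free
# full-duplex loop: each step the model ingests a block of input speech
# embeddings and either takes a "thinking" step (stay silent, keep listening)
# or produces a continuous speech embedding. Shapes and modules are assumptions.
import torch
import torch.nn as nn

EMB_DIM = 512
THINK, SPEAK = 0, 1  # hypothetical turn-control states


class ToyDuplexCore(nn.Module):
    """Stand-in for the LLM backbone plus streaming speech encoder/decoder."""

    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # streaming speech encoder
        self.state_head = nn.Linear(dim, 2)                  # think vs. speak decision
        self.speech_head = nn.Linear(dim, dim)               # continuous (codec-free) speech embedding

    def step(self, audio_block: torch.Tensor, hidden=None):
        # audio_block: (1, frames, dim) embeddings of the latest audio chunk,
        # which includes the model's own generated speech (for echo handling).
        out, hidden = self.encoder(audio_block, hidden)
        last = out[:, -1]                                    # summary of this block
        state = self.state_head(last).argmax(-1).item()      # THINK or SPEAK
        speech_emb = self.speech_head(last) if state == SPEAK else None
        return state, speech_emb, hidden


def duplex_loop(model: ToyDuplexCore, audio_stream, max_steps: int = 100):
    """Consume audio blocks while (possibly) speaking; speaking never blocks listening."""
    hidden = None
    for step_idx, audio_block in zip(range(max_steps), audio_stream):
        state, speech_emb, hidden = model.step(audio_block, hidden)
        if state == SPEAK:
            # In the real system this embedding would drive a streaming synthesiser;
            # here we just report that the model chose to talk.
            print(f"step {step_idx}: speaking, emb norm={speech_emb.norm().item():.2f}")
        else:
            print(f"step {step_idx}: thinking (silent, still listening)")


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyDuplexCore()
    fake_stream = (torch.randn(1, 8, EMB_DIM) for _ in range(5))  # stand-in audio blocks
    duplex_loop(model, fake_stream, max_steps=5)
```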
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
Wang, Siyin, Yu, Wenyi, Yang, Yudong, Tang, Changli, Li, Yixuan, Zhuang, Jimin, Chen, Xianzhao, Tian, Xiaohai, Zhang, Jun, Sun, Guangzhi, Lu, Lu, Zhang, Chao
Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to cover with a single small model designed for one task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects such as noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language description task, the commercial model Google Gemini 1.5 Pro is also evaluated. The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints will be released upon acceptance.
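To make the prompting setup concrete, the sketch below shows how prompt-based MOS prediction with an auditory LLM might be wired up: a task-specific prompt, a call to the model, and a parser that extracts a 1-5 score from the free-form reply. The loader/generate interface, prompt wording, and parsing rule are illustrative assumptions, not the actual SALMONN or Qwen2-Audio API.

```python
# Hedged sketch of prompt-based MOS prediction with an auditory LLM. The
# `model.generate(audio=..., prompt=...)` interface is a placeholder for
# whichever auditory-LLM inference call is actually available.
import re


def build_mos_prompt() -> str:
    # Task-specific prompt: ask for a single 1-5 naturalness rating.
    return (
        "Listen to the speech sample and rate its overall quality on a "
        "mean-opinion-score scale from 1 (bad) to 5 (excellent). "
        "Answer with a single number."
    )


def parse_mos(text: str) -> float | None:
    """Pull the first number in [1, 5] out of the model's free-form reply."""
    for match in re.findall(r"\d+(?:\.\d+)?", text):
        value = float(match)
        if 1.0 <= value <= 5.0:
            return value
    return None


def predict_mos(model, wav_path: str) -> float | None:
    reply = model.generate(audio=wav_path, prompt=build_mos_prompt())
    return parse_mos(reply)


if __name__ == "__main__":
    class EchoModel:            # trivial stub so the sketch runs end to end
        def generate(self, audio: str, prompt: str) -> str:
            return "I would rate this sample 3.5 out of 5."

    print(predict_mos(EchoModel(), "sample.wav"))  # -> 3.5
```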
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Tang, Changli, Yu, Wenyi, Sun, Guangzhi, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, referring to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performance on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also has a diverse set of emergent abilities unseen during training, including but not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. The presence of these cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate them in SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at \texttt{\url{https://github.com/bytedance/SALMONN}}, and the training code and model checkpoints will be released upon acceptance.
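The connector idea behind this kind of model can be sketched in a few lines: a small set of trainable query vectors cross-attends to frame-level speech/audio encoder features within fixed-size windows, and the result is projected into the text LLM's embedding space so that the number of "audio tokens" grows with the input length. The dimensions, the single-layer attention, and the window splitting below are simplifying assumptions, not the released SALMONN architecture.

```python
# Minimal sketch of a window-level query-based connector between audio encoders
# and a text LLM. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class WindowQFormerConnector(nn.Module):
    def __init__(self, enc_dim=1280, llm_dim=4096, n_queries=1, window=17, n_heads=8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)  # into LLM token-embedding space

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, frames, enc_dim) concatenated speech/audio features.
        b, t, d = enc_feats.shape
        pad = (-t) % self.window
        if pad:
            enc_feats = torch.cat([enc_feats, enc_feats.new_zeros(b, pad, d)], dim=1)
        # Split the sequence into fixed-size windows and run the same queries on
        # each window, so the output length scales with the input length.
        windows = enc_feats.reshape(b, -1, self.window, d).flatten(0, 1)    # (b*w, window, d)
        q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)       # (b*w, n_queries, d)
        fused, _ = self.cross_attn(q, windows, windows)
        return self.proj(fused).reshape(b, -1, self.proj.out_features)      # (b, w*n_queries, llm_dim)


if __name__ == "__main__":
    connector = WindowQFormerConnector()
    feats = torch.randn(2, 100, 1280)          # stand-in encoder output
    print(connector(feats).shape)              # torch.Size([2, 6, 4096]) with window=17
```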
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
Sun, Guangzhi, Yu, Wenyi, Tang, Changli, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao
Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, comprising six representative single-modal tasks and five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved over 20% accuracy improvements on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrated remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other multimodal LLMs.
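The causal, frame-level fusion described above can be illustrated with a short sketch: per-frame audio and visual features are fused, and self-attention over the time axis is masked so that each frame attends only to itself and earlier frames. The simple concatenation fusion, the single Transformer layer, and all dimensions below are illustrative assumptions rather than the FAVOR causal Q-Former itself.

```python
# Hedged sketch of frame-level audio-visual fusion with a causal attention
# module. Sizes and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn


class CausalAVFusion(nn.Module):
    def __init__(self, audio_dim=768, visual_dim=1024, model_dim=512, n_heads=8):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + visual_dim, model_dim)  # frame-wise fusion
        self.layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=n_heads, batch_first=True
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, frames, audio_dim)  -- synchronised at the frame level
        # visual: (batch, frames, visual_dim)
        frames = audio.size(1)
        fused = self.fuse(torch.cat([audio, visual], dim=-1))
        # Causal mask: frame t may only attend to frames <= t.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(frames)
        return self.layer(fused, src_mask=causal_mask)  # (batch, frames, model_dim)


if __name__ == "__main__":
    model = CausalAVFusion()
    a = torch.randn(2, 30, 768)     # stand-in audio frame features
    v = torch.randn(2, 30, 1024)    # stand-in visual frame features
    print(model(a, v).shape)        # torch.Size([2, 30, 512])
```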
Connecting Speech Encoder and Large Language Model for ASR
Yu, Wenyi, Tang, Changli, Sun, Guangzhi, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao
The impressive capability and versatility of large language models (LLMs) have attracted increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder with an LLM. This paper presents a comparative study of three commonly used connector structures: fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series as well as LLMs from the Vicuna model series with different model sizes were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with Q-Formers demonstrated consistent and considerable word error rate (WER) reductions over LLMs with other connector structures. Q-Former-based LLMs can also generalise well to out-of-domain datasets: 12% relative WER reductions over the Whisper baseline ASR model were achieved on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments with a duration exceeding the limitation of the encoders, which results in 17% relative WER reductions over other connector structures on 90-second-long speech data.
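The three connector families compared in the paper can be reduced to toy single-layer modules with a common interface, as in the sketch below: each maps Whisper-style encoder features of shape (batch, frames, enc_dim) to a sequence of LLM input embeddings. Layer counts, dimensions, the fixed number of latent vectors or queries, and the omission of the Q-Former's BERT-style self-attention blocks are all simplifying assumptions.

```python
# Toy versions of three encoder-to-LLM connector structures. Sizes are assumptions.
import torch
import torch.nn as nn


class FullyConnectedConnector(nn.Module):
    """Frame-by-frame projection: output length equals input length."""

    def __init__(self, enc_dim=1280, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats):
        return self.proj(feats)


class CrossAttentionConnector(nn.Module):
    """A fixed set of latent vectors cross-attends to the encoder output."""

    def __init__(self, enc_dim=1280, llm_dim=4096, n_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(enc_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, feats):
        kv = self.kv_proj(feats)
        q = self.latents.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out


class QFormerConnector(nn.Module):
    """Learnable queries attend to encoder features, then project to LLM space."""

    def __init__(self, enc_dim=1280, llm_dim=4096, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        self.attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats):
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)
        return self.proj(out)


if __name__ == "__main__":
    feats = torch.randn(2, 1500, 1280)  # roughly 30 s of Whisper encoder frames
    for connector in (FullyConnectedConnector(), CrossAttentionConnector(), QFormerConnector()):
        print(type(connector).__name__, connector(feats).shape)
```

A segment-level variant, as proposed in the paper for long recordings, would apply the query-based connector to successive segments of the encoder output and concatenate the resulting embeddings; that extension is not shown here.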
Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition
Chen, Xianzhao, Lin, Yist Y., Wang, Kang, He, Yi, Ma, Zejun
End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many applications, especially for subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in E2E systems by introducing label priors in the connectionist temporal classification (CTC) loss, adopted from prior works, and by combining low-level Mel-scale filter banks with high-level ASR encoder output as the input feature. On the internal Chinese corpus, the proposed method achieves 95.68%/94.18%
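The label-prior part of the method can be sketched as follows: the frame-level log-posteriors are divided by an estimate of the label prior (a subtraction in log space) before the CTC loss is computed, which penalises the otherwise dominant blank probability and reduces peaky alignments. The prior estimate, the scaling factor, and all shapes below are illustrative assumptions; the combination of filter-bank and encoder features is not shown.

```python
# Hedged sketch of CTC with label priors for less peaky alignments.
import math
import torch
import torch.nn.functional as F


def ctc_loss_with_label_priors(
    logits: torch.Tensor,        # (T, batch, n_labels) frame-level classifier output
    targets: torch.Tensor,       # (batch, max_target_len) label ids, blank (0) excluded
    input_lengths: torch.Tensor,
    target_lengths: torch.Tensor,
    log_priors: torch.Tensor,    # (n_labels,) label-prior estimate in log space
    prior_scale: float = 0.3,    # assumed down-weighting factor
) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    # Divide posteriors by priors (subtract in log space); the result is an
    # unnormalised score, which CTC's path-sum still accepts.
    adjusted = log_probs - prior_scale * log_priors.view(1, 1, -1)
    return F.ctc_loss(adjusted, targets, input_lengths, target_lengths, blank=0)


if __name__ == "__main__":
    T, B, V = 50, 2, 10
    logits = torch.randn(T, B, V, requires_grad=True)
    targets = torch.randint(1, V, (B, 12))
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), 12, dtype=torch.long)
    # In practice the priors would be estimated from averaged frame posteriors
    # over the training set; a uniform prior is used here just to run the sketch.
    log_priors = torch.full((V,), -math.log(V))
    loss = ctc_loss_with_label_priors(logits, targets, input_lengths, target_lengths, log_priors)
    loss.backward()
    print(float(loss))
```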