Chien, Chung-Ming
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Huang, Chien-yu, Chen, Wei-Chih, Yang, Shu-wen, Liu, Andy T., Li, Chen-An, Lin, Yu-Xiang, Tseng, Wei-Cheng, Diwan, Anuj, Shih, Yi-Jen, Shi, Jiatong, Chen, William, Chen, Xuanjun, Hsiao, Chi-Yuan, Peng, Puyuan, Wang, Shih-Heng, Kuan, Chun-Yi, Lu, Ke-Han, Chang, Kai-Wei, Yang, Chih-Kai, Ritter-Gutierrez, Fabian, Chuang, Ming To, Huang, Kuan-Po, Arora, Siddhant, Lin, You-Kuan, Yeo, Eunjung, Chang, Kalvin, Chien, Chung-Ming, Choi, Kwanghee, Hsieh, Cheng-Hsiu, Lin, Yi-Cheng, Yu, Chee-En, Chiu, I-Hsiang, Guimarães, Heitor R., Han, Jionghao, Lin, Tzu-Quan, Lin, Tzu-Yuan, Chang, Homu, Chang, Ting-Wu, Chen, Chun Wei, Chen, Shou-Jen, Chen, Yu-Hua, Cheng, Hsi-Chun, Dhawan, Kunal, Fang, Jia-Lin, Fang, Shi-Xin, Chiang, Kuan-Yu Fang, Fu, Chi An, Hsiao, Hsien-Fu, Hsu, Ching Yu, Huang, Shao-Syuan, Wei, Lee Chen, Lin, Hsi-Che, Lin, Hsuan-Hao, Lin, Hsuan-Ting, Lin, Jian-Ren, Liu, Ting-Chun, Lu, Li-Chun, Pai, Tsung-Min, Pasad, Ankita, Kuan, Shih-Yun Shan, Shon, Suwon, Tang, Yuxun, Tsai, Yun-Shao, Wei, Jui-Chiang, Wei, Tzu-Chieh, Wu, Chengxi, Wu, Dien-Ruei, Yang, Chao-Han Huck, Yang, Chieh-Chi, Yip, Jia Qi, Yuan, Shao-Xiang, Noroozi, Vahid, Chen, Zhehuai, Wu, Haibin, Livescu, Karen, Harwath, David, Watanabe, Shinji, Lee, Hung-yi
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Arora, Siddhant, Pasad, Ankita, Chien, Chung-Ming, Han, Jionghao, Sharma, Roshan, Jung, Jee-weon, Dhamyal, Hira, Chen, William, Shon, Suwon, Lee, Hung-yi, Livescu, Karen, Watanabe, Shinji
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
Chien, Chung-Ming, Tjandra, Andros, Vyas, Apoorv, Le, Matt, Shi, Bowen, Hsu, Wei-Ning
In this work, we propose Voicebox Adapter, Our contributions are as follows: (1) we propose Voicebox a novel approach that integrates fine-grained conditions into a Adapter, which augments Voicebox, a pre-trained speech pre-trained Voicebox speech generation model using a crossattention generation model, with fine-grained controllability; (2) we explore module. To ensure a smooth integration of newly different efficient fine-tuning methods to bridge the gap added modules with pre-trained ones, we explore various efficient between pre-trained parameters and new fine-grained conditioning fine-tuning approaches. Our experiment shows that the modules; (3) we show that Voicebox Adapter can generalize LoRA with bias-tuning configuration yields the best performance, across various fine-grained conditions, attaining performance enhancing controllability without compromising speech comparable to that achieved by fine-tuning the entire model quality. Across three fine-grained conditional generation tasks, with significantly fewer fine-tuned parameters; (4) we conduct we demonstrate the effectiveness and resource efficiency of experiments using varying amounts of fine-tuning data and different Voicebox Adapter. Follow-up experiments further highlight the hidden dimension sizes, analyzing the performance of robustness of Voicebox Adapter across diverse data setups.
AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement
Chou, Ju-Chieh, Chien, Chung-Ming, Livescu, Karen
Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve the performance. Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.
Toward Joint Language Modeling for Speech Units and Text
Chou, Ju-Chieh, Chien, Chung-Ming, Hsu, Wei-Ning, Livescu, Karen, Babu, Arun, Conneau, Alexis, Baevski, Alexei, Auli, Michael
Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform continuous speech signals into discrete units and use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint LM mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess the model's learning of shared representations. Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.
Few-Shot Spoken Language Understanding via Joint Speech-Text Models
Chien, Chung-Ming, Zhang, Mingjiamei, Chou, Ju-Chieh, Livescu, Karen
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data. With as little as 1 hour of labeled speech data, our proposed approach achieves comparable performance on spoken language understanding tasks (specifically, sentiment analysis and named entity recognition) when compared to previous methods using speech-only pre-trained models fine-tuned on 10 times more data. Beyond the proof-of-concept study, we also analyze the latent representations. We find that the bottom layers of speech-text models are largely task-agnostic and align speech and text representations into a shared space, while the top layers are more task-specific.
What do self-supervised speech models know about words?
Pasad, Ankita, Chien, Chung-Ming, Settle, Shane, Livescu, Karen
Many self-supervised speech models (S3Ms) have been introduced over the last few years, producing performance and data efficiency improvements for a variety of speech tasks. Evidence is emerging that different S3Ms encode linguistic information in different layers, and also that some S3Ms appear to learn phone-like sub-word units. However, the extent to which these models capture larger linguistic units, such as words, and where word-related information is encoded, remains unclear. In this study, we conduct several analyses of word segment representations extracted from different layers of three S3Ms: wav2vec2, HuBERT, and WavLM. We employ canonical correlation analysis (CCA), a lightweight analysis tool, to measure the similarity between these representations and word-level linguistic properties. We find that the maximal word-level linguistic content tends to be found in intermediate model layers, while some lower-level information like pronunciation is also retained in higher layers of HuBERT and WavLM. Syntactic and semantic word attributes have similar layer-wise behavior. We also find that, for all of the models tested, word identity information is concentrated near the center of each word segment. We then test the layer-wise performance of the same models, when used directly with no additional learned parameters, on several tasks: acoustic word discrimination, word segmentation, and semantic sentence similarity. We find similar layer-wise trends in performance, and furthermore, find that when using the best-performing layer of HuBERT or WavLM, it is possible to achieve performance on word segmentation and sentence similarity that rivals more complex existing approaches.