Chang, Xuankai
Towards Robust Speech Representation Learning for Thousands of Languages
Chen, William, Zhang, Wangyou, Peng, Yifan, Li, Xinjian, Tian, Jinchuan, Shi, Jiatong, Chang, Xuankai, Maiti, Soumi, Livescu, Karen, Watanabe, Shinji
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/.
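To make the augmented objective concrete, the following is a minimal PyTorch sketch of combining HuBERT-style masked pseudo-label prediction with a dereverberation target on the same encoder output. The module sizes, loss weight, and feature shapes are illustrative assumptions, not XEUS's actual configuration.

# Sketch: masked pseudo-label prediction plus a dereverberation target,
# in the spirit of the augmented SSL objective described above.
# All shapes, module sizes, and the loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedPredictionWithDereverb(nn.Module):
    def __init__(self, encoder_dim=768, num_pseudo_labels=500,
                 n_mels=80, dereverb_weight=0.1):
        super().__init__()
        self.label_head = nn.Linear(encoder_dim, num_pseudo_labels)  # masked prediction
        self.dereverb_head = nn.Linear(encoder_dim, n_mels)          # predict clean features
        self.dereverb_weight = dereverb_weight

    def forward(self, encoder_out, mask, pseudo_labels, clean_mels):
        # encoder_out: (B, T, D) features of the *reverberant* input
        # mask: (B, T) boolean, True where frames were masked
        # pseudo_labels: (B, T) discrete targets (e.g. k-means cluster ids)
        # clean_mels: (B, T, n_mels) log-mel features of the clean source
        logits = self.label_head(encoder_out)                  # (B, T, K)
        pred_loss = F.cross_entropy(logits[mask], pseudo_labels[mask])
        dereverb_pred = self.dereverb_head(encoder_out)        # (B, T, n_mels)
        dereverb_loss = F.l1_loss(dereverb_pred, clean_mels)
        return pred_loss + self.dereverb_weight * dereverb_loss

In this sketch the encoder receives artificially reverberated audio, while the dereverberation head is supervised with features of the original clean signal, which is one way a dereverberation objective can encourage robustness to recording conditions.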
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
Shi, Jiatong, Wang, Shih-Heng, Chen, William, Bartelds, Martijn, Kumar, Vanya Bannihatti, Tian, Jinchuan, Chang, Xuankai, Jurafsky, Dan, Livescu, Karen, Lee, Hung-yi, Watanabe, Shinji
ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the original ML-SUPERB setup, but performance depends on the downstream model design. We also find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.
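For readers unfamiliar with the protocol, the sketch below shows the kind of setup the original benchmark describes: a frozen SSL encoder used as a feature extractor with a single shallow downstream model trained on top. The head architecture here (one Transformer layer followed by a CTC output) is an assumption for illustration, not the benchmark's exact configuration.

# Sketch: a shallow downstream ASR head on top of frozen SSL features.
# The head design and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ShallowCTCHead(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=256, vocab_size=100):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                                batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # logits for a CTC loss

    def forward(self, ssl_features):
        # ssl_features: (B, T, feat_dim) extracted under torch.no_grad()
        # from a frozen SSL model, so only this head is updated in training.
        return self.out(self.layer(self.proj(ssl_features)))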
A Large-Scale Evaluation of Speech Foundation Models
Yang, Shu-wen, Chang, Heng-Jui, Huang, Zili, Liu, Andy T., Lai, Cheng-I, Wu, Haibin, Shi, Jiatong, Chang, Xuankai, Tsai, Hsiang-Sheng, Huang, Wen-Chin, Feng, Tzu-hsun, Chi, Po-Han, Lin, Yist Y., Chuang, Yung-Sung, Huang, Tzu-Hsien, Tseng, Wei-Cheng, Lakhotia, Kushal, Li, Shang-Wen, Mohamed, Abdelrahman, Watanabe, Shinji, Lee, Hung-yi
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol, and the statistical significance and robustness of the benchmark.
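As a concrete illustration of the weighted-sum benchmarking protocol analyzed above, the sketch below combines a frozen model's per-layer hidden states with learnable softmax-normalized weights; only these weights and the lightweight prediction head are trained. Dimensions and naming are assumptions for illustration.

# Sketch: weighted sum of a frozen encoder's per-layer hidden states.
import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar weight per encoder layer.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (B, T, D) tensors, one per layer
        stacked = torch.stack(hidden_states, dim=0)             # (L, B, T, D)
        norm_w = torch.softmax(self.weights, dim=0)             # (L,)
        return (norm_w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)

The resulting features are then passed to a small task-specific prediction head; the learned weights also indicate which layers of the frozen model a given task relies on.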
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Kim, Minsu, Jung, Jee-weon, Rha, Hyeongseop, Maiti, Soumi, Arora, Siddhant, Chang, Xuankai, Watanabe, Shinji, Ro, Yong Man
The capability to jointly process multi-modal information is becoming essential. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT consistently outperforms its single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
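The sketch below illustrates the "modalities as languages" interface described above: once speech and images are tokenized into discrete ids, any translation direction reduces to ordinary sequence-to-sequence modeling over one shared vocabulary. The special tokens and codebook sizes are illustrative assumptions, not TMT's actual values.

# Sketch: mapping speech/image/text token ids into one shared vocabulary,
# so every modality pair becomes a seq2seq (machine translation) problem.
SPECIAL = ["<bos>", "<eos>", "<speech>", "<image>", "<text>"]
VOCAB_SIZES = {"<speech>": 1024, "<image>": 8192, "<text>": 32000}  # assumed sizes

def modality_offset(tag):
    # Each modality's tokens occupy a disjoint slice of the shared vocabulary.
    offset = len(SPECIAL)
    for name, size in VOCAB_SIZES.items():
        if name == tag:
            return offset
        offset += size
    raise KeyError(tag)

def to_sequence(tag, token_ids):
    off = modality_offset(tag)
    return ([SPECIAL.index(tag), SPECIAL.index("<bos>")]
            + [t + off for t in token_ids]
            + [SPECIAL.index("<eos>")])

# Speech-to-text translation then looks like ordinary MT over token ids:
src = to_sequence("<speech>", [412, 7, 93, 93, 15])  # ids from a speech tokenizer
tgt = to_sequence("<text>", [101, 57, 8])            # ids from a text BPE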
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Peng, Yifan, Tian, Jinchuan, Chen, William, Arora, Siddhant, Yan, Brian, Sudo, Yui, Shakeel, Muhammad, Choi, Kwanghee, Shi, Jiatong, Chang, Xuankai, Jung, Jee-weon, Watanabe, Shinji
Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. Because they aimed to reproduce Whisper faithfully, the previous OWSM v1 through v3 models were still based on the standard Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer-based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer-based speech model that has been made publicly available. It outperforms the previous OWSM v3 in the vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models, and training logs.
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks
Maiti, Soumi, Peng, Yifan, Choi, Shukjae, Jung, Jee-weon, Chang, Xuankai, Watanabe, Shinji
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, improving speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. Further, VoxtLM is trained with publicly available data, and the training recipes and model checkpoints are open-sourced to make the work fully reproducible.
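A minimal sketch of how a single decoder-only token stream can cover the four tasks by interleaving text tokens and discrete speech tokens behind task-specific special tokens; the token names and sequence layout are assumptions for illustration, not VoxtLM's exact formatting.

# Sketch: one shared token stream for ASR, TTS, text LM, and speech LM,
# trained with an ordinary next-token prediction loss.
TEXT_VOCAB = 32000
SPEECH_VOCAB = 1000
TASK_TOKENS = {"<asr>": 0, "<tts>": 1, "<textlm>": 2, "<speechlm>": 3}
TEXT_OFFSET = len(TASK_TOKENS)
SPEECH_OFFSET = TEXT_OFFSET + TEXT_VOCAB

def build_example(task, speech_tokens=None, text_tokens=None):
    seq = [TASK_TOKENS[task]]
    if task == "<asr>":        # condition on speech, predict text
        seq += [t + SPEECH_OFFSET for t in speech_tokens]
        seq += [t + TEXT_OFFSET for t in text_tokens]
    elif task == "<tts>":      # condition on text, predict speech
        seq += [t + TEXT_OFFSET for t in text_tokens]
        seq += [t + SPEECH_OFFSET for t in speech_tokens]
    elif task == "<textlm>":   # text continuation
        seq += [t + TEXT_OFFSET for t in text_tokens]
    else:                      # "<speechlm>": speech continuation
        seq += [t + SPEECH_OFFSET for t in speech_tokens]
    return seq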
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Peng, Yifan, Tian, Jinchuan, Yan, Brian, Berrebbi, Dan, Chang, Xuankai, Li, Xinjian, Shi, Jiatong, Arora, Siddhant, Chen, William, Sharma, Roshan, Zhang, Wangyou, Sudo, Yui, Shakeel, Muhammad, Jung, Jee-weon, Maiti, Soumi, Watanabe, Shinji
Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.
Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond
Shi, Jiatong, Chen, William, Berrebbi, Dan, Wang, Hsiu-Hsuan, Huang, Wei-Ping, Hu, En-Pei, Chuang, Ho-Lam, Chang, Xuankai, Tang, Yuxun, Li, Shang-Wen, Mohamed, Abdelrahman, Lee, Hung-yi, Watanabe, Shinji
The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track where language resource researchers can contribute and evaluate their low-resource language data in the context of the latest progress in multilingual speech recognition. The benchmark primarily focuses on evaluating SSL models for automatic speech recognition (ASR) and language identification (LID). To cater to different use cases for SSL models, ML-SUPERB includes two tracks with four different tasks: the monolingual track (monolingual ASR) and the multilingual track (multilingual ASR, LID, joint multilingual ASR/LID). Similar to SUPERB, ML-SUPERB utilizes frozen SSL models as feature extractors and employs a lightweight downstream model that can be fine-tuned for different tracks to achieve high training efficiency. The released public benchmark of ML-SUPERB covers 143 languages.
HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model
Maekaku, Takashi, Shi, Jiatong, Chang, Xuankai, Fujita, Yuya, Watanabe, Shinji
Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or the model's own representation features. Previous studies have shown that these pseudo-labels contain semantic information. However, the masked prediction task, the learning criterion of HuBERT, focuses on local contextual information and may not make effective use of global semantic information such as the speaker or the theme of the speech. In this paper, we propose a new approach to enrich the semantic representation of HuBERT. We apply a topic model to the pseudo-labels to generate a topic label for each utterance. An auxiliary topic classification task is added to HuBERT, using the topic labels as teachers. This allows additional global semantic information to be incorporated in an unsupervised manner. Experimental results demonstrate that our method achieves comparable or better performance than the baseline on most tasks, including automatic speech recognition and five out of the eight SUPERB tasks. Moreover, we find that the topic labels capture various information about an utterance, such as the speaker's gender and identity and the theme of the speech. This highlights the effectiveness of our approach in capturing multifaceted semantic nuances.
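The sketch below shows one way utterance-level topic labels could be derived as described above: each utterance's pseudo-label sequence is treated as a bag of discrete "words" and a topic model assigns it a topic. The use of scikit-learn's LDA and the cluster/topic counts here are illustrative assumptions.

# Sketch: deriving an utterance-level topic label from HuBERT pseudo-labels.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

NUM_CLUSTERS = 500   # size of the k-means pseudo-label inventory (assumed)
NUM_TOPICS = 20      # number of topics (assumed)

def bag_of_labels(pseudo_labels, num_clusters=NUM_CLUSTERS):
    # Count how often each discrete pseudo-label occurs in one utterance.
    return np.bincount(pseudo_labels, minlength=num_clusters)

def topic_labels(pseudo_label_seqs, num_topics=NUM_TOPICS):
    # pseudo_label_seqs: list of 1-D integer arrays, one per utterance
    X = np.stack([bag_of_labels(np.asarray(s)) for s in pseudo_label_seqs])
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=0)
    doc_topics = lda.fit_transform(X)     # (num_utterances, num_topics)
    # The most probable topic per utterance becomes the auxiliary
    # classification target used alongside masked prediction.
    return doc_topics.argmax(axis=1)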
Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning
Chen, William, Shi, Jiatong, Yan, Brian, Berrebbi, Dan, Zhang, Wangyou, Peng, Yifan, Chang, Xuankai, Maiti, Soumi, Watanabe, Shinji
Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expense and complexity of handling many languages. This further harms the reproducibility of SSL, which is already limited to a few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method designed to address the language imbalance of multilingual data. WavLabLM achieves comparable performance to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL realizable with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet.
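As a rough illustration of the joint prediction and denoising objective that WavLabLM inherits from WavLM, the sketch below corrupts the input by mixing in an interfering signal while keeping the masked-prediction targets tied to the clean utterance's pseudo-labels. The mixing procedure, SNR, and shapes are illustrative assumptions, not the paper's exact recipe.

# Sketch: joint prediction and denoising in a WavLM-style setup.
import torch
import torch.nn.functional as F

def simulate_noisy_input(clean_wav, interference_wav, snr_db=5.0):
    # Overlay an interfering signal (noise or another utterance, assumed to
    # have the same length) at a given SNR; targets stay tied to clean_wav.
    clean_power = clean_wav.pow(2).mean()
    interf_power = interference_wav.pow(2).mean().clamp(min=1e-8)
    scale = torch.sqrt(clean_power / (interf_power * 10 ** (snr_db / 10)))
    return clean_wav + scale * interference_wav

def denoising_masked_prediction_loss(encoder, label_head, noisy_wav,
                                     mask, clean_pseudo_labels):
    # encoder: waveform -> (B, T, D); label_head: D -> number of clusters
    feats = encoder(noisy_wav)
    logits = label_head(feats)
    # Predict the *clean* utterance's pseudo-labels at masked frames,
    # so the model must implicitly denoise while it predicts.
    return F.cross_entropy(logits[mask], clean_pseudo_labels[mask])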