Huang, Shao-Syuan
Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling
Huang, Shao-Syuan, Huang, Kuan-Po, Liu, Andy T., Lee, Hung-yi
Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.
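The weighted sum method described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `weighted_sum_embedding`, the toy two-dimensional embeddings, and the renormalization of probabilities over the supplied languages are all assumptions made for the example. In practice the probabilities would come from Whisper's language identification output and the vectors from its language-tag embeddings.

```python
from typing import Dict, List


def weighted_sum_embedding(
    lang_probs: Dict[str, float],
    tag_embeddings: Dict[str, List[float]],
) -> List[float]:
    """Combine language-tag embeddings, weighted by predicted language probabilities.

    Hypothetical helper: the probabilities are renormalized over the languages
    passed in, which is an assumption of this sketch, not a detail from the abstract.
    """
    total = sum(lang_probs.values())
    if total <= 0:
        raise ValueError("language probabilities must sum to a positive value")

    dim = len(next(iter(tag_embeddings.values())))
    combined = [0.0] * dim
    for lang, prob in lang_probs.items():
        weight = prob / total  # renormalized weight for this seen language
        for i, value in enumerate(tag_embeddings[lang]):
            combined[i] += weight * value
    return combined


# Toy values for illustration only (not real Whisper embeddings or probabilities).
toy_embeddings = {"es": [1.0, 0.0], "pt": [0.0, 1.0], "it": [1.0, 1.0]}
toy_probs = {"es": 0.5, "pt": 0.25, "it": 0.25}

mixed = weighted_sum_embedding(toy_probs, toy_embeddings)
print(mixed)  # [0.75, 0.5]
```

The resulting vector would stand in for the language-tag embedding of an unseen language. The predictor-based approach mentioned in the abstract refines this vector further, and is not shown here.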
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Huang, Chien-yu, Chen, Wei-Chih, Yang, Shu-wen, Liu, Andy T., Li, Chen-An, Lin, Yu-Xiang, Tseng, Wei-Cheng, Diwan, Anuj, Shih, Yi-Jen, Shi, Jiatong, Chen, William, Chen, Xuanjun, Hsiao, Chi-Yuan, Peng, Puyuan, Wang, Shih-Heng, Kuan, Chun-Yi, Lu, Ke-Han, Chang, Kai-Wei, Yang, Chih-Kai, Ritter-Gutierrez, Fabian, Chuang, Ming To, Huang, Kuan-Po, Arora, Siddhant, Lin, You-Kuan, Yeo, Eunjung, Chang, Kalvin, Chien, Chung-Ming, Choi, Kwanghee, Hsieh, Cheng-Hsiu, Lin, Yi-Cheng, Yu, Chee-En, Chiu, I-Hsiang, Guimarães, Heitor R., Han, Jionghao, Lin, Tzu-Quan, Lin, Tzu-Yuan, Chang, Homu, Chang, Ting-Wu, Chen, Chun Wei, Chen, Shou-Jen, Chen, Yu-Hua, Cheng, Hsi-Chun, Dhawan, Kunal, Fang, Jia-Lin, Fang, Shi-Xin, Chiang, Kuan-Yu Fang, Fu, Chi An, Hsiao, Hsien-Fu, Hsu, Ching Yu, Huang, Shao-Syuan, Wei, Lee Chen, Lin, Hsi-Che, Lin, Hsuan-Hao, Lin, Hsuan-Ting, Lin, Jian-Ren, Liu, Ting-Chun, Lu, Li-Chun, Pai, Tsung-Min, Pasad, Ankita, Kuan, Shih-Yun Shan, Shon, Suwon, Tang, Yuxun, Tsai, Yun-Shao, Wei, Jui-Chiang, Wei, Tzu-Chieh, Wu, Chengxi, Wu, Dien-Ruei, Yang, Chao-Han Huck, Yang, Chieh-Chi, Yip, Jia Qi, Yuan, Shao-Xiang, Noroozi, Vahid, Chen, Zhehuai, Wu, Haibin, Livescu, Karen, Harwath, David, Watanabe, Shinji, Lee, Hung-yi
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.
Mimic-IV-ICD: A new benchmark for eXtreme MultiLabel Classification
Nguyen, Thanh-Tung, Schlegel, Viktor, Kashyap, Abhinav, Winkler, Stefan, Huang, Shao-Syuan, Liu, Jie-Jyun, Lin, Chih-Jen
Clinical notes are assigned ICD codes, which are sets of codes for diagnoses and procedures. In recent years, predictive machine learning models have been built for automatic ICD coding. However, there is a lack of widely accepted benchmarks for automated ICD coding models based on large-scale public EHR data. This paper proposes a public benchmark suite for ICD-10 coding using a large EHR dataset derived from MIMIC-IV, the most recent public EHR dataset. We implement and compare several popular methods for ICD coding prediction tasks to standardize data preprocessing and establish a comprehensive ICD coding benchmark dataset. This approach fosters reproducibility and model comparison, accelerating progress toward employing automated ICD coding in future studies. Furthermore, we create a new ICD-9 benchmark using MIMIC-IV data, providing more data points and a higher number of ICD codes than MIMIC-III. Our open-source code offers easy access to data processing steps, benchmark creation, and experiment replication for those with MIMIC-IV access, providing insights, guidance, and protocols to efficiently develop ICD coding models.