MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
Peng, Yifan, Kulikov, Ilia, Yang, Yilin, Popuri, Sravya, Lu, Hui, Wang, Changhan, Gong, Hongyu
There has been growing research interest and progress in speech-to-speech translation (S2ST), which translates utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST with speaker style preserved.
A survey of recent methods for addressing AI fairness and bias in biomedicine
Yang, Yifan, Lin, Mingquan, Zhao, Han, Peng, Yifan, Huang, Furong, Lu, Zhiyong
Artificial intelligence (AI) systems have the potential to revolutionize clinical practice, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or exhibit biases, such as those based on race or gender. Such biases can arise before, during, or after the development of AI models, making it critical to understand and address them so that AI models can be applied accurately and reliably in clinical settings. To address bias concerns during model development, we surveyed recent publications on debiasing methods in the fields of biomedical natural language processing (NLP) and computer vision (CV). We then discuss the methods that have been applied in the biomedical domain to address bias. We searched PubMed, the ACM Digital Library, and IEEE Xplore for relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then automatically filtered the resulting 10,041 articles with loose constraints and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review; additional articles found in their references are also included. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness. The bias of AI in biomedicine can originate from multiple sources. Existing debiasing methods that focus on algorithms can be categorized as distributional or algorithmic.
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
Wu, Yihan, Maiti, Soumi, Peng, Yifan, Zhang, Wangyou, Li, Chenda, Wang, Yuyue, Wang, Xihua, Watanabe, Shinji, Song, Ruihua
Recent advancements in language models have significantly enhanced performance on multiple speech-related tasks. Existing speech language models typically use task-dependent prompt tokens to unify various speech tasks in a single model. However, this design ignores the intrinsic connections between different speech tasks, which could boost the performance of each task. In this work, we propose SpeechComposer, a novel decoder-only speech language model that unifies common speech tasks by composing a fixed set of prompt tokens. Built upon four primary tasks -- speech synthesis, speech recognition, speech language modeling, and text language modeling -- SpeechComposer can easily extend to more speech tasks, such as voice conversion and speech enhancement, via compositions of well-designed prompt tokens. The unification of prompt tokens also enables knowledge sharing among different speech tasks in a more structured manner. Experimental results demonstrate that SpeechComposer improves the performance of both primary tasks and composite tasks, showing the effectiveness of the shared prompt tokens. Remarkably, the unified decoder-only model achieves performance comparable to, and in some cases better than, baselines that are expert models designed for single tasks.
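To make the prompt-composition idea concrete, here is a minimal sketch of how a fixed set of prompt tokens could be composed into task-specific decoder inputs. The token names and task layouts are illustrative assumptions, not SpeechComposer's actual vocabulary or ordering.

```python
# A minimal sketch of prompt composition for a decoder-only speech LM.
# Token names and task layouts are illustrative assumptions, not the
# actual SpeechComposer vocabulary.

SPEECH = "<speech>"    # marks a span of discrete speech tokens
TEXT = "<text>"        # marks a span of text tokens
GEN = "<generate>"     # asks the decoder to continue from here

def compose_prompt(task: str, src_tokens: list[str]) -> list[str]:
    """Build a decoder input by composing a fixed set of prompt tokens."""
    layouts = {
        "asr": [SPEECH, *src_tokens, TEXT, GEN],   # speech -> text
        "tts": [TEXT, *src_tokens, SPEECH, GEN],   # text -> speech
        "slm": [SPEECH, *src_tokens, GEN],         # speech continuation
        "tlm": [TEXT, *src_tokens, GEN],           # text continuation
    }
    if task not in layouts:
        raise ValueError(f"unknown task: {task}")
    return layouts[task]
```

Because every task is expressed with the same small token set, a composite task such as speech enhancement can reuse the same markers in a new arrangement, which is the structural knowledge sharing the abstract describes.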
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Peng, Yifan, Tian, Jinchuan, Chen, William, Arora, Siddhant, Yan, Brian, Sudo, Yui, Shakeel, Muhammad, Choi, Kwanghee, Shi, Jiatong, Chang, Xuankai, Jung, Jee-weon, Watanabe, Shinji
Recent studies have advocated fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. To follow Whisper's design, the previous OWSM v1 through v3 models were still based on the standard Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer-based OWSM v3.1 models at two scales, 100M and 1B. The 1B model is the largest E-Branchformer-based speech model that has been made publicly available. It outperforms the previous OWSM v3 on the vast majority of evaluation benchmarks while demonstrating up to 25% faster inference. We publicly release the data preparation scripts, pre-trained models, and training logs.
Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness
Zhang, Gongbo, Jin, Qiao, McInerney, Denis Jered, Chen, Yong, Wang, Fei, Cole, Curtis L., Yang, Qian, Wang, Yanshan, Malin, Bradley A., Peleg, Mor, Wallace, Byron C., Lu, Zhiyong, Weng, Chunhua, Peng, Yifan
Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.
Improving Fairness of Automated Chest X-ray Diagnosis by Contrastive Learning
Lin, Mingquan, Li, Tianhao, Sun, Zhaoyi, Holste, Gregory, Ding, Ying, Wang, Fei, Shih, George, Peng, Yifan
Purpose: Few studies have explored concrete methods for improving model fairness in the radiology domain. Our proposed AI model uses supervised contrastive learning to minimize bias in chest X-ray (CXR) diagnosis. Materials and Methods: In this retrospective study, we evaluated our proposed method on two datasets: the Medical Imaging and Data Resource Center (MIDRC) dataset, with 77,887 CXR images from 27,796 patients collected as of April 20, 2023 for COVID-19 diagnosis, and the NIH Chest X-ray (NIH-CXR) dataset, with 112,120 CXR images from 30,805 patients collected between 1992 and 2015. In the NIH-CXR dataset, thoracic abnormalities include atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, and hernia. Our proposed method uses supervised contrastive learning with carefully selected positive and negative samples to generate fair image embeddings, which are fine-tuned for subsequent tasks to reduce bias in CXR diagnosis. We evaluated the methods using the marginal AUC difference ($\delta$ mAUC). Results: The proposed model showed a significant decrease in bias across all subgroups compared to the baseline models, as evidenced by a paired t-test (p<0.0001). The $\delta$ mAUC values obtained by our method were 0.0116 (95% CI, 0.0110-0.0123), 0.2102 (95% CI, 0.2087-0.2118), and 0.1000 (95% CI, 0.0988-0.1011) for sex, race, and age on MIDRC, and 0.0090 (95% CI, 0.0082-0.0097) for sex and 0.0512 (95% CI, 0.0512-0.0532) for age on NIH-CXR. Conclusion: Employing supervised contrastive learning can mitigate bias in CXR diagnosis, addressing concerns about the fairness and reliability of deep learning-based diagnostic methods.
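As a rough illustration of the evaluation metric, the sketch below computes a subgroup AUC gap for one protected attribute. The grouping logic and variable names are our assumptions; the paper's exact definition of the marginal AUC difference may differ in detail.

```python
# A rough sketch of a subgroup AUC gap for one protected attribute.
# The paper's exact definition of delta mAUC may differ in detail.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc_gap(y_true, y_score, groups):
    """Largest gap between per-subgroup AUCs (e.g., groups = sex labels)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    aucs = []
    for g in np.unique(groups):
        mask = groups == g
        # AUC is only defined when both classes appear in the subgroup.
        if len(np.unique(y_true[mask])) == 2:
            aucs.append(roc_auc_score(y_true[mask], y_score[mask]))
    return max(aucs) - min(aucs)

# Example usage (names hypothetical):
# gap = subgroup_auc_gap(covid_labels, model_probs, patient_sex)
```

A smaller gap means the classifier's discriminative performance is more uniform across subgroups, which is the sense in which the abstract reports reduced bias.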
Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine
Jin, Qiao, Chen, Fangyuan, Zhou, Yiliang, Xu, Ziyang, Cheung, Justin M., Chen, Robert, Summers, Ronald M., Rousseau, Justin F., Ni, Peiyun, Landsman, Marc J, Baxter, Sally L., Al'Aref, Subhi J., Li, Yijia, Chiang, Michael F., Peng, Yifan, Lu, Zhiyong
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians on medical challenge tasks. However, these evaluations primarily focused on the accuracy of multiple-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V outperforms human physicians in multiple-choice accuracy (88.0% vs. 77.0%, p=0.034). GPT-4V also performs well in cases where physicians answer incorrectly, with over 80% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales even in cases where it makes the correct final choice (27.3%), most prominently in image comprehension (21.6%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the necessity of further in-depth evaluation of its rationales before integrating such models into clinical workflows.
Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
Maiti, Soumi, Peng, Yifan, Choi, Shukjae, Jung, Jee-weon, Chang, Xuankai, Watanabe, Shinji
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with speech intelligibility improving from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over its single-task counterparts. Further, VoxtLM is trained with publicly available data, and training recipes and model checkpoints are open-sourced to make the work fully reproducible.
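To illustrate the vocabulary design described above, here is a hedged sketch of how text subwords and discrete speech units might be merged into one token space with special tokens. The token names and ordering are assumptions for illustration, not VoxtLM's actual layout.

```python
# A hedged sketch of a joint text/speech vocabulary with special tokens;
# token names and ordering are assumptions, not VoxtLM's actual layout.

def build_joint_vocab(text_subwords: list[str], n_speech_units: int) -> dict[str, int]:
    """Merge text subwords and discrete speech units into one vocabulary."""
    specials = ["<pad>", "<s>", "</s>", "<start_speech>", "<start_text>"]
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sw in text_subwords:            # e.g., BPE subwords
        vocab[sw] = len(vocab)
    for k in range(n_speech_units):     # e.g., k-means units over SSL features
        vocab[f"<su_{k}>"] = len(vocab)
    return vocab

# A speech-recognition training example then becomes a single token stream:
# <s> <start_speech> <su_12> <su_7> ... <start_text> he llo </s>
# so one decoder-only LM can be trained on all four tasks jointly.
```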
Nonparametric Estimation via Variance-Reduced Sketching
Khoo, Yuehaw, Peng, Yifan, Wang, Daren
Nonparametric models have extensive applications across diverse fields, including biology (MacFarland et al. (2016)), economics (Ullah and Pagan (1999); Li and Racine (2023)), engineering (Lanzante (1996)), and machine learning (Hofmann et al. (2008); Schmidt-Hieber (2020)). The most representative nonparametric approaches are kernel methods, known for their numerical robustness and statistical stability in lower-dimensional settings. However, kernel methods often suffer from the curse of dimensionality in higher-dimensional spaces. Recently, a number of significant studies have tackled various modern challenges in nonparametric models. For example, Ravikumar et al. (2009), Raskutti et al. (2012), and Yuan and Zhou (2016) have studied additive models for high-dimensional nonparametric regression; Zhang et al. (2015) and Yang et al. (2017) analyzed randomized algorithms for kernel regression estimation; and Liu et al. (2007) explored nonparametric density estimation in higher dimensions. Despite these contributions, the curse of dimensionality in nonparametric problems, particularly with respect to statistical accuracy and computational efficiency, remains an open area for further exploration. In this paper, we aim to develop a new framework specifically designed for nonparametric estimation problems. Within this framework, we conceptualize functions as matrices or tensors and explore new methods for handling the bias-variance trade-off, aiming to mitigate the curse of dimensionality in higher dimensions. Matrix approximation algorithms, such as the singular value decomposition and QR decomposition, play a crucial role in computational mathematics and statistics.
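As a hedged illustration of the functions-as-matrices viewpoint (notation ours, not necessarily the paper's), a bivariate function sampled on grids becomes a matrix whose truncated SVD mirrors a low-rank functional decomposition:

```latex
% Notation ours; a schematic of the functions-as-matrices viewpoint.
% Sampling f(x, y) on grids {x_i}, {y_j} gives F_{ij} = f(x_i, y_j),
% and a rank-r truncated SVD of F,
\[
  F \approx U_r \Sigma_r V_r^{\top}, \qquad
  \Sigma_r = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
\]
% corresponds to a low-rank functional approximation
\[
  f(x, y) \approx \sum_{k=1}^{r} \sigma_k \, u_k(x) \, v_k(y),
\]
% so estimating the factors u_k, v_k, rather than f on a full grid,
% offers one way to trade bias against variance in higher dimensions.
```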
Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search
Sudo, Yui, Shakeel, Muhammad, Fukumoto, Yosuke, Peng, Yifan, Watanabe, Shinji
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since their performance is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualizable by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss with special tokens that detect the bias phrases in the input speech. In addition, to further improve contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate on the Librispeech-960 (English) dataset and the character error rate on our in-house (Japanese) dataset for the target phrases in the bias list.
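The beam-search idea can be sketched as follows: hypotheses whose tails match a prefix of a phrase in the bias list receive a score bonus so they survive pruning. This is a simplified heuristic under our own assumptions; the paper's algorithm boosts scores using the learned bias phrase index probability rather than a fixed bonus.

```python
# A simplified sketch of bias-phrase-boosted beam search. The fixed bonus
# is our assumption; the paper uses the learned bias phrase index
# probability rather than a constant.

def boosted_beam_step(hyps, next_logprobs, bias_list, bonus=2.0, beam=4):
    """One beam step; next_logprobs(tokens) returns {token: log_prob}."""
    candidates = []
    for tokens, score in hyps:
        for tok, lp in next_logprobs(tokens).items():
            new_tokens = tokens + [tok]
            new_score = score + lp
            # Boost hypotheses that are extending a bias phrase.
            if any(_matches_prefix(new_tokens, p) for p in bias_list):
                new_score += bonus
            candidates.append((new_tokens, new_score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam]

def _matches_prefix(tokens, phrase):
    """True if the hypothesis tail equals some prefix of the bias phrase."""
    for n in range(1, min(len(tokens), len(phrase)) + 1):
        if tokens[-n:] == list(phrase[:n]):
            return True
    return False
```

The bonus keeps partially matched bias phrases in the beam long enough for the full phrase to be scored, which is what allows rare terms in the bias list to be recovered at inference time.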