AITopics | Peng, Yifan

Peng, Yifan

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

Tian, Jinchuan, Peng, Yifan, Chen, William, Choi, Kwanghee, Livescu, Karen, Watanabe, Shinji

arXiv.org Artificial Intelligence Jun-13-2024

The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations staying the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.

large language model, machine learning, speech-to-text foundation model, (3 more...)

arXiv.org Artificial Intelligence

2406.09282

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.60)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
Information Technology > Artificial Intelligence > Machine Learning (0.53)

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Sudo, Yui, Shakeel, Muhammad, Fukumoto, Yosuke, Yan, Brian, Shi, Jiatong, Peng, Yifan, Watanabe, Shinji

arXiv.org Artificial Intelligence Jun-5-2024

End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four decoders (CTC, RNN-T, attention, and mask-predict) share the same encoder -- we refer to this as 4D modeling. The 4D model is trained using multitask learning, which will bring model regularization and maximize the model robustness thanks to their complementary properties. To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning. In addition, we propose three novel one-pass beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance. These three beam search algorithms differ in which decoder is used as the primary decoder. We carefully evaluate the performance and computational tradeoffs associated with each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms the E2E-ASR models trained with only one individual decoder. Furthermore, we demonstrate that the proposed one-pass beam search algorithm outperforms the previously proposed CTC/attention decoding.
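As a rough illustration of how scores from several decoders can be combined, the sketch below rescores finished hypotheses with a weighted sum of per-decoder log-probabilities. This is a simplification and not the one-pass beam search the abstract proposes; the weights, the function names `joint_score` and `rerank`, and the example scores are all hypothetical.

```python
def joint_score(scores, weights):
    """Weighted sum of per-decoder log-probabilities for one hypothesis."""
    return sum(weights[name] * scores[name] for name in weights)


def rerank(hypotheses, weights):
    """Return hypotheses sorted from best to worst joint score."""
    return sorted(
        hypotheses,
        key=lambda h: joint_score(h["scores"], weights),
        reverse=True,
    )


# Hypothetical interpolation weights (assumed, not taken from the paper).
weights = {"ctc": 0.3, "rnnt": 0.3, "attention": 0.4}

# Made-up log-probabilities that three decoders assign to two hypotheses.
hypotheses = [
    {"text": "recognize speech",
     "scores": {"ctc": -2.0, "rnnt": -2.5, "attention": -1.5}},
    {"text": "wreck a nice beach",
     "scores": {"ctc": -1.8, "rnnt": -4.0, "attention": -3.5}},
]

best = rerank(hypotheses, weights)[0]
print(best["text"])
```

Here the second hypothesis has the better CTC score on its own, but the combined score favors the first, which is the kind of complementarity a joint scheme is meant to exploit.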

artificial intelligence, decoder, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2406.0295

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Shakeel, Muhammad, Sudo, Yui, Peng, Yifan, Watanabe, Shinji

arXiv.org Artificial Intelligence May-22-2024

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% …

architecture, artificial intelligence, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

2405.13514

Country:

Asia > Japan (0.14)
North America > United States (0.14)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Contextualized Automatic Speech Recognition with Dynamic Vocabulary

Sudo, Yui, Fukumoto, Yosuke, Shakeel, Muhammad, Peng, Yifan, Watanabe, Shinji

arXiv.org Artificial Intelligence May-22-2024

Deep biasing (DB) improves the performance of end-to-end automatic speech recognition (E2E-ASR) for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary, which can result in ineffective learning of the dependencies between subwords. More advanced techniques address this problem by incorporating additional text data, which increases the overall workload. This paper proposes a dynamic vocabulary where phrase-level bias tokens can be added during the inference phase. Each bias token represents an entire bias phrase within a single token, thereby eliminating the need to learn the dependencies between the subwords within the bias phrases. This method can be applied to various architectures because it only extends the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the performance of bias phrases on English and Japanese datasets.
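The idea of representing a whole bias phrase as one token can be sketched with a plain dictionary vocabulary. This is only a toy illustration under stated assumptions: the function `extend_vocabulary`, the angle-bracket token naming, and the tiny subword vocabulary are hypothetical, and a real system would also extend the embedding and output layers as the abstract notes.

```python
def extend_vocabulary(static_vocab, bias_phrases):
    """Return a copy of the vocabulary with one new token ID per bias phrase.

    Assumes the static IDs are contiguous (0..n-1), so len(vocab) is the
    next free ID.
    """
    vocab = dict(static_vocab)  # leave the static vocabulary untouched
    for phrase in bias_phrases:
        token = f"<{phrase}>"
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


# Toy static subword vocabulary (assumed for illustration).
static_vocab = {"<blank>": 0, "wa": 1, "ta": 2, "na": 3, "be": 4}
bias_list = ["watanabe"]

dynamic_vocab = extend_vocabulary(static_vocab, bias_list)

# With the static vocabulary the phrase needs four subword tokens ...
subword_ids = [static_vocab[s] for s in ["wa", "ta", "na", "be"]]
# ... while the dynamic vocabulary covers it with a single phrase-level token.
phrase_id = dynamic_vocab["<watanabe>"]
print(subword_ids, phrase_id)
```

Because the phrase maps to one ID, a model using it would not need to learn the dependency between the four subwords, which is the motivation stated in the abstract.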

artificial intelligence, machine learning, speech recognition, (18 more...)

arXiv.org Artificial Intelligence

2405.13344

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Harnessing the power of longitudinal medical imaging for eye disease prognosis using Transformer-based sequence modeling

Holste, Gregory, Lin, Mingquan, Zhou, Ruiwen, Wang, Fei, Liu, Lei, Yan, Qi, Van Tassel, Sarah H., Kovacs, Kyle, Chew, Emily Y., Lu, Zhiyong, Wang, Zhangyang, Peng, Yifan

arXiv.org Artificial Intelligence May-14-2024

Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches only assess disease presence at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression, and forecasting the future risk of developing disease is critical to properly plan treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photography images captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.
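The abstract does not say how LTSA parameterizes time to disease, so the following is only a generic discrete-time survival sketch, not the authors' method. It assumes a model outputs a hazard for each follow-up interval (the probability of developing disease in that interval given none so far) and converts those hazards into a survival curve; `survival_curve` and the example hazards are hypothetical.

```python
def survival_curve(hazards):
    """Convert per-interval hazards h_k into survival probabilities.

    S_k is the product over j <= k of (1 - h_j): the probability of
    remaining disease-free through interval k.
    """
    curve = []
    surviving = 1.0
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve


# Made-up hazards for three consecutive follow-up intervals.
hazards = [0.1, 0.2, 0.5]
curve = survival_curve(hazards)

# Cumulative risk of disease by the end of each interval.
risk = [1.0 - s for s in curve]
print(curve, risk)
```

Under this formulation, a prognosis at any horizon is read off as one minus the survival probability at that interval.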

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2405.0878

Country: North America > United States > Texas > Travis County > Austin (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

Zhou, Yiliang, Ong, Hanley, Kennedy, Patrick, Wu, Carol, Kazam, Jacob, Hentel, Keith, Flanders, Adam, Shih, George, Peng, Yifan

arXiv.org Artificial Intelligence May-12-2024

Background: Generating radiologic findings from chest radiographs is pivotal in medical image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with vision (GPT-4V)[1], has opened new perspectives on the potential for automated image-text pair generation. However, the application of GPT-4V to real-world chest radiography is yet to be thoroughly examined. Purpose: To investigate GPT-4V's capability to generate radiologic findings from real-world chest radiographs. Materials and Methods: In this retrospective study, 100 chest radiographs with free-text radiology reports were annotated by a cohort of radiologists, two attending physicians and three residents, to establish a reference standard.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2403.15528

Country: North America > United States > Texas (0.14)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.71)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.73)

A Literature Review and Framework for Human Evaluation of Generative Large Language Models in Healthcare

Tam, Thomas Yu Chow, Sivarajkumar, Sonish, Kapoor, Sumit, Stolyar, Alisa V, Polanska, Katelyn, McCarthy, Karleigh R, Osterhoudt, Hunter, Wu, Xizhi, Visweswaran, Shyam, Fu, Sunyang, Mathur, Piyush, Cacciamani, Giovanni E., Sun, Cong, Peng, Yifan, Wang, Yanshan

arXiv.org Artificial Intelligence May-4-2024

As generative artificial intelligence (AI), particularly Large Language Models (LLMs), continues to permeate healthcare, it remains crucial to supplement traditional automated evaluations with human expert evaluation. Understanding and evaluating the generated texts is vital for ensuring safety, reliability, and effectiveness. However, the cumbersome, time-consuming, and non-standardized nature of human evaluation presents significant obstacles to the widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs within healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, spans publications from January 2018 to February 2024. This review provides a comprehensive overview of the human evaluation approaches used in diverse healthcare applications. This analysis examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types, and sizes, the selection and recruitment of evaluators, frameworks and metrics, the evaluation process, and statistical analysis of the results. Drawing from diverse evaluation strategies highlighted in these studies, we propose a comprehensive and practical framework for human evaluation of generative LLMs, named QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. This framework aims to improve the reliability, generalizability, and applicability of human evaluation of generative LLMs in different healthcare applications by defining clear evaluation dimensions and offering detailed guidelines.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2405.02559

Country:

Europe (0.67)
North America > United States > Pennsylvania (0.28)
North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Nuclear Medicine (1.00)
(9 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.48)

Uncovering Misattributed Suicide Causes through Annotation Inconsistency Detection in Death Investigation Notes

Wang, Song, Zhou, Yiliang, Han, Ziqiang, Tao, Cui, Xiao, Yunyu, Ding, Ying, Ghosh, Joydeep, Peng, Yifan

arXiv.org Artificial Intelligence Mar-29-2024

Data accuracy is essential for scientific research and policy development. The National Violent Death Reporting System (NVDRS) data is widely used for discovering the patterns and causes of death. Recent studies have suggested annotation inconsistencies within the NVDRS and their potential impact on erroneous suicide-cause attributions. We present an empirical Natural Language Processing (NLP) approach to detect annotation inconsistencies and adopt a cross-validation-like paradigm to identify problematic instances. We analyzed 267,804 suicide death incidents between 2003 and 2020 from the NVDRS. Our results showed that incorporating the target state's data into training the suicide-crisis classifier brought an increase of 5.4% to the F-1 score on the target state's test set and a decrease of 1.1% on other states' test sets. To conclude, we demonstrated the annotation inconsistencies in NVDRS's death investigation notes, identified problematic instances, evaluated the effectiveness of correcting problematic instances, and eventually proposed an NLP improvement solution.

annotation, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2403.19432

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Consumer Health (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Deep learning with noisy labels in medical prediction problems: a scoping review

Wei, Yishu, Deng, Yu, Sun, Cong, Lin, Mingquan, Jiang, Hongmei, Peng, Yifan

arXiv.org Artificial Intelligence Mar-19-2024

Objectives: Medical research faces substantial challenges from noisy labels attributed to factors like inter-expert variability and machine-extracted labels. Despite this, the adoption of label noise management remains limited, and label noise is largely ignored. To this end, there is a critical need to conduct a scoping review focusing on the problem space. This scoping review aims to comprehensively review label noise management in deep learning-based medical prediction problems, which includes label noise detection, label noise handling, and evaluation. Research involving label uncertainty is also included. Methods: Our scoping review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We searched 4 databases, including PubMed, IEEE Xplore, Google Scholar, and Semantic Scholar. Our search terms include "noisy label AND medical / healthcare / clinical", "uncertainty AND medical / healthcare / clinical", and "noise AND medical / healthcare / clinical". Results: A total of 60 papers met inclusion criteria between 2016 and 2023. A series of practical questions in medical research are investigated. These include the sources of label noise, the impact of label noise, the detection of label noise, label noise handling techniques, and their evaluation. Categorizations of both label noise detection methods and handling techniques are provided. Discussion: From a methodological perspective, we observe that the medical community has been up to date with the broader deep-learning community, given that most techniques have been evaluated on medical data. We recommend considering label noise as a standard element in medical research, even if it is not dedicated to handling noisy labels. Initial experiments can start with easy-to-implement methods, such as noise-robust loss functions, weighting, and curriculum learning.
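The abstract recommends noise-robust loss functions as an easy starting point but does not name one. As one possible example from the broader literature, the sketch below compares standard cross-entropy with the generalized cross-entropy loss, (1 - p^q) / q, for a single sample whose labeled class received probability p. The choice of this particular loss and the value q = 0.7 are assumptions made for illustration.

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for one sample: -log p(labeled class)."""
    return -math.log(p_true)


def generalized_cross_entropy(p_true, q=0.7):
    """Generalized cross-entropy: (1 - p^q) / q, with q in (0, 1].

    As q approaches 0 this approaches cross-entropy; at q = 1 it equals
    1 - p. The loss is bounded above by 1 / q.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p_true ** q) / q


# A confidently "wrong" sample (p = 0.01) may simply be mislabeled.
for p in (0.9, 0.5, 0.01):
    print(p, round(cross_entropy(p), 3), round(generalized_cross_entropy(p), 3))
```

Because the generalized loss is bounded, a sample the model strongly disagrees with contributes a limited penalty, whereas cross-entropy grows without bound as p approaches 0; that is the sense in which such a loss is more tolerant of label noise.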

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2403.13111

Country: North America > United States (0.28)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.66)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Technology (1.00)
(5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Peng, Yifan, Kulikov, Ilia, Yang, Yilin, Popuri, Sravya, Lu, Hui, Wang, Changhan, Gong, Hongyu

arXiv.org Artificial Intelligence Mar-18-2024

Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech which preserves the content's semantics but mimics the prompt's style. However, there is no systematic understanding of how the synthesized audio is controlled by the prompt and content. In this work, we conduct an empirical study of the widely used autoregressive (AR) and non-autoregressive (NAR) speech LMs and provide insights into the prompt design and content semantic units. Our analysis reveals that heterogeneous and nonstationary prompts hurt the audio quality, in contrast to the previous finding that longer prompts always lead to better synthesis. Moreover, we find that the speaker style of the synthesized audio is also affected by the content in addition to the prompt. We further show that semantic units carry rich acoustic information such as pitch, tempo, volume and speech emphasis, which might be leaked from the content to the synthesized audio.

large language model, machine learning, semantic unit, (18 more...)

arXiv.org Artificial Intelligence

2403.12402

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
