Liu, Che
ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning
Tang, Yihong, Ou, Jiao, Liu, Che, Zhang, Fuzheng, Zhang, Di, Gai, Kun
Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through alignment training of a large language model (LLM) on assigned characters. Despite significant progress, role-playing agents (RPLAs) still struggle to maintain role consistency across conversations, particularly when confronted with boundary queries subtly related to character attributes. In this paper, we present ERABAL, a framework aimed at enhancing RPLAs' role-playing capabilities through boundary-aware learning. ERABAL comprises a generation pipeline for role-specific dialogues and a corresponding alignment-training method. Through comprehensive evaluations, we demonstrate that ERABAL is both efficient and effective. Trained on significantly fewer dialogues than those used by leading approaches, ERABAL achieves notable improvements over generalist baseline models across WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench. Our code and datasets will be made publicly available to support further research.
An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains
Li, Jun, Aguirre, Aaron, Moura, Junior, Liu, Che, Zhong, Lanhai, Sun, Chenxi, Clifford, Gari, Westover, Brandon, Hong, Shenda
Artificial intelligence (AI) has demonstrated significant potential in ECG analysis and cardiovascular disease assessment. Recently, foundation models have played a remarkable role in advancing medical AI, and an ECG foundation model holds the promise of elevating AI-ECG research to new heights. However, building such a model faces several challenges, including insufficient database sample sizes and inadequate generalization across multiple domains. Additionally, there is a notable performance gap between single-lead and multi-lead ECG analyses. We introduce an ECG Foundation Model (ECGFounder), a general-purpose model that leverages real-world ECG annotations from cardiology experts to broaden the diagnostic capabilities of ECG analysis. ECGFounder was trained on over 10 million ECGs with 150 label categories from the Harvard-Emory ECG Database, enabling comprehensive cardiovascular disease diagnosis from ECG signals. The model is designed both as an effective out-of-the-box solution and as a fine-tunable backbone for downstream tasks, maximizing usability. Importantly, we extend its application to lower-rank ECGs, and arbitrary single-lead ECGs in particular, so that ECGFounder can support various downstream tasks in mobile monitoring scenarios. Experimental results demonstrate that ECGFounder achieves expert-level performance on internal validation sets, with AUROC exceeding 0.95 for eighty diagnoses. It also shows strong classification performance and generalization across various diagnoses on external validation sets. When fine-tuned, ECGFounder outperforms baseline models in demographic analysis, clinical event detection, and cross-modality cardiac rhythm diagnosis. The trained model and data will be publicly released upon publication through bdsp.io. Our code is available at https://github.com/bdsp-core/ECGFounder
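The abstract describes fine-tuning the pretrained backbone for downstream tasks. Below is a minimal PyTorch sketch of that workflow, assuming a generic 1D ECG encoder; EcgEncoder and FineTuneHead are hypothetical stand-ins, not the actual ECGFounder API from the repository above.

    # Sketch only: a placeholder encoder plus a new classification head,
    # fine-tuned end-to-end on single-lead ECGs (batch, 1, samples).
    import torch
    import torch.nn as nn

    class EcgEncoder(nn.Module):
        """Hypothetical 1D-CNN backbone standing in for a pretrained ECG model."""
        def __init__(self, embed_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=15, stride=2, padding=7), nn.ReLU(),
                nn.Conv1d(64, 128, kernel_size=15, stride=2, padding=7), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, embed_dim),
            )
        def forward(self, x):
            return self.net(x)

    class FineTuneHead(nn.Module):
        def __init__(self, encoder: nn.Module, embed_dim: int, n_classes: int):
            super().__init__()
            self.encoder = encoder                 # would be loaded from pretrained weights
            self.classifier = nn.Linear(embed_dim, n_classes)
        def forward(self, x):
            return self.classifier(self.encoder(x))

    model = FineTuneHead(EcgEncoder(), embed_dim=256, n_classes=5)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()               # multi-label cardiac diagnoses

    x = torch.randn(8, 1, 5000)                    # 8 single-lead ECGs, 10 s at 500 Hz
    y = torch.randint(0, 2, (8, 5)).float()
    optim.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optim.step()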
Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?
Liu, Che, Wan, Zhongwei, Wang, Haozhe, Chen, Yinda, Qaiser, Talha, Jin, Chen, Yousefi, Fariba, Burlutskiy, Nikolay, Arcucci, Rossella
Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling zero-shot tasks for medical image understanding. However, training MedVLP models typically requires large-scale datasets with paired, high-quality image-text data, which are scarce in the medical domain. Recent advancements in Large Language Models (LLMs) and diffusion models have made it possible to generate large-scale synthetic image-text pairs. This raises the question: "Can MedVLP succeed using purely synthetic data?" To address this, we use off-the-shelf generative models to create synthetic radiology reports and paired Chest X-ray (CXR) images, and propose an automated pipeline to build a diverse, high-quality synthetic dataset, enabling a rigorous study that holds model and training settings fixed and focuses entirely on the data. Our results show that MedVLP models trained exclusively on synthetic data outperform those trained on real data by 3.8% in average AUC on zero-shot classification. Moreover, using a combination of synthetic and real data leads to a further improvement of 9.07%. Additionally, MedVLP models trained on synthetic or mixed data consistently outperform those trained on real data in zero-shot grounding, as well as in fine-tuned classification and segmentation tasks. Our analysis suggests that MedVLP trained on well-designed synthetic data can outperform models trained on real datasets, which may be limited by low-quality samples and long-tailed distributions.
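As a rough illustration of such an automated pipeline, the sketch below generates report/image pairs and keeps only those passing quality filters. The callables generate_report, generate_cxr, and passes_quality_checks are hypothetical placeholders for an off-the-shelf LLM, a text-to-image diffusion model, and the paper's filtering rules.

    # Schematic sketch, not the paper's actual pipeline.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class SyntheticPair:
        report: str
        image_path: str

    def build_synthetic_dataset(
        prompts: List[str],
        generate_report: Callable[[str], str],
        generate_cxr: Callable[[str], str],
        passes_quality_checks: Callable[[str, str], bool],
    ) -> List[SyntheticPair]:
        """Generate report/image pairs and keep only those passing quality filters."""
        dataset: List[SyntheticPair] = []
        for prompt in prompts:
            report = generate_report(prompt)      # LLM writes a synthetic radiology report
            image = generate_cxr(report)          # diffusion model renders a matching CXR
            if passes_quality_checks(report, image):
                dataset.append(SyntheticPair(report, image))
        return dataset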
Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement
Liu, Che, Wan, Zhongwei, Ouyang, Cheng, Shah, Anand, Bai, Wenjia, Arcucci, Rossella
Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) methods show promise in representation learning from unannotated ECG data, they often overlook the clinical knowledge that can be found in reports. This oversight and the requirement for annotated samples for downstream tasks limit eSSL's versatility. In this work, we address these issues with the Multimodal ECG Representation Learning (MERL) framework. Through multimodal learning on ECG records and associated reports, MERL is capable of performing zero-shot ECG classification with text prompts, eliminating the need for training data in downstream tasks. At test time, we propose the Clinical Knowledge Enhanced Prompt Engineering (CKEPE) approach, which uses Large Language Models (LLMs) to exploit external expert-verified clinical knowledge databases, generating more descriptive prompts and reducing hallucinations in LLM-generated content to boost zero-shot classification. Using MERL, we perform the first benchmark across six public ECG datasets, showing its superior performance over eSSL methods. Notably, MERL achieves an average AUC score of 75.2% in zero-shot classification (without training data), 3.2% higher than linearly probed eSSL methods with 10% annotated training data, averaged across all six datasets. Code and models are available at https://github.com/cheliu-computation/MERL
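A minimal sketch of prompt-based zero-shot ECG classification in this spirit: an ECG embedding is compared against text embeddings of class prompts in a shared space. The encoders producing the embeddings are assumed, not the released MERL models.

    # Toy illustration of zero-shot classification via embedding similarity.
    import torch
    import torch.nn.functional as F

    def zero_shot_classify(ecg_emb: torch.Tensor, prompt_embs: torch.Tensor) -> torch.Tensor:
        """ecg_emb: (d,), prompt_embs: (n_classes, d). Returns class probabilities."""
        ecg_emb = F.normalize(ecg_emb, dim=-1)
        prompt_embs = F.normalize(prompt_embs, dim=-1)
        sims = prompt_embs @ ecg_emb               # cosine similarity per class prompt
        return sims.softmax(dim=-1)

    # Prompts could be plain class names or knowledge-enriched descriptions, e.g.
    # "atrial fibrillation: irregularly irregular rhythm with absent P waves".
    ecg_emb = torch.randn(256)
    prompt_embs = torch.randn(5, 256)              # 5 candidate diagnoses
    print(zero_shot_classify(ecg_emb, prompt_embs))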
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Wan, Zhongwei, Wu, Ziang, Liu, Che, Huang, Jinfa, Zhu, Zhihong, Jin, Peng, Wang, Longyue, Yuan, Li
Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference, as the growth of their multimodal Key-Value (KV) cache with increasing input length strains both memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships, together with related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior work has addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill the model allocates more attention to text than to image features; based on this multimodal interaction pattern, we propose a text-prior method to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies based on KV pair merging. LOOK-M demonstrates that, with a significant reduction in KV cache memory usage (by 80% in some cases), it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long-context multimodal tasks.
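The following toy sketch illustrates the general idea of a text-prior compression step: keep all text-token KV pairs, keep the image-token KV pairs that receive the most attention from text tokens, and merge each evicted image KV pair into its nearest kept entry. The scoring and merging rules here are simplifications, not LOOK-M's exact algorithm.

    # Illustrative single-layer KV compression; shapes are (tokens, head_dim).
    import torch

    def compress_kv(keys, values, is_text, attn_text_to_all, keep_ratio=0.2):
        """keys/values: (T, d); is_text: (T,) bool; attn_text_to_all: (T_text, T)."""
        img_idx = (~is_text).nonzero(as_tuple=True)[0]
        txt_idx = is_text.nonzero(as_tuple=True)[0]
        # Score each image token by the total attention it receives from text tokens.
        scores = attn_text_to_all[:, img_idx].sum(dim=0)
        n_keep = max(1, int(keep_ratio * len(img_idx)))
        kept_img = img_idx[scores.topk(n_keep).indices]
        evicted = img_idx[torch.isin(img_idx, kept_img, invert=True)]
        keep_idx = torch.cat([txt_idx, kept_img]).sort().values
        new_k, new_v = keys[keep_idx].clone(), values[keep_idx].clone()
        # Compensate for eviction: average each evicted pair into its most similar kept entry.
        for i in evicted:
            sim = torch.nn.functional.cosine_similarity(keys[i].unsqueeze(0), new_k, dim=-1)
            j = sim.argmax()
            new_k[j] = 0.5 * (new_k[j] + keys[i])
            new_v[j] = 0.5 * (new_v[j] + values[i])
        return new_k, new_v

    T, d = 32, 16
    keys, values = torch.randn(T, d), torch.randn(T, d)
    is_text = torch.zeros(T, dtype=torch.bool); is_text[:8] = True
    attn = torch.rand(8, T)                        # attention from the 8 text tokens
    k_small, v_small = compress_kv(keys, values, is_text, attn)
    print(k_small.shape)                           # far fewer than 32 cached entries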
MEIT: Multi-Modal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Wan, Zhongwei, Liu, Che, Wang, Xin, Tao, Chaofan, Shen, Hui, Peng, Zhenwu, Fu, Jie, Arcucci, Rossella, Yao, Huaxiu, Zhang, Mi
The electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions from ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT's results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, and resilience to signal perturbation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.
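For intuition, an ECG record and its report can be packaged as a multimodal instruction-tuning sample roughly as below. The field names and instruction wording are hypothetical; MEIT's actual prompt template and signal-alignment module are defined in the paper.

    # Sketch of a multimodal instruction-tuning sample for report generation.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class EcgInstructionSample:
        ecg_signal: np.ndarray          # raw leads, e.g. shape (12, 5000)
        instruction: str                # task description given to the LLM
        response: str                   # target clinical report

    def make_sample(ecg_signal: np.ndarray, report: str) -> EcgInstructionSample:
        instruction = (
            "You are a cardiologist. Given the embedded ECG signal, "
            "write a concise diagnostic report."
        )
        return EcgInstructionSample(ecg_signal, instruction, report.strip())

    sample = make_sample(np.zeros((12, 5000)), "Sinus rhythm. Normal ECG.")
    print(sample.instruction)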
Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images
Liu, Che, Wan, Zhongwei, Wang, Yuqi, Shen, Hui, Wang, Haozhe, Zheng, Kangyu, Zhang, Mi, Arcucci, Rossella
Automatic radiology report generation can significantly benefit the labor-intensive process of report writing by radiologists, especially for 3D images like CT scans, which are crucial for broad clinical diagnostics yet underexplored compared to 2D radiographs. Existing methods often handle 3D volumes either slice-wise or with aggressive downsampling due to current GPU memory limitations, which results in a loss of the inherent 3D nature and critical details. To overcome these issues, we introduce a novel framework that efficiently and effectively generates radiology reports for high-resolution (HR) 3D volumes, based on large language models (LLMs). Specifically, our framework utilizes low-resolution (LR) visual tokens as queries to mine information from HR tokens, preserving detailed HR information while reducing computational costs by processing only HR-informed LR visual queries. Further benefiting the field, we curate and release BIMCV-RG, a new dataset with 5,328 HR 3D volumes and paired reports, establishing the first benchmarks for report generation from 3D HR medical images.
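A minimal sketch of the core idea: LR visual tokens act as queries that cross-attend over HR tokens, so the LLM consumes only the small set of HR-informed LR queries. The module and dimensions below are illustrative assumptions, not the paper's exact architecture.

    # Cross-attention "mining" of HR information into a compact LR query set.
    import torch
    import torch.nn as nn

    class LRQueryMiner(nn.Module):
        def __init__(self, dim: int = 512, n_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
        def forward(self, lr_tokens: torch.Tensor, hr_tokens: torch.Tensor):
            """lr_tokens: (B, N_lr, d) queries; hr_tokens: (B, N_hr, d) keys/values."""
            mined, _ = self.attn(lr_tokens, hr_tokens, hr_tokens)
            return self.norm(lr_tokens + mined)    # HR-informed LR visual queries

    miner = LRQueryMiner()
    lr = torch.randn(2, 64, 512)       # cheap LR tokens from a downsampled volume
    hr = torch.randn(2, 4096, 512)     # expensive HR tokens, never passed to the LLM
    print(miner(lr, hr).shape)         # torch.Size([2, 64, 512])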
Open-vocabulary Auditory Neural Decoding Using fMRI-prompted LLM
Chen, Xiaoyu, Du, Changde, Liu, Che, Wang, Yizhe, He, Huiguang
Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering semantic information from fMRI signals. However, many existing efforts concentrate on decoding small vocabulary sets, leaving space for the exploration of open-vocabulary continuous text decoding. In this paper, we introduce a novel method, Brain Prompt GPT (BP-GPT). By using the brain representation extracted from fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce a text-to-text baseline and align the fMRI prompt to the text prompt. This alignment allows BP-GPT to extract a more robust brain prompt and improve decoding with the pre-trained LLM. We evaluate BP-GPT on an open-source auditory semantic decoding dataset and achieve improvements of up to 4.61% in METEOR and 2.43% in BERTScore across all subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representations as prompts to drive an LLM for auditory neural decoding is feasible and effective.
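The prompting idea can be sketched as follows: fMRI features are projected into a short sequence of "brain prompt" embeddings and prepended to the text token embeddings before decoding. The projection head, prompt length, and the stand-in embedding table are assumptions; with Hugging Face transformers, the stand-in would be replaced by GPT-2's own token embeddings and the sequence passed via inputs_embeds.

    # Sketch of an fMRI-derived prefix prompt for a GPT-2-sized decoder.
    import torch
    import torch.nn as nn

    class BrainPromptEncoder(nn.Module):
        def __init__(self, fmri_dim: int, prompt_len: int = 10, hidden: int = 768):
            super().__init__()
            self.prompt_len, self.hidden = prompt_len, hidden
            self.proj = nn.Linear(fmri_dim, prompt_len * hidden)
        def forward(self, fmri: torch.Tensor) -> torch.Tensor:
            return self.proj(fmri).view(-1, self.prompt_len, self.hidden)

    fmri_feats = torch.randn(2, 1024)                 # pooled fMRI features per window
    token_ids = torch.randint(0, 50257, (2, 20))      # stimulus-text token ids
    wte = nn.Embedding(50257, 768)                    # stand-in for GPT-2's token embeddings
    brain_prompt = BrainPromptEncoder(fmri_dim=1024)(fmri_feats)
    inputs_embeds = torch.cat([brain_prompt, wte(token_ids)], dim=1)
    print(inputs_embeds.shape)                        # (2, 10 + 20, 768)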
BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval
Chen, Yinda, Liu, Che, Liu, Xiaoyu, Arcucci, Rossella, Xiong, Zhiwei
The burgeoning integration of 3D medical imaging into healthcare has led to a substantial increase in the workload of medical professionals. To assist clinicians in their diagnostic processes and alleviate their workload, the development of a robust system for retrieving similar case studies presents a viable solution. While the concept holds great promise, the field of 3D medical text-image retrieval is currently limited by the absence of robust evaluation benchmarks and curated datasets. To remedy this, our study presents a groundbreaking dataset, BIMCV-R (to be released upon acceptance), which includes an extensive collection of 8,069 3D CT volumes, encompassing over 2 million slices, paired with their respective radiological reports. Building on this dataset, we develop a retrieval strategy, MedFinder. This approach employs a dual-stream network architecture, harnessing the potential of large language models to advance medical image retrieval beyond existing text-image retrieval solutions. It marks our preliminary step towards developing a system capable of facilitating text-to-image, image-to-text, and keyword-based retrieval tasks.
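As a rough picture of dual-stream retrieval: one encoder embeds CT volumes, another embeds reports or keywords into the same space, and retrieval ranks candidates by cosine similarity. Both encoders are assumed here; the snippet below only shows the ranking step, not MedFinder itself.

    # Toy top-k retrieval over precomputed embeddings.
    import torch
    import torch.nn.functional as F

    def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
        """Rank gallery items (e.g. CT volumes) for a text query, or vice versa."""
        query_emb = F.normalize(query_emb, dim=-1)
        gallery_embs = F.normalize(gallery_embs, dim=-1)
        sims = gallery_embs @ query_emb
        return sims.topk(min(k, gallery_embs.shape[0]))

    text_emb = torch.randn(512)           # report or keyword query embedding
    volume_embs = torch.randn(100, 512)   # precomputed embeddings of 3D CT volumes
    scores, indices = retrieve(text_emb, volume_embs)
    print(indices)                        # indices of the top-5 candidate volumes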
Enhancing Role-playing Systems through Aggressive Queries: Evaluation and Improvement
Tang, Yihong, Ou, Jiao, Liu, Che, Zhang, Fuzheng, Zhang, Di, Gai, Kun
The advent of Large Language Models (LLMs) has propelled dialogue generation into new realms, particularly in the field of role-playing systems (RPSs). Although enhanced with ordinary role-relevant training dialogues, existing LLM-based RPSs still struggle to align with their roles when handling intricate trap queries in boundary scenarios. In this paper, we design the Modular ORchestrated Trap-setting Interaction SystEm (MORTISE) to benchmark and improve the performance of role-playing LLMs. MORTISE produces highly role-relevant aggressive queries through the collaborative effort of multiple LLM-based modules, and formulates corresponding responses via a consistent response generator to create an adversarial training dataset. We select 190 Chinese and English roles and construct aggressive queries to benchmark existing role-playing LLMs. Through comprehensive evaluation, we find that existing models exhibit a general deficiency in role alignment capabilities. We further select 180 of the roles to collect an adversarial training dataset (named RoleAD) and retain the other 10 roles for testing. Experiments on models improved with RoleAD indicate that our adversarial dataset ameliorates this deficiency, and the improvements demonstrate a degree of generalizability to ordinary scenarios.
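The overall data-construction loop can be sketched as below: LLM-based modules propose role-relevant trap queries, off-profile queries are filtered out, and a consistent response generator writes in-character answers. The three callables are hypothetical placeholders for the system's modules, not MORTISE's actual implementation.

    # Schematic adversarial-dialogue collection loop.
    from typing import Callable, Dict, List

    def build_adversarial_dialogues(
        roles: List[Dict[str, str]],
        propose_trap_query: Callable[[Dict[str, str]], str],
        generate_consistent_response: Callable[[Dict[str, str], str], str],
        is_role_relevant: Callable[[Dict[str, str], str], bool],
        proposals_per_role: int = 10,
    ) -> List[Dict[str, str]]:
        dataset: List[Dict[str, str]] = []
        for role in roles:
            for _ in range(proposals_per_role):
                query = propose_trap_query(role)           # aggressive boundary query
                if not is_role_relevant(role, query):      # drop queries off the role profile
                    continue
                response = generate_consistent_response(role, query)
                dataset.append({"role": role["name"], "query": query, "response": response})
        return dataset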