Cai, Xiaoyan
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
Li, Bin, Gao, Dehong, Wang, Yeyuan, Jin, Linbo, Yu, Shanqing, Cai, Xiaoyan, Yang, Libin
Despite the significant success of Large Vision-Language models(LVLMs), these models still suffer hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention(IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning
Ma, Yufei, Liang, Zihan, Dai, Huangyu, Chen, Ben, Gao, Dehong, Ran, Zhuoran, Zihan, Wang, Jin, Linbo, Jiang, Wen, Zhang, Guannan, Cai, Xiaoyan, Yang, Libin
The growing demand for larger-scale models in the development of \textbf{L}arge \textbf{L}anguage \textbf{M}odels (LLMs) poses challenges for efficient training within limited computational resources. Traditional fine-tuning methods often exhibit instability in multi-task learning and rely heavily on extensive training resources. Here, we propose MoDULA (\textbf{M}ixture \textbf{o}f \textbf{D}omain-Specific and \textbf{U}niversal \textbf{L}oR\textbf{A}), a novel \textbf{P}arameter \textbf{E}fficient \textbf{F}ine-\textbf{T}uning (PEFT) \textbf{M}ixture-\textbf{o}f-\textbf{E}xpert (MoE) paradigm for improved fine-tuning and parameter efficiency in multi-task learning. The paradigm effectively improves the multi-task capability of the model by training universal experts, domain-specific experts, and routers separately. MoDULA-Res is a new method within the MoDULA paradigm, which maintains the model's general capability by connecting universal and task-specific experts through residual connections. The experimental results demonstrate that the overall performance of the MoDULA-Flan and MoDULA-Res methods surpasses that of existing fine-tuning methods on various LLMs. Notably, MoDULA-Res achieves more significant performance improvements in multiple tasks while reducing training costs by over 80\% without losing general capability. Moreover, MoDULA displays flexible pluggability, allowing for the efficient addition of new tasks without retraining existing experts from scratch. This progressive training paradigm circumvents data balancing issues, enhancing training efficiency and model stability. Overall, MoDULA provides a scalable, cost-effective solution for fine-tuning LLMs with enhanced parameter efficiency and generalization capability.
General2Specialized LLMs Translation for E-commerce
Chen, Kaidi, Chen, Ben, Gao, Dehong, Dai, Huangyu, Jiang, Wen, Ning, Wei, Yu, Shanqing, Yang, Libin, Cai, Xiaoyan
Existing Neural Machine Translation (NMT) models mainly handle translation in the general domain, while overlooking domains with special writing formulas, such as e-commerce and legal documents. Taking e-commerce as an example, the texts usually include amounts of domain-related words and have more grammar problems, which leads to inferior performances of current NMT methods. To address these problems, we collect two domain-related resources, including a set of term pairs (aligned Chinese-English bilingual terms) and a parallel corpus annotated for the e-commerce domain. Furthermore, we propose a two-step fine-tuning paradigm (named G2ST) with self-contrastive semantic enhancement to transfer one general NMT model to the specialized NMT model for e-commerce. The paradigm can be used for the NMT models based on Large language models (LLMs). Extensive evaluations on real e-commerce titles demonstrate the superior translation quality and robustness of our G2ST approach, as compared with state-of-the-art NMT models such as LLaMA, Qwen, GPT-3.5, and even GPT-4.
Holistic Evaluation of GPT-4V for Biomedical Imaging
Liu, Zhengliang, Jiang, Hanqi, Zhong, Tianyang, Wu, Zihao, Ma, Chong, Li, Yiwei, Yu, Xiaowei, Zhang, Yutong, Pan, Yi, Shu, Peng, Lyu, Yanjun, Zhang, Lu, Yao, Junjie, Dong, Peixin, Cao, Chao, Xiao, Zhenxiang, Wang, Jiaqi, Zhao, Huan, Xu, Shaochen, Wei, Yaonai, Chen, Jingyuan, Dai, Haixing, Wang, Peilong, He, Hao, Wang, Zewei, Wang, Xinyu, Zhang, Xu, Zhao, Lin, Liu, Yiheng, Zhang, Kai, Yan, Liheng, Sun, Lichao, Liu, Jun, Qiang, Ning, Ge, Bao, Cai, Xiaoyan, Zhao, Shijie, Hu, Xintao, Yuan, Yixuan, Li, Gang, Zhang, Shu, Zhang, Xin, Jiang, Xi, Zhang, Tuo, Shen, Dinggang, Li, Quanzheng, Liu, Wei, Li, Xiang, Zhu, Dajiang, Liu, Tianming
In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data
Zhong, Tianyang, Zhao, Wei, Zhang, Yutong, Pan, Yi, Dong, Peixin, Jiang, Zuowei, Kui, Xiaoyan, Shang, Youlan, Yang, Li, Wei, Yaonai, Yang, Longtao, Chen, Hao, Zhao, Huan, Liu, Yuxiao, Zhu, Ning, Li, Yiwei, Wang, Yisong, Yao, Jiaqi, Wang, Jiaqi, Zeng, Ying, He, Lei, Zheng, Chao, Zhang, Zhixue, Li, Ming, Liu, Zhengliang, Dai, Haixing, Wu, Zihao, Zhang, Lu, Zhang, Shu, Cai, Xiaoyan, Hu, Xintao, Zhao, Shijie, Jiang, Xi, Zhang, Xin, Li, Xiang, Zhu, Dajiang, Guo, Lei, Shen, Dinggang, Han, Junwei, Liu, Tianming, Liu, Jun, Zhang, Tuo
Radiology report generation, as a key step in medical image analysis, is critical to the quantitative analysis of clinically informed decision-making levels. However, complex and diverse radiology reports with cross-source heterogeneity pose a huge generalizability challenge to the current methods under massive data volume, mainly because the style and normativity of radiology reports are obviously distinctive among institutions, body regions inspected and radiologists. Recently, the advent of large language models (LLM) offers great potential for recognizing signs of health conditions. To resolve the above problem, we collaborate with the Second Xiangya Hospital in China and propose ChatRadio-Valuer based on the LLM, a tailored model for automatic radiology report generation that learns generalizable representations and provides a basis pattern for model adaptation in sophisticated analysts' cases. Specifically, ChatRadio-Valuer is trained based on the radiology reports from a single institution by means of supervised fine-tuning, and then adapted to disease diagnosis tasks for human multi-system evaluation (i.e., chest, abdomen, muscle-skeleton, head, and maxillofacial $\&$ neck) from six different institutions in clinical-level events. The clinical dataset utilized in this study encompasses a remarkable total of \textbf{332,673} observations. From the comprehensive results on engineering indicators, clinical efficacy and deployment cost metrics, it can be shown that ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al., in terms of the diseases diagnosis from radiology reports. ChatRadio-Valuer provides an effective avenue to boost model generalization performance and alleviate the annotation workload of experts to enable the promotion of clinical AI applications in radiology reports.
Evaluating Large Language Models for Radiology Natural Language Processing
Liu, Zhengliang, Zhong, Tianyang, Li, Yiwei, Zhang, Yutong, Pan, Yi, Zhao, Zihao, Dong, Peixin, Cao, Chao, Liu, Yuxiao, Shu, Peng, Wei, Yaonai, Wu, Zihao, Ma, Chong, Wang, Jiaqi, Wang, Sheng, Zhou, Mengyue, Jiang, Zuowei, Li, Chunlin, Holmes, Jason, Xu, Shaochen, Zhang, Lu, Dai, Haixing, Zhang, Kai, Zhao, Lin, Chen, Yuanhao, Liu, Xu, Wang, Peilong, Yan, Pingkun, Liu, Jun, Ge, Bao, Sun, Lichao, Zhu, Dajiang, Li, Xiang, Liu, Wei, Cai, Xiaoyan, Hu, Xintao, Jiang, Xi, Zhang, Shu, Zhang, Xin, Zhang, Tuo, Zhao, Shijie, Li, Quanzheng, Zhu, Hongtu, Shen, Dinggang, Liu, Tianming
The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.
ImpressionGPT: An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT
Ma, Chong, Wu, Zihao, Wang, Jiaqi, Xu, Shaochen, Wei, Yaonai, Liu, Zhengliang, Jiang, Xi, Guo, Lei, Cai, Xiaoyan, Zhang, Shu, Zhang, Tuo, Zhu, Dajiang, Shen, Dinggang, Liu, Tianming, Li, Xiang
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians, and it is typically written by radiologists based on the 'Findings' section. However, writing numerous impressions can be laborious and error-prone for radiologists. Although recent studies have achieved promising results in automatic impression generation using large-scale medical text data for pre-training and fine-tuning pre-trained language models, such models often require substantial amounts of medical text data and have poor generalization performance. While large language models (LLMs) like ChatGPT have shown strong generalization capabilities and performance, their performance in specific domains, such as radiology, remains under-investigated and potentially limited. To address this limitation, we propose ImpressionGPT, which leverages the in-context learning capability of LLMs by constructing dynamic contexts using domain-specific, individualized data. This dynamic prompt approach enables the model to learn contextual knowledge from semantically similar examples from existing data. Additionally, we design an iterative optimization algorithm that performs automatic evaluation on the generated impression results and composes the corresponding instruction prompts to further optimize the model. The proposed ImpressionGPT model achieves state-of-the-art performance on both MIMIC-CXR and OpenI datasets without requiring additional training data or fine-tuning the LLMs. This work presents a paradigm for localizing LLMs that can be applied in a wide range of similar application scenarios, bridging the gap between general-purpose LLMs and the specific language processing needs of various domains.
Generative Adversarial Network Based Heterogeneous Bibliographic Network Representation for Personalized Citation Recommendation
Cai, Xiaoyan (School of Automation, Northwestern Polytechnical University) | Han, Junwei (School of Automation, Northwestern Polytechnical University) | Yang, Libin (School of Automation, Northwestern Polytechnical University)
Network representation has been recently exploited for many applications, such as citation recommendation, multi-label classification and link prediction. It learns low-dimensional vector representation for each vertex in networks. Existing network representation methods only focus on incomplete aspects of vertex information (i.e., vertex content, network structure or partial integration), moreover they are commonly designed for homogeneous information networks where all the vertices of a network are of the same type. In this paper, we propose a deep network representation model that integrates network structure and the vertex content information into a unified framework by exploiting generative adversarial network, and represents different types of vertices in the heterogeneous network in a continuous and common vector space. Based on the proposed model, we can obtain heterogeneous bibliographic network representation for efficient citation recommendation. The proposed model also makes personalized citation recommendation possible, which is a new issue that a few papers addressed in the past. When evaluated on the AAN and DBLP datasets, the performance of the proposed heterogeneous bibliographic network based citation recommendation approach is comparable with that of the other network representation based citation recommendation approaches. The results also demonstrate that the personalized citation recommendation approach is more effective than the non-personalized citation recommendation approach.