A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu

arXiv.org Artificial Intelligence 

Keywords: Clinical information extraction; Large language model; Clinical concept extraction; Clinical relation extraction; Instruction tuning

ABSTRACT

Background: Natural language processing (NLP) is a key technology for extracting important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs' effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems.

Methods: This study explores key aspects of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only versus decoder-only LLM architectures, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) the effect of multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning with prompt-based PEFT, and explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate zero-shot and few-shot learning performance using a leave-one-dataset-out strategy.

Results: For single-task clinical concept extraction, the two decoder-based LLMs (Llama 3.1 and GatorTronLlama) achieved the best performance, with average F1 scores of 0.8964 and 0.8981, respectively, across the five datasets, outperforming the other LLMs by 0.7~3.3% in average F1. Encoder-based LLMs with prompt-based learning outperformed those implemented with classification heads.
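The leave-one-dataset-out strategy described in the Methods can be sketched as follows. This is a minimal illustration of the split logic only, not the authors' actual code: the dataset names and the helper function are hypothetical placeholders, and the training/evaluation steps are omitted.

```python
# Hedged sketch of leave-one-dataset-out evaluation: for each of the four
# instruction-tuning datasets, train on the other three and hold the
# remaining one out for zero-/few-shot evaluation.
# Dataset names below are illustrative placeholders, not the study's datasets.
datasets = ["dataset_a", "dataset_b", "dataset_c", "dataset_d"]

def leave_one_dataset_out(names):
    """Yield (train_sets, held_out) pairs, holding out each dataset once."""
    for held_out in names:
        train_sets = [d for d in names if d != held_out]
        yield train_sets, held_out

splits = list(leave_one_dataset_out(datasets))
for train_sets, held_out in splits:
    # In the study, a multi-task instruction-tuned model would be trained
    # on `train_sets` here, then scored zero-/few-shot on `held_out`.
    print(f"train on {train_sets}, evaluate on {held_out}")
```

With four datasets this yields four train/evaluate rounds, so every dataset serves once as the unseen target, which is what makes the zero-shot and few-shot comparison across datasets possible.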