Combining Domain-Specific Models and LLMs for Automated Disease Phenotyping from Survey Data

Beeri, Gal; Chamot, Benoit; Latchem, Elena; Venkatesh, Shruthi; Whalan, Sarah; Kruger, Van Zyl; Martino, David

arXiv.org Artificial Intelligence 

Funding and support: The Generative AI Challenge is funded by grants from the Future Health Research and Innovation Fund (FHRIF), Grant ID IC2023-GAIA/11.

Conflict of interest statement: The authors declare no conflicts of interest.

Abstract

This exploratory pilot study investigated the potential of combining a domain-specific model, BERN2, with large language models (LLMs) to enhance automated disease phenotyping from research survey data. Motivated by the need for efficient and accurate methods to harmonize the growing volume of survey data with standardized disease ontologies, we employed BERN2, a biomedical named entity recognition and normalization model, to extract disease information from the ORIGINS birth cohort survey data. After rigorously evaluating BERN2's performance against a manually curated ground truth dataset, we integrated various LLMs using prompt engineering, Retrieval-Augmented Generation (RAG), and Instructional Fine-Tuning (IFT) to refine the model's outputs. BERN2 demonstrated high performance in extracting and normalizing disease mentions, and the integration of LLMs, particularly with Few Shot Inference and RAG orchestration, further improved accuracy. This approach, especially when incorporating structured examples, logical reasoning prompts, and detailed context, offers a promising avenue for developing tools to enable efficient cohort profiling and data harmonization across large, heterogeneous research datasets.

Introduction

The increasing availability of research survey data from cohort studies and clinical trials offers unprecedented opportunities to advance biomedical research and improve healthcare (1).
