 biolord-2023


NDAI-NeuroMAP: A Neuroscience-Specific Embedding Model for Domain-Specific Retrieval

arXiv.org Artificial Intelligence

The exponential growth in neuroscience research output and clinical data necessitates the development of specialized natural language processing models tailored to this domain. Contemporary embedding models, while demonstrating superior performance on general-purpose benchmarks, exhibit suboptimal efficacy when applied to neuroscience-specific tasks due to their broad training objectives and limited exposure to domain-specific terminologies and conceptual relationships. This limitation significantly constrains the development of advanced applications including patient-centric retrieval-augmented generation (RAG) systems and comprehensive electronic health record (EHR) mining for neurological healthcare applications. To address this critical gap, we present NDAI-NeuroMAP, the first neuroscience-domain-specific dense vector embedding model engineered for high-precision information retrieval tasks. Our methodology encompasses the curation of an extensive domain-specific training corpus comprising 500,000 carefully constructed triplets (query-positive-negative configurations), augmented with 250,000 neuroscience-specific definitional entries and 250,000 structured knowledge-graph triplets derived from authoritative neurological ontologies. We employ a sophisticated fine-tuning approach utilizing the FremyCompany/BioLORD-2023 foundation model, implementing a multi-objective optimization framework combining contrastive learning with triplet-based metric learning paradigms. Comprehensive evaluation on a held-out test dataset comprising approximately 24,000 neuroscience-specific queries demonstrates substantial performance improvements over state-of-the-art general-purpose and biomedical embedding models. These empirical findings underscore the critical importance of domain-specific embedding architectures for neuroscience-oriented RAG systems and related clinical natural language processing applications. 
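The multi-objective fine-tuning described in this abstract (contrastive learning combined with triplet-based metric learning) can be illustrated with a toy loss on a single (query, positive, negative) triplet. This is only a minimal sketch under stated assumptions: the abstract does not give the loss formulation, so the cosine-distance hinge, the InfoNCE-style contrastive term, and the values of `margin`, `temperature`, and `alpha` are all hypothetical choices, not the authors' settings.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def triplet_loss(query, positive, negative, margin=0.2):
    # Hinge on cosine distance: the positive should be closer to the
    # query than the negative by at least `margin` (assumed value).
    d_pos = 1.0 - cosine(query, positive)
    d_neg = 1.0 - cosine(query, negative)
    return max(0.0, d_pos - d_neg + margin)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    # Contrastive (InfoNCE-style) term: cross-entropy of picking the
    # positive among positive + negatives, computed stably in log space.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

def combined_loss(query, positive, negative, alpha=0.5,
                  margin=0.2, temperature=0.05):
    # Weighted sum of the two objectives; `alpha` is an assumed weight.
    contrastive = info_nce_loss(query, positive, [negative], temperature)
    metric = triplet_loss(query, positive, negative, margin)
    return alpha * contrastive + (1.0 - alpha) * metric

# Toy 2-D "embeddings": a well-separated triplet versus a swapped one.
good = combined_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = combined_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy setting `good` is close to zero because the positive already coincides with the query, while `bad` is large because the negative is the closer vector. In an actual fine-tuning run the vectors would come from the encoder being trained, and the loss would be averaged over a batch of triplets.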
The landscape of natural language processing (NLP) has evolved profoundly over the past decade, driven by advances in neural embedding architectures. These models, which transform text into dense, high-dimensional vectors, now support diverse tasks ranging from cross-lingual translation to large-scale information retrieval. Early methods, such as the seminal Word2Vec [1] and GloVe [2], introduced static word embeddings that successfully captured semantic relationships through distributional statistics but failed to account for context, producing identical vectors for terms like "bank" regardless of meaning. Contextualized embedding architectures subsequently overcame these limitations.


BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights

arXiv.org Artificial Intelligence

In this study, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. Drawing on the wealth of the UMLS knowledge graph and harnessing cutting-edge Large Language Models, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of three steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. Through rigorous evaluations via the extensive BioLORD testing suite and diverse downstream tasks, we demonstrate consistent and substantial performance improvements over the previous state of the art (e.g. +2pts on MedSTS, +2.5pts on MedNLI-S, +6.1pts on EHR-Rel-B). Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 becoming a precious tool for future biomedical applications.
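The third step named in this abstract, a weight averaging phase, can be sketched as an element-wise average of model parameters across checkpoints. This is an illustration under assumptions rather than the authors' procedure: the abstract does not say which checkpoints are averaged or with what coefficients, so the uniform default and the parameter name `encoder.weight` below are hypothetical.

```python
def average_weights(checkpoints, coefficients=None):
    """Element-wise weighted average of parameter dictionaries.

    `checkpoints` is a list of dicts mapping a parameter name to a flat
    list of floats; all dicts are assumed to share names and sizes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if coefficients is None:
        # Uniform averaging (an assumption, not taken from the abstract).
        coefficients = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(coefficients) != len(checkpoints):
        raise ValueError("one coefficient per checkpoint is required")

    averaged = {}
    for name, reference in checkpoints[0].items():
        averaged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coefficients, checkpoints))
            for i in range(len(reference))
        ]
    return averaged

# Two toy checkpoints with a single hypothetical parameter tensor.
ckpt_a = {"encoder.weight": [1.0, 3.0]}
ckpt_b = {"encoder.weight": [3.0, 5.0]}
merged = average_weights([ckpt_a, ckpt_b])  # {'encoder.weight': [2.0, 4.0]}
```

With real models the same idea would be applied to each tensor in the models' state dictionaries, which requires the checkpoints to share an architecture.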