Clinical question
Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
Bogireddy, Sai Prasanna Teja Reddy, Majeedi, Abrar, Gajjala, Viswanatha Reddy, Xu, Zhuoyan, Rai, Siddhant, Potlapalli, Vaishnav
Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy's MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
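The two-stage design described here maps naturally onto DSPy modules. The sketch below is a minimal reconstruction under stated assumptions: the signature fields, the evidence-F1 metric, and the optimizer settings are illustrative, not the authors' released code.

```python
# Minimal sketch of the two-stage pipeline (assumed field names and metric).
import dspy

class IdentifyEvidence(dspy.Signature):
    """Select the clinical-note sentences needed to answer the question."""
    question: str = dspy.InputField()
    note_sentences: list[str] = dspy.InputField()
    evidence_ids: list[int] = dspy.OutputField(desc="indices of essential sentences")

class SynthesizeAnswer(dspy.Signature):
    """Write an answer that explicitly cites the selected sentences."""
    question: str = dspy.InputField()
    evidence: list[str] = dspy.InputField()
    answer: str = dspy.OutputField(desc="answer with sentence-level citations")

class EvidenceGroundedQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # Self-consistency could wrap `identify` with several sampled runs
        # and a majority vote over evidence_ids.
        self.identify = dspy.ChainOfThought(IdentifyEvidence)
        self.synthesize = dspy.ChainOfThought(SynthesizeAnswer)

    def forward(self, question, note_sentences):
        ev = self.identify(question=question, note_sentences=note_sentences)
        cited = [note_sentences[i] for i in ev.evidence_ids]
        out = self.synthesize(question=question, evidence=cited)
        return dspy.Prediction(answer=out.answer, evidence_ids=ev.evidence_ids)

def evidence_f1(example, pred, trace=None):
    """Placeholder dev metric: F1 over gold vs. predicted evidence sentences."""
    gold, hyp = set(example.evidence_ids), set(pred.evidence_ids)
    if not gold or not hyp:
        return 0.0
    p, r = len(gold & hyp) / len(hyp), len(gold & hyp) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

# After dspy.configure(lm=...), MIPROv2 jointly tunes instructions and
# few-shot demonstrations against a dev set of dspy.Example objects:
optimizer = dspy.MIPROv2(metric=evidence_f1, auto="light")
# compiled = optimizer.compile(EvidenceGroundedQA(), trainset=dev_examples)
```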
Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID
DiGiacomo, Philip, Wang, Haoyang, Fang, Jinrui, Leng, Yan, Brode, W Michael, Ding, Ying
As AI chatbots gain adoption in clinical medicine, developing effective frameworks for complex, emerging diseases presents significant challenges. We developed and evaluated six Retrieval-Augmented Generation (RAG) corpus configurations for Long COVID (LC) clinical question answering, ranging from expert-curated sources to large-scale literature databases. Our evaluation employed an LLM-as-a-judge framework across faithfulness, relevance, and comprehensiveness metrics using LongCOVID-CQ, a novel dataset of expert-generated clinical questions. Our RAG corpus configuration combining clinical guidelines with high-quality systematic reviews consistently outperformed both narrow single-guideline approaches and large-scale literature databases. Our findings suggest that for emerging diseases, retrieval grounded in curated secondary reviews provides an optimal balance between narrow consensus documents and unfiltered primary literature, supporting clinical decision-making while avoiding information overload and oversimplified guidance. We propose Guide-RAG, a chatbot system and accompanying evaluation framework that integrates both curated expert knowledge and comprehensive literature databases to effectively answer LC clinical questions.
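As a concrete illustration of the LLM-as-a-judge setup, the sketch below scores one answer on the three metrics named above; the rubric wording, the 1-5 scale, and the choice of judge model are assumptions rather than the paper's exact protocol.

```python
# Sketch of an LLM-as-a-judge pass over the three evaluation metrics.
from openai import OpenAI

client = OpenAI()
METRICS = ("faithfulness", "relevance", "comprehensiveness")

RUBRIC = (
    "You are grading a RAG answer to a Long COVID clinical question.\n"
    "Question: {q}\nRetrieved context: {ctx}\nAnswer: {a}\n"
    "Rate the answer's {metric} from 1 (poor) to 5 (excellent). "
    "Reply with the number only."
)

def judge(question: str, context: str, answer: str) -> dict[str, int]:
    scores = {}
    for metric in METRICS:
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumption: any capable judge model works here
            messages=[{"role": "user", "content": RUBRIC.format(
                q=question, ctx=context, a=answer, metric=metric)}],
            temperature=0,
        )
        # Assumes the judge complies with the "number only" instruction.
        scores[metric] = int(resp.choices[0].message.content.strip())
    return scores
```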
A Locally Executable AI System for Improving Preoperative Patient Communication: A Multi-Domain Clinical Evaluation
Sato, Motoki, Matsushita, Yuki, Takahashi, Hidekazu, Kakazu, Tomoaki, Nagata, Sou, Ohnuma, Mizuho, Yoshikawa, Atsushi, Yamamura, Masayuki
Patients awaiting invasive procedures often have unanswered pre-procedural questions; however, time-pressured workflows and privacy constraints limit personalized counseling. We present LENOHA (Low Energy, No Hallucination, Leave No One Behind Architecture), a safety-first, local-first system that routes inputs with a high-precision sentence-transformer classifier and returns verbatim answers from a clinician-curated FAQ for clinical queries, eliminating free-text generation in the clinical path. We evaluated two domains (tooth extraction and gastroscopy) using expert-reviewed validation sets (n=400/domain) for thresholding and independent test sets (n=200/domain). Among the four encoders evaluated, E5-large-instruct (560M) achieved an overall accuracy of 0.983 (95% CI 0.964-0.991) and an AUC of 0.996 with seven total errors, performance statistically indistinguishable from GPT-4o on this task; Gemini made no errors on this test set. Energy logging shows that the non-generative clinical path consumes ~1.0 mWh per input versus ~168 mWh per small-talk reply from a local 8B SLM, a ~170x difference, while maintaining ~0.10 s latency on a single on-prem GPU. These results indicate that discrimination remains near-frontier while generation-induced errors are structurally avoided in the clinical path by returning vetted FAQ answers verbatim, supporting privacy, sustainability, and equitable deployment in bandwidth-limited environments.
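The clinical-path routing is straightforward to sketch with sentence-transformers: embed the input, match it against the vetted FAQ, and return the stored answer verbatim above a similarity threshold. The checkpoint name, threshold, and FAQ entries below are illustrative assumptions (E5-instruct models also expect an instruction-prefixed query, omitted here for brevity).

```python
# Sketch of the route-then-retrieve clinical path: no generation above threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")  # assumed checkpoint
faq = [
    ("Can I eat before my gastroscopy?",
     "Please do not eat for at least six hours before the procedure ..."),
    ("Will the tooth extraction hurt?",
     "Local anaesthesia is used, so you should feel pressure but not pain ..."),
]
faq_embeddings = model.encode([q for q, _ in faq], normalize_embeddings=True)
THRESHOLD = 0.85  # illustrative; the paper tunes thresholds on validation sets

def small_talk_llm(text: str) -> str:
    """Placeholder for the local 8B SLM used off the clinical path."""
    return "..."

def respond(user_input: str) -> str:
    query = model.encode(user_input, normalize_embeddings=True)
    scores = util.cos_sim(query, faq_embeddings)[0]
    best = int(scores.argmax())
    if float(scores[best]) >= THRESHOLD:
        return faq[best][1]  # clinical path: vetted FAQ answer, verbatim
    return small_talk_llm(user_input)  # non-clinical path: local generation
```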
Leveraging Self-Supervised Learning Methods for Remote Screening of Subjects with Paroxysmal Atrial Fibrillation
Atienza, Adrian, Manimaran, Gouthamaan, Puthusserypady, Sadasivan, Dominguez, Helena, Jacobsen, Peter K., Bardram, Jakob E.
The integration of Artificial Intelligence (AI) into clinical research has great potential to reveal patterns that are difficult for humans to detect, creating impactful connections between inputs and clinical outcomes. However, these methods often require large amounts of labeled data, which can be difficult to obtain in healthcare due to strict privacy laws and the need for experts to annotate data. This requirement creates a bottleneck when investigating unexplored clinical questions. This study explores the application of Self-Supervised Learning (SSL) as a way to obtain preliminary results from clinical studies with limited-size cohorts. To assess our approach, we focus on an underexplored clinical task: screening subjects for Paroxysmal Atrial Fibrillation (P-AF) using remote monitoring, single-lead ECG signals captured during normal sinus rhythm. We evaluate state-of-the-art SSL methods alongside supervised learning approaches and find that SSL outperforms supervised learning on this task. More importantly, it avoids the misleading conclusions that poor supervised performance can produce in limited-cohort settings.
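The evaluation pattern implied here, pretrain an encoder with SSL and then fit a light classifier on frozen embeddings from the small labeled cohort, can be sketched in PyTorch as below; the toy encoder, shapes, and hyperparameters are placeholders, not the study's models.

```python
# Sketch of linear probing on a frozen SSL-pretrained ECG encoder.
import torch
import torch.nn as nn

class ECGEncoder(nn.Module):
    """Toy 1D-CNN stand-in for an SSL-pretrained single-lead ECG encoder."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):  # x: (batch, 1, samples)
        return self.net(x)

encoder = ECGEncoder()          # in practice: load SSL-pretrained weights here
encoder.requires_grad_(False)   # freeze the encoder; only the probe is trained

probe = nn.Linear(128, 2)       # P-AF vs. control
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(ecg: torch.Tensor, labels: torch.Tensor) -> float:
    """One training step of the probe on frozen embeddings."""
    with torch.no_grad():
        z = encoder(ecg)
    loss = loss_fn(probe(z), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```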
NLP-based assessment of prescription appropriateness from Italian referrals
Torri, Vittorio, Bottelli, Annamaria, Ercolanoni, Michele, Leoni, Olivia, Ieva, Francesca
Objective: This study proposes a Natural Language Processing pipeline to evaluate prescription appropriateness in Italian referrals, where reasons for prescriptions are recorded only as free text, complicating automated comparisons with guidelines. The pipeline aims to derive, for the first time, a comprehensive summary of the reasons behind these referrals and a quantification of their appropriateness. While demonstrated in a specific case study, the approach is designed to generalize to other types of examinations. Methods: Leveraging embeddings from a transformer-based model, the proposed approach clusters referral texts, maps clusters to labels, and aligns these labels with existing guidelines. We present a case study on a dataset of 496,971 referrals, consisting of all referrals for venous echocolordopplers of the lower limbs between 2019 and 2021 in the Lombardy Region. A sample of 1,000 referrals was manually annotated to validate the results. Results: On the annotated subset, the pipeline exhibited high performance for referral reasons (Prec=92.43%, Rec=83.28%) and excellent results for referral appropriateness (Prec=93.58%, Rec=91.52%). Analysis of the entire dataset identified clusters matching guideline-defined reasons - both appropriate and inappropriate - as well as clusters not addressed in the guidelines. Overall, 34.32% of referrals were marked as appropriate, 34.07% inappropriate, 14.37% likely inappropriate, and 17.24% could not be mapped to guidelines. Conclusions: The proposed pipeline effectively assessed prescription appropriateness across a large dataset, serving as a valuable tool for health authorities. Findings have informed the Lombardy Region's efforts to strengthen recommendations and reduce the burden of inappropriate referrals.
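The cluster-then-label step is easy to picture with off-the-shelf tools: embed the free-text referral reasons, cluster the embeddings, and surface exemplars near each centroid for annotators to map onto guideline labels. The embedding model, cluster count, and toy referrals below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of the embed -> cluster -> label-by-exemplar workflow.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

referrals = [
    "sospetta insufficienza venosa arti inferiori",  # suspected venous insufficiency
    "controllo post trombosi venosa profonda",       # follow-up after deep vein thrombosis
    # ... 496,971 free-text reasons in the actual study
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
X = model.encode(referrals, normalize_embeddings=True)

# The real pipeline would use many more clusters than the toy 2 here.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Surface the referrals closest to each centroid; annotators map these
# exemplars onto guideline-defined (in)appropriate reasons.
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
    print(c, [referrals[i] for i in idx[np.argsort(dists)[:1]]])
```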
RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions
Kell, Gregory, Roberts, Angus, Umansky, Serge, Khare, Yuti, Ahmed, Najma, Patel, Nikhil, Simela, Chloe, Coumbe, Jack, Rozario, Julian, Griffiths, Ryan-Rhys, Marshall, Iain J.
Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and evaluate several QA models on BioASQ and RealMedQA to gauge the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating "ideal" QA pairs. Additionally, our questions and answers exhibit lower lexical similarity than those in BioASQ, which poses an additional challenge for the top two QA models, as the results show.
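A simple way to read the lexical-similarity claim: token overlap between each question and its answer, averaged over the dataset. The sketch below uses Jaccard overlap on a made-up pair; the paper's actual similarity measure may differ.

```python
# Sketch of question-answer lexical similarity via token-level Jaccard overlap.
def jaccard(question: str, answer: str) -> float:
    q, a = set(question.lower().split()), set(answer.lower().split())
    return len(q & a) / len(q | a) if q | a else 0.0

pairs = [  # illustrative QA pair, not from RealMedQA
    ("What is the first-line treatment for hypertension in adults?",
     "Thiazide diuretics are recommended as initial therapy for most adults."),
]
mean_sim = sum(jaccard(q, a) for q, a in pairs) / len(pairs)
print(f"mean lexical overlap: {mean_sim:.3f}")
```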
A Course Shared Task on Evaluating LLM Output for Clinical Questions
Hou, Yufang, Tran, Thy Thy, Vu, Doan Nam Long, Cao, Yiwen, Li, Kai, Rohde, Lukas, Gurevych, Iryna
This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments.
Answering real-world clinical questions using large language model based systems
Low, Yen Sia, Jackson, Michael L., Hyde, Rebecca J., Brown, Robert E., Sanghavi, Neil M., Baldwin, Julian D., Pike, C. William, Muralidharan, Jananee, Hui, Gavin, Alexander, Natasha, Hassan, Hadeel, Nene, Rahul V., Pike, Morgan, Pokrzywa, Courtney J., Vedak, Shivam, Yan, Adam Paul, Yao, Dong-han, Zipursky, Amy R., Dinh, Christina, Ballentine, Philip, Derieg, Dan C., Polony, Vladimir, Chawdry, Rehan N., Davies, Jordan, Hyde, Brigham B., Shah, Nigam H., Gombar, Saurabh
Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2%-10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built RAG system for evidence summarization, working synergistically with one that generates novel evidence from RWD, would improve the availability of pertinent evidence for patient care.
PubMed and Beyond: Biomedical Literature Search in the Age of Artificial Intelligence
Jin, Qiao, Leaman, Robert, Lu, Zhiyong
Biomedical research yields a wealth of information, much of which is only accessible through the literature. Consequently, literature search is an essential tool for building on prior knowledge in clinical and biomedical research. Although recent improvements in artificial intelligence have expanded functionality beyond keyword-based search, these advances may be unfamiliar to clinicians and researchers. In response, we present a survey of literature search tools tailored to both general and specific information needs in biomedicine, with the objective of helping readers efficiently fulfill their information needs. We first examine the widely used PubMed search engine, discussing recent improvements and continued challenges. We then describe literature search tools catering to five specific information needs: (1) identifying high-quality clinical research for evidence-based medicine; (2) retrieving gene-related information for precision medicine and genomics; (3) searching by meaning, including natural language questions; (4) locating related articles with literature recommendation; and (5) mining literature to discover associations between concepts such as diseases and genetic variants. Additionally, we cover practical considerations and best practices for choosing and using these tools. Finally, we provide a perspective on the future of literature search engines, considering recent breakthroughs in large language models such as ChatGPT. In summary, our survey provides a comprehensive view of biomedical literature search functionalities with 36 publicly available tools.
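For readers who want the PubMed baseline programmatically: NCBI's public E-utilities endpoint exposes the same keyword index as the web interface. A minimal sketch, with an illustrative query:

```python
# Sketch of a programmatic PubMed keyword search via NCBI E-utilities.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "atrial fibrillation AND screening",  # illustrative query
    "retmax": 5,
    "retmode": "json",
}
resp = requests.get(ESEARCH, params=params, timeout=30)
pmids = resp.json()["esearchresult"]["idlist"]
print(pmids)  # PubMed IDs; abstracts can then be pulled with efetch.fcgi
```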
Teaching AI to ask clinical questions
Physicians often query a patient's electronic health record for information that helps them make treatment decisions, but the cumbersome nature of these records hampers the process. Research has shown that even when a doctor has been trained to use an electronic health record (EHR), finding an answer to just one question can take, on average, more than eight minutes. The more time physicians must spend navigating an oftentimes clunky EHR interface, the less time they have to interact with patients and provide treatment. Researchers have begun developing machine-learning models that can streamline the process by automatically finding information physicians need in an EHR. However, training effective models requires huge datasets of relevant medical questions, which are often hard to come by due to privacy restrictions.