Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

Li, Wenhao, Zhang, Hongkuan, Zhang, Hongwei, Li, Zhengxu, Dong, Zengjie, Chen, Yafan, Bidargaddi, Niranjan, Liu, Hong

arXiv.org Artificial Intelligence

Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnoses from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation (RAG) approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content rather than relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.

The research reported in this paper is financially supported by the National Natural Science Foundation of China (62276156), the Shandong Provincial Natural Science Foundation (ZR2024LZH005), the Taishan Scholar Program of Shandong Province of China (No. tsqnz20240809), and the Excellent Youth Foundation of Shandong Natural Science Foundation (2024HWYQ-055). Wenhao Li is with Shandong Normal University, Jinan, China, 250358 (email: lwh@sdnu.edu.cn). Hongkuan Zhang is with Shandong Normal University, Jinan, China, 250358 (email: 2024217028@stu.sdnu.edu.cn).

In the healthcare sector, language models and related tools, such as ChatGPT and ClinicalBERT, have been increasingly applied across multiple scenarios, including disease prediction, clinical decision support, patient interaction, drug discovery, and personalized medicine, significantly driving innovation and transformation in medical technology [1, 2]. As a fundamental task in healthcare, disease diagnosis refers to the process by which health professionals identify the most likely disease or disorder causing a patient's symptoms [3].
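To make the three-step pipeline concrete, the following minimal sketch illustrates steps (1) and (2): building a query from an LLM prediction plus EHR fields, and ranking guideline snippets by embedding similarity. The encoder, query template, and toy CPG snippets are assumptions for illustration; the paper's actual choices may differ.

import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence encoder works here; this model name is an illustrative stand-in.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy clinical practice guideline (CPG) snippets standing in for a real corpus.
cpg_snippets = [
    "Confirm hypertension when office BP is >= 140/90 mmHg on repeated visits.",
    "Offer ambulatory BP monitoring to confirm a diagnosis of hypertension.",
    "Assess cardiovascular risk factors in all patients with elevated BP.",
]
cpg_emb = model.encode(cpg_snippets, normalize_embeddings=True)

def build_query(llm_prediction, ehr_fields):
    # Step (1): fuse the model's predicted diagnosis with salient EHR data.
    facts = "; ".join(f"{k}: {v}" for k, v in ehr_fields.items())
    return f"Suspected diagnosis: {llm_prediction}. Patient data: {facts}"

def retrieve(query, k=2):
    # Step (2): rank guideline snippets by cosine similarity to the query.
    q = model.encode([query], normalize_embeddings=True)
    scores = (cpg_emb @ q.T).ravel()  # dot product of unit vectors = cosine
    return [cpg_snippets[i] for i in np.argsort(-scores)[:k]]

query = build_query("stage 1 hypertension",
                    {"office BP": "146/92 mmHg", "age": 58})
for snippet in retrieve(query):
    print(snippet)

In a full system, the top-ranked snippets would then be fused with the model output (step 3) to produce the guideline-grounded recommendation.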


Reformulate, Retrieve, Localize: Agents for Repository-Level Bug Localization

Caumartin, Genevieve, Melo, Glaucia

arXiv.org Artificial Intelligence

Bug localization remains a critical yet time-consuming challenge in large-scale software repositories. Traditional information retrieval-based bug localization (IRBL) methods rely on unmodified bug descriptions, which often contain noisy information, leading to poor retrieval accuracy. Recent advances in large language models (LLMs) have improved bug localization through query reformulation, yet the effect of reformulation on agent performance remains unexplored. In this study, we investigate how an LLM-powered agent can improve file-level bug localization via lightweight query reformulation and summarization. We first employ an open-source, non-fine-tuned LLM to extract key information from bug reports, such as identifiers and code snippets, and reformulate queries before retrieval. Our agent then orchestrates BM25 retrieval using these preprocessed queries, automating the localization workflow at scale. Using the best-performing query reformulation technique, our agent achieves 35% better ranking in first-file retrieval than our BM25 baseline and up to +22% file retrieval performance over SWE-agent.
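To illustrate the retrieval stage, here is a minimal sketch in which a reformulated query (the kind of identifier-rich text the LLM preprocessing step is described as producing) is scored against a toy repository with BM25. The corpus, tokenizer, and example query are invented for illustration, not taken from the paper.

from rank_bm25 import BM25Okapi

# Toy repository standing in for file contents indexed at retrieval time.
repo_files = {
    "src/parser.py": "def parse_config(path): raise ValueError('bad config')",
    "src/cli.py": "def main(): parse_config(args.path)",
    "src/utils.py": "def slugify(text): return text.lower()",
}

def tokenize(text):
    # Deliberately crude tokenizer; real pipelines also split identifiers.
    return text.replace("(", " ").replace(")", " ").replace(":", " ").split()

paths = list(repo_files)
bm25 = BM25Okapi([tokenize(repo_files[p]) for p in paths])

# Raw report: "App crashes with a weird error when I load my settings file!!"
# The reformulated query keeps only retrieval-friendly terms (an assumed
# output of the LLM preprocessing step).
reformulated = "parse_config ValueError bad config"

scores = bm25.get_scores(tokenize(reformulated))
for path, score in sorted(zip(paths, scores), key=lambda x: -x[1]):
    print(f"{score:6.3f}  {path}")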


Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

Vasu, Sai Suresh Macharla, Sheth, Ivaxi, Wang, Hui-Po, Binkyte, Ruta, Fritz, Mario

arXiv.org Artificial Intelligence

The adoption of large language models (LLMs) is transforming the peer review process, from assisting reviewers in writing more detailed evaluations to generating entire reviews automatically. While these capabilities offer exciting opportunities, they also raise critical concerns about fairness and reliability. In this paper, we investigate bias in LLM-generated peer reviews by conducting controlled experiments on sensitive metadata, including author affiliation and gender. Our analysis consistently shows affiliation bias favoring institutions highly ranked on common academic rankings. Additionally, we find some gender preferences, which, even though subtle in magnitude, have the potential to compound over time. Notably, we uncover implicit biases that become more evident with token-based soft ratings.
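To illustrate what a token-based soft rating can look like, the sketch below computes an expected score from the model's token probabilities at the rating position rather than taking the single emitted token. The log-probabilities are invented placeholders; the paper's exact scoring procedure may differ.

import math

# Hypothetical logprobs for the token position where the review score appears.
logprobs = {"3": -2.1, "4": -1.2, "5": -0.9, "6": -1.6, "7": -2.8}

probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
total = sum(probs.values())  # renormalise over the candidate score tokens

soft_rating = sum(int(tok) * p / total for tok, p in probs.items())
hard_rating = max(probs, key=probs.get)

print(f"hard rating: {hard_rating}, soft rating: {soft_rating:.2f}")

Small, systematic shifts in the soft rating across metadata variants can reveal biases that the discrete rating hides.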


Discovering the Underlying Analytic Structure Within Standard Model Constants Using Artificial Intelligence

Chekanov, S. V., Kjellerstrand, H.

arXiv.org Artificial Intelligence

This paper presents a method for uncovering hidden analytic relationships among the fundamental parameters of the Standard Model (SM), a foundational theory in physics that describes the fundamental particles and their interactions, using symbolic regression and genetic programming. With this approach, we identify the simplest analytic relationships connecting pairs of these constants and report several notable expressions obtained with relative precision better than 1%. These results may serve as valuable inputs for model builders and for artificial intelligence methods aimed at uncovering hidden patterns among the SM constants, and could act as building blocks for a deeper underlying law that connects all parameters of the SM through a small set of fundamental constants.
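To illustrate the flavor of such a search (though the paper uses symbolic regression with genetic programming rather than brute-force enumeration), the sketch below enumerates simple expressions over a couple of constants and keeps those matching a target to better than 1% relative precision; the classic numerical coincidence $m_p/m_e \approx 6\pi^5$ serves as the toy target.

import itertools
import math

target_name, target = "mp_me", 1836.15267343  # proton/electron mass ratio
base = {"pi": math.pi, "e": math.e}

unary = {"sqrt": math.sqrt, "square": lambda x: x**2,
         "cube": lambda x: x**3, "pow5": lambda x: x**5}
binary = {"*": lambda c, x: c * x, "/": lambda c, x: c / x}

# Enumerate candidates of the form  c op f(constant)  for small integers c.
for (name, val), (f_name, f), (g_name, g), c in itertools.product(
        base.items(), unary.items(), binary.items(), range(1, 10)):
    candidate = g(c, f(val))
    rel_err = abs(candidate - target) / target
    if rel_err < 0.01:  # better than 1% relative precision
        print(f"{target_name} ~ {c} {g_name} {f_name}({name})"
              f"  [rel. err. {rel_err:.2e}]")

Genetic programming replaces this exhaustive loop with an evolutionary search over much larger expression spaces, but the acceptance criterion is the same relative-precision threshold.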


Chain of Unit-Physics: A Primitive-Centric Approach to Scientific Code Synthesis

Sharma, Vansh, Raman, Venkat

arXiv.org Artificial Intelligence

Agentic large language models have been proposed as autonomous code generators for scientific computing, yet their reliability in high-stakes problems remains unclear. Developing computational scientific software from natural-language queries remains broadly challenging due to (a) sparse representation of domain codes during training and (b) the limited feasibility of RLHF with a small expert community. To address these limitations, this work conceptualizes an inverse approach to code design, embodied in the Chain of Unit-Physics framework: a first-principles (or primitives)-centric, multi-agent system in which human expert knowledge is encoded as unit-physics tests that explicitly constrain code generation. The framework is evaluated on a nontrivial combustion task, used here as a representative benchmark for scientific problems with realistic physical constraints. Closed-weight systems and code-focused agentic variants fail to produce correct end-to-end solvers, despite tool and web access, exhibiting four recurrent error classes: interface (syntax/API) hallucinations, overconfident assumptions, numerical/physical incoherence, and configuration fragility. Open-weight models with chain-of-thought (CoT) decoding reduce interface errors but still yield incorrect solutions. On the benchmark task, the proposed framework converges within 5-6 iterations and matches the human-expert implementation (mean error of $3.1\times10^{-3}$%), with $\sim$33.4% faster runtime and $\sim$30% lower memory usage at a cost comparable to mid-sized commercial APIs, yielding a practical template for physics-grounded scientific code generation. As datasets and models evolve, zero-shot code accuracy will improve; however, the Chain of Unit-Physics framework goes further by embedding the first-principles analysis that is foundational to scientific codes.
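To illustrate the core loop, here is a minimal sketch of a generate-test-refine cycle in which expert constraints are encoded as unit-physics tests and candidate code is accepted only when all tests pass. The stub generator and toy combustion-style tests are assumptions; the paper's multi-agent system is far richer.

from typing import Callable

def unit_physics_tests(solver: Callable[[float, float], float]) -> list:
    # Each test encodes a first-principles constraint, not just syntax.
    failures = []
    # Energy balance: flame temperature must exceed the unburnt temperature.
    if not solver(300.0, 1.0) > 300.0:
        failures.append("flame temperature must exceed unburnt temperature")
    # Physical coherence: leaner mixtures (lower equivalence ratio) burn cooler.
    if not solver(300.0, 0.6) < solver(300.0, 1.0):
        failures.append("lean mixtures should yield lower flame temperature")
    return failures

def llm_generate(feedback: list) -> Callable[[float, float], float]:
    # Hypothetical stub: a real system would prompt the LLM agents with the
    # accumulated test feedback and return freshly generated solver code.
    return lambda t_in, phi: t_in + 1900.0 * min(phi, 2.0 - phi)

solver, feedback = None, []
for iteration in range(6):  # the paper reports convergence in 5-6 iterations
    solver = llm_generate(feedback)
    feedback = unit_physics_tests(solver)
    if not feedback:
        print(f"accepted at iteration {iteration + 1}")
        break
    print(f"iteration {iteration + 1}: {feedback}")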


Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking

Karkar, Chinmay, Chopra, Paras

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate partial forecasting competence across social, political, and economic events. Yet their predictive ability varies sharply with domain structure and prompt framing. We investigate how forecasting performance varies across model families on real-world questions about events that occurred after the model cutoff date. We analyze how context, question type, and external knowledge affect accuracy and calibration, and how adding factual news context modifies belief formation and failure modes. Our results show that forecasting ability is highly variable, as it depends on what, and how, we ask.
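To illustrate how such forecasts can be scored, the sketch below computes a Brier score and a simple calibration check over a handful of resolved questions; all numbers are invented for illustration.

forecasts = [  # (model probability of "yes", realised outcome)
    (0.80, 1), (0.65, 1), (0.70, 0), (0.30, 0), (0.55, 1), (0.20, 0),
]

# Brier score: mean squared error of the probabilities (lower is better).
brier = sum((p - y) ** 2 for p, y in forecasts) / len(forecasts)

# Simple calibration check: within a probability bucket, does the empirical
# frequency of "yes" match the average stated probability?
bucket = [(p, y) for p, y in forecasts if 0.5 <= p < 0.9]
avg_conf = sum(p for p, _ in bucket) / len(bucket)
hit_rate = sum(y for _, y in bucket) / len(bucket)

print(f"Brier score: {brier:.3f}")
print(f"bucket [0.5, 0.9): stated {avg_conf:.2f} vs realised {hit_rate:.2f}")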


A ROS2 Interface for Universal Robots Collaborative Manipulators Based on ur_rtde

Saccuti, Alessio, Monica, Riccardo, Aleotti, Jacopo

arXiv.org Artificial Intelligence

The Universal Robots RTDE communication interface is well known in the literature and has been used in several works. In [5] and [6], RTDE was adopted to control UR cobots. In [7], [8], and [9], the RTDE interface was used only for data acquisition. To facilitate the development of external applications for UR cobots, various higher-level software interfaces and drivers have been proposed based on RTDE. In addition to the official software interface by Universal Robots (ur_client_library), a few alternatives have been developed by third parties. One of these software interfaces is ur_rtde [4] by SDU Robotics, which was used in this work. Another similar interface is python-urx [10], a Python interface for tasks that do not require a high control frequency.
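For context, a minimal usage sketch of the ur_rtde Python bindings follows: one object streams robot state while another sends motion commands. The robot IP address and joint targets are placeholders.

from rtde_control import RTDEControlInterface
from rtde_receive import RTDEReceiveInterface

ROBOT_IP = "192.168.1.10"  # placeholder address of the UR controller

rtde_c = RTDEControlInterface(ROBOT_IP)
rtde_r = RTDEReceiveInterface(ROBOT_IP)

# Read the current joint configuration (radians).
q_now = rtde_r.getActualQ()
print("current joints:", q_now)

# Command a small joint-space move (speed in rad/s, acceleration in rad/s^2).
q_target = list(q_now)
q_target[5] += 0.1  # rotate the last wrist joint slightly
rtde_c.moveJ(q_target, 0.5, 0.3)

rtde_c.stopScript()  # release control of the robot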


When concept-based XAI is imprecise: Do people distinguish between generalisations and misrepresentations?

Müller, Romy

arXiv.org Artificial Intelligence

Concept-based explainable artificial intelligence (C-XAI) can let people see which representations an AI model has learned. This is particularly important when high-level semantic information (e.g., actions and relations) is used to make decisions about abstract categories (e.g., danger). In such tasks, AI models need to generalise beyond situation-specific details, and this ability can be reflected in C-XAI outputs that randomise over irrelevant features. However, it is unclear whether people appreciate such generalisation and can distinguish it from other, less desirable forms of imprecision in C-XAI outputs. Therefore, the present study investigated how the generality and relevance of C-XAI outputs affect people's evaluation of AI. In an experimental railway safety evaluation scenario, participants rated the performance of a simulated AI that classified traffic scenes involving people as dangerous or not. These classification decisions were explained via concepts in the form of similar image snippets. The latter differed in their match with the classified image, either regarding a highly relevant feature (i.e., people's relation to tracks) or a less relevant feature (i.e., people's action). Contrary to the hypotheses, concepts that generalised over less relevant features were rated lower than concepts that matched the classified image precisely. Moreover, their ratings were no better than those for systematic misrepresentations of the less relevant feature. Conversely, participants were highly sensitive to imprecisions in relevant features. These findings cast doubt on the assumption that people can easily infer from C-XAI outputs whether AI models have gained a deeper understanding of complex situations.