FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization

Suzuki, Yuto, Awolade, Paul, LaBarbera, Daniel V., Banaei-Kashani, Farnoush

arXiv.org Artificial Intelligence

Molecule generation using generative AI is vital for drug discovery, yet class-specific datasets often contain fewer than 100 training examples. While fragment-based models handle limited data better than atom-based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end-to-end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a "vocabulary selection" problem, using dynamic Q-learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real-world cancer drug discovery experiments, FRAGMENTA's Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. Furthermore, the fully autonomous Agent-Agent system outperformed traditional Human-Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.
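The "vocabulary selection" framing above can be made concrete with a toy Q-learning sketch: states are partial fragment vocabularies, actions add a candidate fragment, and the reward scores how well the vocabulary covers a target fragment set. Everything here (the SMILES strings, the coverage reward, the hyperparameters) is an illustrative stand-in, not FRAGMENTA's actual formulation.

```python
import random

CANDIDATES = ["c1ccccc1", "C(=O)O", "CCN", "CO"]   # hypothetical SMILES fragments
TARGET = {"c1ccccc1", "C(=O)O", "CCN"}             # fragments we "need" for coverage

def coverage_reward(vocab):
    # Coverage of the target set, minus a small penalty per extra fragment.
    return len(vocab & TARGET) - 0.1 * len(vocab - TARGET)

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # (frozenset vocabulary, candidate fragment) -> Q-value
    for _ in range(episodes):
        vocab = set()
        for _ in range(3):                       # build a vocabulary of up to 3 fragments
            state = frozenset(vocab)
            acts = [a for a in CANDIDATES if a not in vocab]
            if rng.random() < eps:               # epsilon-greedy exploration
                act = rng.choice(acts)
            else:
                act = max(acts, key=lambda a: q.get((state, a), 0.0))
            new = vocab | {act}
            r = coverage_reward(new) - coverage_reward(vocab)  # incremental reward
            nxt = max((q.get((frozenset(new), b), 0.0)
                       for b in CANDIDATES if b not in new), default=0.0)
            old = q.get((state, act), 0.0)
            q[(state, act)] = old + alpha * (r + gamma * nxt - old)  # Q-learning update
            vocab = new
    return q

def greedy_vocab(q, steps=3):
    # Roll out the learned policy greedily from the empty vocabulary.
    vocab = set()
    for _ in range(steps):
        state = frozenset(vocab)
        acts = [a for a in CANDIDATES if a not in vocab]
        act = max(acts, key=lambda a: q.get((state, a), 0.0))
        if q.get((state, act), 0.0) <= 0:
            break
        vocab.add(act)
    return vocab

q = train()
```

After training, a greedy rollout recovers the useful fragments while skipping the one that adds no coverage, which is the behavior the joint fragmentation/generation objective is meant to encourage.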


Police agencies turn to virtual reality to improve split-second decision-making

FOX News



Large language models management of medications: three performance analyses

Henry, Kelli, Xu, Steven, Blotske, Kaitlin, Cargile, Moriah, Barreto, Erin F., Murray, Brian, Smith, Susan, Bauer, Seth R., Zhao, Xingmeng, Tilley, Adeleine, Gao, Yanjun, Liu, Tianming, Sohn, Sunghwan, Sikora, Andrea

arXiv.org Artificial Intelligence

Purpose: Large language models (LLMs) have proven performance for certain diagnostic tasks; however, few studies have evaluated their consistency in recommending appropriate medication regimens for a given diagnosis. Medication management is a complex task that requires synthesis of drug formulation and complete order instructions for safe use. Here, the performance of GPT-4o, an LLM available with ChatGPT, was tested on three medication management tasks. Methods: GPT-4o performance was tested using three medication tasks: identifying available formulations for a given generic drug name, identifying drug-drug interactions (DDIs) for a given medication regimen, and preparing a medication order for a given generic drug name. For each experiment, the model's raw text response was captured exactly as returned and evaluated by clinician review in addition to standard LLM metrics, including Term Frequency-Inverse Document Frequency (TF-IDF) vectors, normalized Levenshtein similarity, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1/ROUGE-L F1) between each response and its reference string. Results: For the first task of drug-formulation matching, GPT-4o had 49% accuracy in matching generic medications to all available formulations, with an average of 1.23 omissions and 1.14 hallucinations per medication. For the second task of drug-drug interaction identification, accuracy was 54.7% for identifying the DDI pair. For the third task, GPT-4o generated order sentences containing no medication or abbreviation errors in 65.8% of cases. Conclusions: Model performance for basic medication tasks was consistently poor. This evaluation highlights the need for domain-specific training through clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.
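Two of the metrics named above are easy to state precisely. The pure-Python sketch below implements normalized Levenshtein similarity and ROUGE-1 F1; the example strings are invented, and this is an illustration of the metrics themselves, not the authors' evaluation pipeline.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (two-row variant).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # Normalize distance by the longer string; 1.0 means identical.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def rouge1_f1(candidate: str, reference: str) -> float:
    # Unigram-overlap F1 between candidate and reference token bags.
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

sim = levenshtein_similarity("metoprolol 50 mg tablet", "metoprolol 25 mg tablet")
f1 = rouge1_f1("give 50 mg orally daily", "give 25 mg orally daily")
```

Both metrics are surface-level string comparisons, which is why the study pairs them with clinician review: a response can score high on token overlap while still containing a clinically meaningful dosing error.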


Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

Khatwani, Saksham, Cheng, He, Afshar, Majid, Dligach, Dmitriy, Gao, Yanjun

arXiv.org Artificial Intelligence

Large language models (LLMs) show promise for diagnostic reasoning but often lack reliable, knowledge-grounded inference. Knowledge graphs (KGs), such as the Unified Medical Language System (UMLS), offer structured biomedical knowledge that can support trustworthy reasoning. Prior approaches typically integrate KGs via retrieval-augmented generation or fine-tuning, inserting KG content into prompts rather than enabling structured reasoning. We explore an alternative paradigm: treating the LLM as a reward model over KG reasoning paths, where the model learns to judge whether a candidate path leads to the correct diagnosis for a given patient input. This approach is inspired by recent work that leverages reward training to enhance model reasoning abilities, and is grounded in computational theory, which suggests that verifying a solution is often easier than generating one from scratch. It also parallels physicians' diagnostic assessment, in which they judge which sequences of findings and intermediate conditions most plausibly support a diagnosis. We first systematically evaluate five task formulations for knowledge-path judging and eight training paradigms. Second, we test whether path-judging abilities generalize to downstream diagnostic tasks, including diagnosis summarization and medical question answering. Experiments with three open-source instruction-tuned LLMs reveal both promise and brittleness: while specific reward optimization and distillation lead to strong path-judging performance, transferability to downstream tasks remains weak. Our findings provide the first systematic assessment of "reward-model-style" reasoning over clinical KGs, offering insights into how structured, reward-based supervision influences diagnostic reasoning in GenAI systems for healthcare.


Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective

Caufield, Harry, Ghosh, Satrajit, Kong, Sek Wong, Parker, Jillian, Sheffield, Nathan, Patel, Bhavesh, Williams, Andrew, Clark, Timothy, Munoz-Torres, Monica C.

arXiv.org Artificial Intelligence

AI-readiness describes the degree to which data may be optimally and ethically used for subsequent AI and Machine Learning (AI/ML) methods, where those methods may involve some combination of model training, data classification, and ethical, explainable prediction. The Bridge2AI consortium has defined the particular criteria a biomedical dataset may possess to render it AI-ready: in brief, a dataset's readiness is related to its FAIRness, provenance, degree of characterization, explainability, sustainability, and computability, in addition to its accompaniment with documentation about ethical data practices. To ensure AI-readiness and to clarify data structure and relationships within Bridge2AI's Grand Challenges (GCs), particular types of metadata are necessary. The GCs within the Bridge2AI initiative include four data-generating projects focusing on generating AI/ML-ready datasets to tackle complex biomedical and behavioral research problems. These projects develop standardized, multimodal data, tools, and training resources to support AI integration, while addressing ethical data practices. Examples include using voice as a biomarker, building interpretable genomic tools, modeling disease trajectories with diverse multimodal data, and mapping cellular and molecular health indicators across the human body. This report assesses the state of metadata creation and standardization in the Bridge2AI GCs, provides guidelines where required, and identifies gaps and areas for improvement across the program. New projects, including those outside the Bridge2AI consortium, would benefit from what we have learned about creating metadata as part of efforts to promote AI readiness.


Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression

Rieck, Jared, Wrobel, Julia, Gowin, Joshua L., Wang, Yue, Paulus, Martin, Peterson, Ryan

arXiv.org Machine Learning

Recent advances in neuroimaging analysis have enabled accurate decoding of mental state from brain activation patterns during functional magnetic resonance imaging scans. A commonly applied tool for this purpose is principal components regression regularized with the least absolute shrinkage and selection operator (LASSO PCR), a type of multi-voxel pattern analysis (MVPA). This model presumes that all components are equally likely to harbor relevant information, when in fact the task-related signal may be concentrated in specific components. In such cases, the model will fail to select the optimal set of principal components that maximizes the total signal relevant to the cognitive process under study. Here, we present modifications to LASSO PCR that allow for a regularization penalty tied directly to the index of the principal component, reflecting a prior belief that task-relevant signal is more likely to be concentrated in components explaining greater variance. Additionally, we propose a novel hybrid method, Joint Sparsity-Ranked LASSO (JSRL), which integrates component-level and voxel-level activity under an information parity framework and imposes ranked sparsity to guide component selection. We apply the models to brain activation during risk taking, monetary incentive, and emotion regulation tasks. Results demonstrate that incorporating sparsity ranking into LASSO PCR produces models with enhanced classification performance, with JSRL achieving up to 51.7% improvement in cross-validated deviance $R^2$ and 7.3% improvement in cross-validated AUC. Furthermore, sparsity-ranked models perform as well as or better than standard LASSO PCR approaches across all classification tasks and allocate predictive weight to brain regions consistent with their established functional roles, offering a robust alternative for MVPA.
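The index-tied penalty can be sketched as a weighted LASSO solved by proximal gradient descent (ISTA), with penalty weights that grow in the component index so that later, lower-variance components are penalized more. This is a minimal illustration of the ranked-sparsity idea on synthetic data, not the authors' JSRL implementation; the square-root-of-index weighting and all hyperparameters are assumptions.

```python
import numpy as np

def ranked_lasso(X, y, lam=0.1, n_iter=2000):
    # Weighted LASSO via ISTA: minimize ||y - X b||^2 / (2n) + lam * sum_j w_j |b_j|.
    n, p = X.shape
    w = np.sqrt(np.arange(1, p + 1))        # penalty weight grows with component index
    L = np.linalg.norm(X, 2) ** 2 / n       # Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n     # gradient step on the least-squares term
        z = beta - grad / L
        thresh = lam * w / L                # per-coordinate soft-threshold level
        beta = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    return beta

# Synthetic "components": signal lives only in the first two columns.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)
beta_hat = ranked_lasso(X, y)
```

Because the threshold rises with the index, late coefficients must carry proportionally more signal to enter the model, which is exactly the prior the abstract describes for high-variance leading components.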


Enabling Down Syndrome Research through a Knowledge Graph-Driven Analytical Framework

Krishnamurthy, Madan, Saha, Surya, Lo, Pierrette, Whetzel, Patricia L., Issabekova, Tursynay, Vargas, Jamed Ferreris, DiGiovanna, Jack, Haendel, Melissa A.

arXiv.org Artificial Intelligence

Trisomy 21 results in Down syndrome, a multifaceted genetic disorder with diverse clinical phenotypes, including heart defects, immune dysfunction, neurodevelopmental differences, and early-onset dementia risk. Heterogeneity and fragmented data across studies challenge comprehensive research and translational discovery. The NIH INCLUDE (INvestigation of Co-occurring conditions across the Lifespan to Understand Down syndromE) initiative has assembled harmonized participant-level datasets, yet realizing their potential requires integrative analytical frameworks. We developed a knowledge graph-driven platform transforming nine INCLUDE studies, comprising 7,148 participants, 456 conditions, 501 phenotypes, and over 37,000 biospecimens, into a unified semantic infrastructure. Cross-resource enrichment with Monarch Initiative data expands coverage to 4,281 genes and 7,077 variants. The resulting knowledge graph contains over 1.6 million semantic associations, enabling AI-ready analysis with graph embeddings and path-based reasoning for hypothesis generation. Researchers can query the graph via SPARQL or natural language interfaces. This framework converts static data repositories into dynamic discovery environments, supporting cross-study pattern recognition, predictive modeling, and systematic exploration of genotype-phenotype relationships in Down syndrome.
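Path-based reasoning over such a graph can be illustrated with a few in-memory triples and a breadth-first path search. The entities and predicates below are hypothetical stand-ins, not INCLUDE or Monarch records, and a production system would run such queries over SPARQL rather than Python dictionaries.

```python
from collections import defaultdict, deque

# Toy genotype-condition-phenotype triples (subject, predicate, object).
TRIPLES = [
    ("GENE:DYRK1A", "located_on", "chr21"),
    ("GENE:DYRK1A", "associated_with", "COND:intellectual_disability"),
    ("COND:intellectual_disability", "has_phenotype", "PHENO:developmental_delay"),
    ("GENE:APP", "located_on", "chr21"),
    ("GENE:APP", "associated_with", "COND:early_onset_dementia"),
]

graph = defaultdict(list)
for s, p, o in TRIPLES:
    graph[s].append((p, o))

def paths(start, goal, max_len=4):
    # Breadth-first enumeration of predicate paths from start to goal.
    out, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal:
            out.append(path)
            continue
        if len(path) < max_len:
            for pred, nxt in graph[node]:
                queue.append((nxt, path + [(pred, nxt)]))
    return out

found = paths("GENE:DYRK1A", "PHENO:developmental_delay")
```

Each returned path is an explicit chain of semantic associations, which is what makes such graphs amenable to embedding and hypothesis-generation methods rather than opaque joins over flat tables.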


Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Zhou, Shuang, Xie, Wenya, Li, Jiaxi, Zhan, Zaifu, Song, Meijia, Yang, Han, Espinoza, Cheyenna, Welton, Lindsay, Mai, Xinnie, Jin, Yanwei, Xu, Zidu, Chung, Yuen-Hei, Xing, Yiyun, Tsai, Meng-Han, Schaffer, Emma, Shi, Yucheng, Liu, Ninghao, Liu, Zirui, Zhang, Rui

arXiv.org Artificial Intelligence

As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.


Uncertainty-Driven Radar-Inertial Fusion for Instantaneous 3D Ego-Velocity Estimation

Rai, Prashant Kumar, Kowsari, Elham, Strokina, Nataliya, Ghabcheloo, Reza

arXiv.org Artificial Intelligence

F_2 = ComplexBN(ComplexConv(F_1))  (3)

Equation (3) further processes the features F_1 from the previous layer through another complex convolution layer, and the output is normalized using complex batch normalization. This step enhances the stability and efficiency of the network by standardizing the features before they are further processed.

F_3 = SpatialAttention(ChannelAttention(F_2))  (4)

In Equation (4), an attention mechanism (spatial + channel) is applied to F_2, which allows the network to focus on the most informative features by weighting them based on their significance for ego-velocity estimation. Spatial attention is applied to the feature maps (Doppler, channels) and channel attention to the samples dimension. Moreover, each complex-valued residual block in the network incorporates a skip connection: the output of each block is concatenated with its input before being passed to the subsequent blocks. This architecture choice helps mitigate the vanishing-gradient problem during training by allowing gradients to flow directly through the network layers, thus enhancing the learning and convergence of the network [34]. The network is designed to effectively handle the complex-valued input from radar scans, ensuring robust feature extraction for subsequent processing stages.
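A minimal numpy sketch of the building blocks in Equations (3)-(4) might look as follows. The kernel values, the simplified complex batch norm (zero mean, unit mean power, rather than full covariance whitening), and the single magnitude-based attention weighting are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def complex_conv1d(x, k):
    # 'valid' convolution of a complex signal with a complex kernel.
    return np.convolve(x, k, mode="valid")

def complex_batchnorm(x, eps=1e-5):
    # Center the complex features and scale to unit mean power; a
    # simplification of covariance-based complex batch normalization.
    xc = x - x.mean()
    return xc / np.sqrt((np.abs(xc) ** 2).mean() + eps)

def attention_weight(f):
    # Softmax over feature magnitudes: emphasize informative entries.
    w = np.exp(np.abs(f) - np.abs(f).max())
    return f * (w / w.sum())

x = np.array([1 + 1j, 2 - 1j, 0.5j, -1 + 0j, 3 + 2j])   # toy complex radar features
k = np.array([0.5 - 0.5j, 0.25 + 0.1j])                  # toy complex kernel

f2 = complex_batchnorm(complex_conv1d(x, k))   # Eq. (3)
f3 = attention_weight(f2)                      # Eq. (4), collapsed to one weighting
f_out = np.concatenate([f3, f2])               # skip connection: concat output with input
```

The concatenating skip connection at the end mirrors the residual-block description: gradients can reach f2 directly through the second half of f_out, bypassing the attention stage.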


Machine Learning Applications Related to Suicide in Military and Veterans: A Scoping Literature Review

Zhang, Yuhan, Wei, Yishu, Wang, Yanshan, Xiao, Yunyu, Poropatich, Ronald K., Haas, Gretchen L., Zhang, Yiye, Weng, Chunhua, Liu, Jinze, Brenner, Lisa A., Bjork, James M., Peng, Yifan

arXiv.org Artificial Intelligence

Suicide remains one of the main preventable causes of death among active service members and veterans. Early detection and prediction are crucial in suicide prevention. Machine learning techniques have yielded promising results in this area recently. This study aims to assess and summarize current research and to provide a comprehensive review of the application of machine learning techniques in assessing and predicting suicidal ideation, attempts, and mortality among members of military and veteran populations. A keyword search using PubMed, IEEE, ACM, and Google Scholar was conducted, and the PRISMA protocol was adopted for relevant study selection. Thirty-two articles met the inclusion criteria. These studies consistently identified risk factors relevant to mental health issues such as depression, post-traumatic stress disorder (PTSD), suicidal ideation, prior attempts, physical health problems, and demographic characteristics. Machine learning models applied in this area have demonstrated reasonable predictive accuracy. However, research gaps remain. First, many studies have overlooked metrics that distinguish between false positives and negatives, such as positive predictive value and negative predictive value, which are crucial in the context of suicide prevention policies. Second, more dedicated approaches to handling survival and longitudinal data should be explored. Lastly, most studies focused on machine learning methods, with limited discussion of their connection to clinical rationales. In summary, machine learning analyses have identified a wide range of risk factors associated with suicide in military populations. The diversity and complexity of these factors also demonstrate that effective prevention strategies must be comprehensive and flexible.