biomarker discovery
TriAgent: Automated Biomarker Discovery with Deep Research Grounding for Triage in Acute Care by LLM-Based Multi-Agent Collaboration
Delikoyun, Kerem, Chen, Qianyu, Kuan, Win Sen, Soong, John Tshon Yit, Cove, Matthew Edward, Hayden, Oliver
Emergency departments worldwide face rising patient volumes, workforce shortages, and variability in triage decisions that threaten the delivery of timely and accurate care. Current triage methods rely primarily on vital signs, routine laboratory values, and clinicians' judgment, which, while effective, often miss emerging biological signals that could improve risk prediction for infection typing or antibiotic administration in acute conditions. To address this challenge, we introduce TriAgent, a large language model (LLM)-based multi-agent framework that couples automated biomarker discovery with deep research for literature-grounded validation and novelty assessment. TriAgent employs a supervisor research agent to generate research topics and delegate targeted queries to specialized sub-agents for evidence retrieval from various data sources. Findings are synthesized to classify biomarkers as either grounded in existing knowledge or flagged as novel candidates, offering transparent justification and highlighting unexplored pathways in acute care risk stratification. Unlike prior frameworks limited to existing routine clinical biomarkers, TriAgent aims to deliver an end-to-end framework from data analysis to literature grounding to improve transparency, explainability and expand the frontier of potentially actionable clinical biomarkers. Given a user's clinical query and quantitative triage data, TriAgent achieved a topic adherence F1 score of 55.7 +/- 5.0%, surpassing the CoT-ReAct agent by over 10%, and a faithfulness score of 0.42 +/- 0.39, exceeding all baselines by more than 50%. Across experiments, TriAgent consistently outperformed state-of-the-art LLM-based agentic frameworks in biomarker justification and literature-grounded novelty assessment. We share our repo: https://github.com/CellFace/TriAgent.
An Interpretable Ensemble Framework for Multi-Omics Dementia Biomarker Discovery Under HDLSS Conditions
Lee, Byeonghee, Kang, Joonsung
The advent of multi-omics technologies has revolutionized biomedical research, enabling simultaneous interrogation of genomic, transcriptomic, proteomic, and metabolomic layers [Wang et al., 2021a]. This integrative paradigm has yielded unprecedented insights into the molecular architecture of complex diseases, particularly neurodegenerative disorders such as Alzheimer's disease. However, multi-omics datasets are often characterized by high-dimensional variables and limited sample sizes--a configuration known as high-dimension low-sample size (HDLSS). Under such constraints, conventional statistical methods suffer from reduced power and unrealistic assumptions [Fan and Lv, 2008], while deep learning models may exhibit overfitting and lack interpretability [LeCun et al., 2015]. Recent advances in dementia biomarker discovery have embraced multi-omics integration. For example, Iturria-Medina [2018] fused neuroimaging and omics data to identify disease-relevant signatures. Zhang [2020] employed transcriptomic-proteomic fusion to uncover molecular markers, and Lee [2022] demonstrated the discriminative utility of metabolomic features in Alzheimer's pathology. These efforts build upon foundational work in integrative omics [Hasin, 2017, Karczewski and Snyder, 2018], yet challenges persist in elucidating latent gene networks and selecting statistically robust features amidst inter-feature dependencies.
Optimizing Prognostic Biomarker Discovery in Pancreatic Cancer Through Hybrid Ensemble Feature Selection and Multi-Omics Data
Zobolas, John, George, Anne-Marie, López, Alberto, Fischer, Sebastian, Becker, Marc, Aittokallio, Tero
Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers compared to the conventional, late-fusion CoxLasso models, while maintaining comparable discrimination performance. Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.
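The two ingredients named in this abstract, rank aggregation by voting and a Pareto front over accuracy and sparsity, can be illustrated with a minimal sketch. It assumes a plain Borda count as the voting rule and leaves open how a final point on the front is chosen; neither detail is given in the abstract, and the function names below are hypothetical rather than part of mlr3fselect.

```python
from collections import defaultdict


def borda_rank(rankings):
    """Aggregate several ranked feature lists with a simple Borda count.

    Assumption: Borda is used only as an example of a voting rule.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - position  # top-ranked feature earns the most points
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


def pareto_front(points):
    """Return the (n_features, score) pairs not dominated by any other pair.

    Fewer features and a higher score are both considered better.
    """
    front = []
    for n, s in points:
        dominated = any(
            (n2 <= n and s2 >= s) and (n2 < n or s2 > s)
            for n2, s2 in points
        )
        if not dominated:
            front.append((n, s))
    return sorted(front)


# Hypothetical rankings from three model/subsample combinations.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
consensus = borda_rank(rankings)

# Hypothetical (number of features, discrimination score) pairs.
candidates = [(1, 0.60), (2, 0.70), (3, 0.69), (5, 0.72), (10, 0.72)]
front = pareto_front(candidates)
```

Here the 3-feature and 10-feature candidates drop out because a smaller set scores at least as well, which is the sense in which a Pareto front removes the need for a user-defined threshold.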
Deep-learning-based clustering of OCT images for biomarker discovery in age-related macular degeneration (Pinnacle study report 4)
Holland, Robbie, Kaye, Rebecca, Hagag, Ahmed M., Leingang, Oliver, Taylor, Thomas R. P., Bogunović, Hrvoje, Schmidt-Erfurth, Ursula, Scholl, Hendrik P. N., Rueckert, Daniel, Lotery, Andrew J., Sivaprasad, Sobha, Menten, Martin J.
Diseases are currently managed using grading systems that stratify patients into stages, which indicate patient risk and guide clinical management. However, these broad categories typically lack prognostic value, and proposals for new biomarkers are currently limited to anecdotal observations. In this work, we introduce a deep-learning-based biomarker proposal system for the purpose of accelerating biomarker discovery in age-related macular degeneration (AMD). It works by first training a neural network using self-supervised contrastive learning to discover, without any clinical annotations, features relating to both known and unknown AMD biomarkers present in 46,496 retinal optical coherence tomography (OCT) images. To interpret the discovered biomarkers, we partition the images into 30 subsets, termed clusters, that contain similar features. We then conduct two parallel 1.5-hour semi-structured interviews with two independent teams of retinal specialists that describe each cluster in clinical language. Overall, both teams independently identified clearly distinct characteristics in 27 of 30 clusters, of which 23 were related to AMD. Seven were recognised as known biomarkers already used in established grading systems and 16 depicted biomarker combinations or subtypes that are either not yet used in grading systems, were only recently proposed, or were unknown. Clusters separated incomplete from complete retinal atrophy, intraretinal from subretinal fluid and thick from thin choroids, and in simulation outperformed clinically-used grading systems in prognostic value. Overall, contrastive learning enabled the automatic proposal of AMD biomarkers that go beyond the set used by clinically established grading systems. Ultimately, we envision that equipping clinicians with discovery-oriented deep-learning tools can accelerate discovery of novel prognostic biomarkers.
Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection
Cattelani, Luca, Fortino, Vittorio
The challenge in biomarker discovery using machine learning from omics data lies in the abundance of molecular features but scarcity of samples. Most feature selection methods in machine learning require evaluating various sets of features (models) to determine the most effective combination. This process, typically conducted using a validation dataset, involves testing different feature sets to optimize the model's performance. Every evaluation carries performance estimation error, and when the selection involves many models, the performance of the best ones is almost certainly overestimated. Biomarker identification with feature selection methods can be addressed as a multi-objective problem with trade-offs between predictive ability and parsimony in the number of features. Genetic algorithms are a popular tool for multi-objective optimization, but they evolve numerous solutions and are thus prone to overestimation. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no algorithm existed capable of reducing the overestimation during the optimization, improving model selection, or applied in the more general multi-objective domain. We propose DOSA-MO, a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
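The selection-induced overestimation described in this abstract can be reproduced with a small simulation. The sketch below is an illustration of the phenomenon only, not of DOSA-MO itself: it assumes every candidate model has the same true accuracy, so any gap between the best validation score and that accuracy is pure selection bias. The function name and all parameter values are hypothetical.

```python
import random


def simulate_overestimation(n_models, n_val=50, true_acc=0.7, n_trials=200, seed=0):
    """Average gap between the best validation accuracy among n_models
    equally good models and their shared true accuracy."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        # Each model is scored on n_val validation samples; each prediction
        # is correct with probability true_acc.
        best = max(
            sum(rng.random() < true_acc for _ in range(n_val)) / n_val
            for _ in range(n_models)
        )
        total_gap += best - true_acc
    return total_gap / n_trials


gap_single = simulate_overestimation(n_models=1)
gap_many = simulate_overestimation(n_models=100)
```

With a single model the average gap stays near zero, while picking the best of 100 models inflates the estimate noticeably, even though no model is actually better than the others.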
scBeacon: single-cell biomarker extraction via identifying paired cell clusters across biological conditions with contrastive siamese networks
Liu, Chenyu, Kweon, Yong Jin, Ding, Jun
Despite the breakthroughs in biomarker discovery facilitated by differential gene analysis, challenges remain, particularly at the single-cell level. Traditional methodologies heavily rely on user-supplied cell annotations, focusing on individually expressed data, often neglecting the critical interactions between biological conditions, such as healthy versus diseased states. In response, here we introduce scBeacon, an innovative framework built upon a deep contrastive siamese network. scBeacon pioneers an unsupervised approach, adeptly identifying matched cell populations across varied conditions, enabling a refined differential gene analysis. By utilizing a VQ-VAE framework, a contrastive siamese network, and a greedy iterative strategy, scBeacon effectively pinpoints differential genes that hold potential as key biomarkers. Comprehensive evaluations on a diverse array of datasets validate scBeacon's superiority over existing single-cell differential gene analysis tools. Its precision and adaptability underscore its significant role in enhancing diagnostic accuracy in biomarker discovery. With the emphasis on the importance of biomarkers in diagnosis, scBeacon is positioned to be a pivotal asset in the evolution of personalized medicine and targeted treatments.
mSPD-NN: A Geometrically Aware Neural Framework for Biomarker Discovery from Functional Connectomics Manifolds
D'Souza, Niharika S., Venkataraman, Archana
Connectomics has emerged as a powerful tool in neuroimaging and has spurred recent advancements in statistical and machine learning methods for connectivity data. Despite connectomes inhabiting a matrix manifold, most analytical frameworks ignore the underlying data geometry. This is largely because simple operations, such as mean estimation, do not have easily computable closed-form solutions. We propose a geometrically aware neural framework for connectomes, i.e., the mSPD-NN, designed to estimate the geodesic mean of a collection of symmetric positive definite (SPD) matrices. The mSPD-NN is comprised of bilinear fully connected layers with tied weights and utilizes a novel loss function to optimize the matrix-normal equation arising from Fréchet mean estimation. Via experiments on synthetic data, we demonstrate the efficacy of our mSPD-NN against common alternatives for SPD mean estimation, providing competitive performance in terms of scalability and robustness to noise. We illustrate the real-world flexibility of the mSPD-NN in multiple experiments on rs-fMRI data and demonstrate that it uncovers stable biomarkers associated with subtle network differences among patients with ADHD-ASD comorbidities and healthy controls.
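For orientation, the Fréchet mean of SPD matrices can be written out as follows. This is the standard textbook formulation under the affine-invariant Riemannian metric, which is assumed here because the abstract does not name the metric; the symbols $S_i$, $N$, and $\hat{M}$ are introduced only for this illustration and may differ from the paper's notation.

```latex
% Assumption: affine-invariant Riemannian distance between SPD matrices A and B
\begin{align}
d(A, B) &= \left\| \log\!\left( A^{-1/2} B A^{-1/2} \right) \right\|_F \\
% Fréchet (geodesic) mean of S_1, ..., S_N: minimizer of summed squared distances
\hat{M} &= \arg\min_{M \succ 0} \sum_{i=1}^{N} d^2(M, S_i) \\
% First-order stationarity condition satisfied by the minimizer
0 &= \sum_{i=1}^{N} \log\!\left( \hat{M}^{-1/2} S_i \hat{M}^{-1/2} \right)
\end{align}
```

The last line has no closed-form solution for $N > 2$ in general, which is why the mean is usually found iteratively, and why a learned solver is a plausible alternative.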
'Artificial Intelligence to Enable Multi-Omics Integration - Delivering Enhanced Data Utilisation for Biomarker Discovery' (Ref FHMS - FF - 01 BIO) at University of Surrey on FindAPhD.com
This is a great opportunity for an enthusiastic student to be trained in the cutting-edge disciplines of Artificial Intelligence (AI) and Machine Learning (ML), intersected with medicine and biology, as well as biomarker technologies for characterisation of human disease and wellness. The project also offers real-world data analysis skills development through the use of the world-leading UK Biobank. Such training offers a great advantage for a career in research or industry. Specifically, in this studentship we focus on the increasing need in medicine for molecular indicators (biomarkers) in a range of diseases. These can be identified in blood or other body fluids and enable more precise patient diagnosis as well as prediction of disease progression and response to treatments.
RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data
Godon, Thibaud, Plante, Pier-Luc, Bauvin, Baptiste, Francovic-Fontaine, Elina, Drouin, Alexandre, Corbeil, Jacques
Background: Understanding the relationship between omics data and the phenotype is a central problem in precision medicine. The high dimensionality of metabolomics data challenges learning algorithms in terms of scalability and generalization. Most learning algorithms do not produce interpretable models. Method: We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules. Results: Applications on metabolomics data show that it produces models that achieve high predictive performance. The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.
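To make "conjunctions or disjunctions of decision rules" concrete, here is a minimal sketch of how such classifiers can be composed and combined. It is not the RandomSCM implementation: the majority-vote aggregation, the threshold-style rules, and the metabolite names `m1`, `m2`, `m3` are all assumptions made for illustration.

```python
def make_rule(feature, threshold, greater=True):
    """A single decision rule: compares one feature against a threshold."""
    if greater:
        return lambda x: x[feature] > threshold
    return lambda x: x[feature] <= threshold


def conjunction(rules):
    """Classifier that fires only if every rule holds (logical AND)."""
    return lambda x: all(rule(x) for rule in rules)


def disjunction(rules):
    """Classifier that fires if at least one rule holds (logical OR)."""
    return lambda x: any(rule(x) for rule in rules)


def majority_vote(classifiers, x):
    """Assumed aggregation: positive if more than half of the classifiers fire."""
    votes = sum(1 for clf in classifiers if clf(x))
    return votes * 2 > len(classifiers)


# Hypothetical metabolite intensities for one sample.
sample = {"m1": 5.0, "m2": 1.0, "m3": 0.2}

ensemble = [
    conjunction([make_rule("m1", 3.0), make_rule("m2", 2.0, greater=False)]),
    disjunction([make_rule("m3", 0.5), make_rule("m2", 2.0)]),
    conjunction([make_rule("m1", 4.0)]),
]
prediction = majority_vote(ensemble, sample)
```

Each member stays readable as a short logical statement over a few features, which is what makes the features it uses candidates for biomarkers.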
Artificial Intelligence and Machine Learning Show Promise in Cancer Diagnosis and Treatment
"The biomarker field is blessed with a plethora of imaging and molecular-based data, and at the same time, plagued with so much data that no one individual can comprehend it all," explained Guest Editor Karin Rodland, PhD, Pacific Northwest National Laboratory, Richland; and Oregon Health and Science University, Portland, OR, USA. "AI offers a solution to that problem, and it has the potential to uncover novel interactions that more accurately reflect the biology of cancer and other diseases." Promising applications of AI, DL, and ML presented in this issue include identifying early-stage cancers, inferring the site of the specific cancer, aiding in the assignment of appropriate therapeutic options for each patient, characterizing the tumor microenvironment, and predicting the response to immunotherapy. A comprehensive overview of the literature regarding the use of AI approaches to identify biomarkers for ovarian and pancreatic cancer illustrates underlying principles and looks at the gaps and challenges that face the field as a whole. Ovarian and pancreatic cancers are rare, but lethal because they lack early symptoms and detection.