 association study


Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis

Zhang, Weitong, Qiao, Mengyun, Zang, Chengqi, Niederer, Steven, Matthews, Paul M, Bai, Wenjia, Kainz, Bernhard

arXiv.org Artificial Intelligence

Identifying associations between imaging phenotypes, disease risk factors, and clinical outcomes is essential for understanding disease mechanisms. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal data. To address this, we introduce Multi-agent Exploratory Synergy for the Heart (MESHAgents): a framework that leverages large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies. Specifically, we orchestrate a multi-disciplinary team of AI agents, which spontaneously generate and converge on insights through iterative, self-organizing reasoning. The framework dynamically synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS). We demonstrate the system's capabilities through a population-based study of imaging phenotypes of the heart and aorta. MESHAgents autonomously uncovered correlations between imaging phenotypes and a wide range of non-imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents-discovered phenotypes achieve performance comparable to expert-selected phenotypes, with mean AUC differences as small as $-0.004_{\pm0.010}$ on disease classification tasks. Notably, the recall score improves for 6 out of 9 disease types. Our framework provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert-driven methods.


A U-Statistic-based random forest approach for genetic interaction study

Li, Ming, Peng, Ruo-Sin, Wei, Changshuai, Lu, Qing

arXiv.org Artificial Intelligence

Variations in complex traits are influenced by multiple genetic variants, environmental risk factors, and their interactions. Though substantial progress has been made in identifying single genetic variants associated with complex traits, detecting the gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, searching for interactions is limited to pair-wise interactions due to the exponentially increased feature space and computational intensity. Alternatively, recursive partitioning approaches, such as random forests, have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-Statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. Through simulation studies, we showed that the Forest U-Test outperformed existing methods. The proposed method was also applied to study Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value less than 0.001. The finding was also replicated in two independent datasets with p-values of 5.93e-19 and 4.70e-17, respectively.


Collapsing ROC approach for risk prediction research on both common and rare variants

Wei, Changshuai, Lu, Qing

arXiv.org Artificial Intelligence

Risk prediction that capitalizes on emerging genetic findings holds great promise for improving public health and clinical care. However, recent risk prediction research has shown that predictive tests formed on existing common genetic loci, including those from genome-wide association studies, have lacked sufficient accuracy for clinical use. Because most rare variants on the genome have not yet been studied for their role in risk prediction, future disease prediction discoveries should shift toward a more comprehensive risk prediction strategy that takes into account both common and rare variants. We are proposing a collapsing receiver operating characteristic (CROC) approach for risk prediction research on both common and rare variants. The new approach is an extension of a previously developed forward ROC (FROC) approach, with additional procedures for handling rare variants. The approach was evaluated through the use of 533 single-nucleotide polymorphisms (SNPs) in 37 candidate genes from the Genetic Analysis Workshop 17 mini-exome data set. We found that a prediction model built on all SNPs gained more accuracy (AUC = 0.605) than one built on common variants alone (AUC = 0.585). We further evaluated the performance of two approaches by gradually reducing the number of common variants in the analysis. We found that the CROC method attained more accuracy than the FROC method when the number of common variants in the data decreased. In an extreme scenario, when there are only rare variants in the data, the CROC reached an AUC value of 0.603, whereas the FROC had an AUC value of 0.524.
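The two ideas this abstract combines, collapsing rare variants and scoring a predictor by AUC, can be illustrated with a minimal sketch. This is not the authors' CROC procedure, whose details the abstract does not give. It assumes a generic burden-style collapse (an individual is flagged if they carry any minor allele at a variant below a frequency threshold) and computes AUC as the probability that a random case outranks a random control. The function names, the 0.01 threshold, and the toy genotypes are all hypothetical.

```python
from typing import List, Sequence


def collapse_rare(genotypes: Sequence[Sequence[int]],
                  maf: Sequence[float],
                  threshold: float = 0.01) -> List[int]:
    """Return 1 for each individual carrying a minor allele at any rare variant.

    Assumption: "rare" means minor allele frequency below `threshold`.
    """
    rare = [j for j, freq in enumerate(maf) if freq < threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as P(score of a random case > score of a random control); ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): 6 individuals x 3 variants, coded as minor-allele counts.
genotypes = [
    [0, 1, 0],  # case, carries rare variant 2
    [1, 0, 1],  # case, carries rare variant 3
    [2, 0, 0],  # case, common variant only
    [0, 0, 0],  # control
    [1, 0, 0],  # control
    [0, 0, 1],  # control, carries rare variant 3
]
maf = [0.30, 0.005, 0.002]   # variant 1 is common, variants 2 and 3 are rare
labels = [1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control

burden = collapse_rare(genotypes, maf)
burden_auc = auc(burden, labels)
print(burden, round(burden_auc, 3))
```

The pairwise AUC above is quadratic in sample size, which is fine for illustration; a rank-based formulation would be the usual choice for large cohorts.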


Functional Analysis of Variance for Association Studies

Vsevolozhskaya, Olga A., Zaykin, Dmitri V., Greenwood, Mark C., Wei, Changshuai, Lu, Qing

arXiv.org Artificial Intelligence

While progress has been made in identifying common genetic variants associated with human diseases, for most common complex diseases, the identified genetic variants only account for a small proportion of heritability. Challenges remain in finding additional unknown genetic variants predisposing to complex diseases. With the advance in next-generation sequencing technologies, sequencing studies have become commonplace in genetic research. The ongoing exome-sequencing and whole-genome-sequencing studies generate a massive amount of sequencing variants and allow researchers to comprehensively investigate their role in human diseases. The discovery of new disease-associated variants can be enhanced by utilizing powerful and computationally efficient statistical methods. In this paper, we propose a functional analysis of variance (FANOVA) method for testing an association of sequence variants in a genomic region with a qualitative trait. The FANOVA has a number of advantages: (1) it tests for a joint effect of gene variants, including both common and rare; (2) it fully utilizes linkage disequilibrium and genetic position information; and (3) it allows for either protective or risk-increasing causal variants. Through simulations, we show that FANOVA outperforms two popularly used methods, SKAT and a previously proposed method based on functional linear models (FLM), especially if the sample size of a study is small and/or sequence variants have low to moderate effects. We conduct an empirical study by applying three methods (FANOVA, SKAT and FLM) to sequencing data from the Dallas Heart Study. While SKAT and FLM respectively detected ANGPTL4 and ANGPTL3 associated with obesity, FANOVA was able to identify both genes associated with obesity.


StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis

Hernandez, Jose Guadalupe, Ghosh, Attri, Freda, Philip J., Meng, Yufei, Matsumoto, Nicholas, Moore, Jason H.

arXiv.org Artificial Intelligence

We present the Star-Based Automated Single-locus and Epistasis analysis tool - Genetic Programming (StarBASE-GP), an automated framework for discovering meaningful genetic variants associated with phenotypic variation in large-scale genomic datasets. StarBASE-GP uses a genetic programming-based multi-objective optimization strategy to evolve machine learning pipelines that simultaneously maximize explanatory power ($r^2$) and minimize pipeline complexity. Biological domain knowledge is integrated at multiple stages, including the use of nine inheritance encoding strategies to model deviations from additivity, a custom linkage disequilibrium pruning node that minimizes redundancy among features, and a dynamic variant recommendation system that prioritizes informative candidates for pipeline inclusion. We evaluate StarBASE-GP on a cohort of Rattus norvegicus (brown rat) to identify variants associated with body mass index, benchmarking its performance against a random baseline and a biologically naive version of the tool. StarBASE-GP consistently evolves Pareto fronts with superior performance, yielding higher accuracy in identifying both ground truth and novel quantitative trait loci, highlighting relevant targets for future validation. By incorporating evolutionary search and relevant biological theory into a flexible automated machine learning framework, StarBASE-GP demonstrates robust potential for advancing variant discovery in complex traits.
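To make the two-objective trade-off concrete, here is a minimal sketch of how a Pareto front over explanatory power and complexity can be extracted from a set of candidate pipelines. It is not StarBASE-GP's implementation. The `Pipeline` record, the integer complexity measure, and the candidate values are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Pipeline(NamedTuple):
    """Hypothetical summary of a candidate pipeline (not a StarBASE-GP type)."""
    name: str
    r2: float        # explanatory power, to be maximized
    complexity: int  # e.g. number of nodes, to be minimized


def dominates(a: Pipeline, b: Pipeline) -> bool:
    """True if `a` is at least as good as `b` on both objectives and strictly better on one."""
    no_worse = a.r2 >= b.r2 and a.complexity <= b.complexity
    strictly_better = a.r2 > b.r2 or a.complexity < b.complexity
    return no_worse and strictly_better


def pareto_front(candidates: List[Pipeline]) -> List[Pipeline]:
    """Keep every candidate that no other candidate dominates."""
    return [p for p in candidates
            if not any(dominates(q, p) for q in candidates)]


# Illustrative values only.
candidates = [
    Pipeline("A", 0.10, 2),
    Pipeline("B", 0.15, 5),
    Pipeline("C", 0.12, 6),  # dominated by B: lower r2 and more complex
    Pipeline("D", 0.20, 9),
    Pipeline("E", 0.10, 3),  # dominated by A: same r2 but more complex
]

front = pareto_front(candidates)
print([p.name for p in front])
```

Each pipeline left on the front represents a different compromise: none of them can gain explanatory power without becoming more complex.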


Improving Diseases Predictions Utilizing External Bio-Banks

Pinto, Hido, Segal, Eran

arXiv.org Artificial Intelligence

Machine learning has been successfully used in critical domains, such as medicine. However, extracting meaningful insights from biomedical data is often constrained by the scarcity of available disease labels. In this research, we demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations, even when predictive improvements in disease modeling are limited. We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features and apply them to the UK Biobank (UKBB) for downstream analysis. The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors. As a result, our approach successfully identified biologically relevant connections that were not previously known to the predictive models. Additionally, we applied a genome-wide association study (GWAS) on key metabolomics features, revealing a link between vascular dementia and smoking. Although this is a well-established epidemiological relationship, the link was not embedded in the model's training data, which validated the method's ability to extract meaningful signals. Furthermore, by integrating survival models as inputs in the 10K data, we uncovered associations between metabolic substances and obesity, demonstrating the ability to infer disease risk for future patients without requiring direct outcome labels. These findings highlight the potential of leveraging external bio-banks to extract valuable biomedical insights, even in data-limited scenarios. Our results demonstrate that machine learning models trained on smaller datasets can still be used to uncover real biological associations when carefully integrated with survival analysis and genetic studies.


Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics

James, Tamsin, Williamson, Ben, Tino, Peter, Wheeler, Nicole

arXiv.org Artificial Intelligence

The goal of bacterial genome-wide association studies (bGWAS) is to identify genetic variants that influence a trait or phenotype ([31]). These studies traditionally employ statistical methods to perform population genomic analyses to yield a list of candidate genes or genetic markers associated with a phenotype, and have been a significant contributor to uncovering numerous genetic loci that are causally related to a phenotype, e.g., resistance to an antibiotic ([8, 15, 19, 10, 4]). Improvements in whole-genome sequencing techniques have led to the generation of increasing amounts of data, creating an impracticality surrounding functional investigations of all loci individually. However, this up-scaling has led to the prediction of a greater number of significantly associated loci despite efforts to minimize false discovery rate. Machine learning (ML) algorithms are an obvious successor to bGWAS that may more effectively find signal in genetic noise. To date, existing algorithms have been applied to the data with little to no adaptation ([34, 26, 9, 33]). Researchers are finding that these ML models fail to reliably generalize to out-of-distribution examples ([7], [14]), and frequently identify false positive associations ([26]). In addition, they have found that removing all known causal variables from a model does not meaningfully impact model accuracy ([25]).


Interpreting artificial neural networks to detect genome-wide association signals for complex traits

Yelmen, Burak, Alver, Maris, Team, Estonian Biobank Research, Jay, Flora, Milani, Lili

arXiv.org Artificial Intelligence

Investigating the genetic architecture of complex diseases is challenging due to the highly polygenic and interactive landscape of genetic and environmental factors. Although genome-wide association studies (GWAS) have identified thousands of variants for multiple complex phenotypes, conventional statistical approaches can be limited by simplified assumptions such as linearity and lack of epistasis models. In this work, we trained artificial neural networks for predicting complex traits using both simulated and real genotype/phenotype datasets. We extracted feature importance scores via different post hoc interpretability methods to identify potentially associated loci (PAL) for the target phenotype. Simulations we performed with various parameters demonstrated that associated loci can be detected with good precision using strict selection criteria, but downstream analyses are required for fine-mapping the exact variants due to linkage disequilibrium, similarly to conventional GWAS. By applying our approach to the schizophrenia cohort in the Estonian Biobank, we were able to detect multiple PAL related to this highly polygenic and heritable disorder. We also performed enrichment analyses with PAL in genic regions, which predominantly identified terms associated with brain morphology. With further improvements in model optimization and confidence measures, artificial neural networks can enhance the identification of genomic loci associated with complex diseases, providing a more comprehensive approach for GWAS and serving as initial screening tools for subsequent functional studies.



Evaluating unsupervised disentangled representation learning for genomic discovery and disease risk prediction

Yun, Taedong

arXiv.org Artificial Intelligence

High-dimensional clinical data have become invaluable resources for genetic studies, due to their accessibility in biobank-scale datasets and the development of high performance modeling techniques especially using deep learning. Recent work has shown that low dimensional embeddings of these clinical data learned by variational autoencoders (VAE) can be used for genome-wide association studies and polygenic risk prediction. In this work, we consider multiple unsupervised learning methods for learning disentangled representations, namely autoencoders, VAE, beta-VAE, and FactorVAE, in the context of genetic association studies. Using spirograms from UK Biobank as a running example, we observed improvements in the number of genome-wide significant loci, heritability, and performance of polygenic risk scores for asthma and chronic obstructive pulmonary disease by using FactorVAE or beta-VAE, compared to standard VAE or non-variational autoencoders. FactorVAEs performed effectively across multiple values of the regularization hyperparameter, while beta-VAEs were much more sensitive to the hyperparameter values.
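For reference, the regularization hyperparameter of a beta-VAE is the weight $\beta$ on the KL term of the standard VAE objective. The form below is the commonly cited beta-VAE objective, written in generic notation (encoder $q_\phi$, decoder $p_\theta$, prior $p(z)$) that is assumed here rather than taken from the abstract:

$$\mathcal{L}_{\beta}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big)$$

Setting $\beta = 1$ recovers the standard VAE objective. FactorVAE instead keeps the standard VAE objective and adds a separately weighted penalty on the total correlation of the latent code.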