Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics

James, Tamsin, Williamson, Ben, Tino, Peter, Wheeler, Nicole

Feb-11-2025–arXiv.org Artificial Intelligence

The goal of bacterial genome-wide association studies (bGWAS) is to identify genetic variants that influence a trait or phenotype ([31]). These studies traditionally apply statistical methods to population genomic data to yield a list of candidate genes or genetic markers associated with a phenotype, and they have contributed significantly to uncovering numerous genetic loci that are causally related to a phenotype, e.g., resistance to an antibiotic ([8, 15, 19, 10, 4]). Improvements in whole-genome sequencing techniques have generated ever larger amounts of data, making it impractical to investigate the function of every locus individually. At the same time, this up-scaling has led to the prediction of a greater number of significantly associated loci, despite efforts to minimize the false discovery rate. Machine learning (ML) algorithms are an obvious successor to bGWAS that may find signal in genetic noise more effectively. To date, existing algorithms have been applied to the data with little to no adaptation ([34, 26, 9, 33]). Researchers are finding that these ML models fail to generalize reliably to out-of-distribution examples ([7], [14]) and frequently identify false positive associations ([26]). In addition, they have found that removing all known causal variables from a model does not meaningfully affect model accuracy ([25]).

artificial intelligence, machine learning, mapping

Country:
- North America > United States (0.14)

Genre:
- Research Report > Experimental Study (0.35)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area > Infections and Infectious Diseases (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.34)