Deep neural networks with controlled variable selection for the identification of putative causal genetic variants

Kassani, Peyman H., Lu, Fred, Guen, Yann Le, He, Zihuai

Sep-29-2021–arXiv.org Machine Learning

Deep neural networks (DNNs) have been used successfully in many scientific problems for their high prediction accuracy, but their application to genetic studies remains challenging due to their poor interpretability. In this paper, we consider the problem of scalable, robust variable selection in DNNs for the identification of putative causal genetic variants in genome sequencing studies. We identified a pronounced randomness in feature selection in DNNs due to their stochastic nature, which may hinder interpretability and give rise to misleading results. We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merits of the proposed method include: (1) flexible modelling of the non-linear effect of genetic variants to improve statistical power; (2) multiple knockoffs in the input layer to rigorously control the false discovery rate; (3) hierarchical layers to substantially reduce the number of weight parameters and activations to improve computational efficiency; (4) de-randomized feature selection to stabilize identified signals. We evaluated the proposed method in extensive simulation studies and applied it to the analysis of Alzheimer's disease genetics. We showed that the proposed method, when compared to conventional linear and nonlinear methods, can lead to substantially more discoveries.

Introduction

Recent advances in whole genome sequencing (WGS) technology have paved the way for exploring the contribution of common and rare variants in both coding and non-coding regions to the risk of complex traits. Large-scale genome sequencing studies, such as the Trans-Omics for Precision Medicine (TOPMed) study and the Alzheimer's Disease Sequencing Project (ADSP), have collected thousands of samples with directly sequenced whole genomes. Genetic variants or genes with p-values below a threshold are deemed associated. Marginal association tests are well known for their simplicity and effectiveness, but they often identify proxy variants that are only correlated with the true causal variants, and their statistical power can be suboptimal. One obstacle to the widespread application of DNNs to genetic data is their interpretability.
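
To make the idea of knockoff-based false discovery rate control concrete, here is a minimal sketch of the standard single-knockoff "knockoff+" selection rule. It is a simplification for illustration, not the multiple-knockoff procedure the abstract refers to. The function names, the statistic list `w`, and the target level `q` are all hypothetical. The sketch assumes each entry of `w` is the importance of an original feature minus the importance of its knockoff copy, so that large positive values point to real signals.

```python
import math


def knockoff_plus_threshold(w, q):
    """Return the smallest threshold t whose estimated FDP is at most q.

    w: feature statistics (assumed: original importance minus knockoff importance).
    q: target false discovery rate level.
    Returns math.inf when no threshold meets the target.
    """
    # Candidate thresholds are the distinct non-zero magnitudes of w.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # stands in for the number of false positives
        n_pos = sum(1 for x in w if x >= t)   # features that would be selected at t
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return math.inf


def knockoff_select(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


# Hypothetical statistics for ten features.
w_example = [5, 4, 3.5, 3, -0.5, 2, 1, -0.2, 0.1, 6]
selected = knockoff_select(w_example, q=0.25)
```

With these made-up statistics and `q=0.25`, the threshold is 1 and seven features are selected. A stricter level such as `q=0.1` selects nothing, because with so few features the estimated false discovery proportion never drops that low.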

Country:
- North America > United States
  - Michigan (0.04)
  - California > Santa Clara County
    - Stanford (0.04)
    - Palo Alto (0.04)

Genre:
- Research Report
  - New Finding (0.93)
  - Experimental Study (0.88)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area > Neurology
    - Alzheimer's Disease (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Performance Analysis > Accuracy (1.00)
  - Neural Networks > Deep Learning (1.00)