Deng, Hong-Wen
SGUQ: Staged Graph Convolution Neural Network for Alzheimer's Disease Diagnosis using Multi-Omics Data
Tao, Liang, Xie, Yixin, Deng, Jeffrey D, Shen, Hui, Deng, Hong-Wen, Zhou, Weihua, Zhao, Chen
Alzheimer's disease (AD) is a chronic neurodegenerative disorder and the leading cause of dementia, significantly impacting cost, mortality, and burden worldwide. The advent of high-throughput omics technologies, such as genomics, transcriptomics, proteomics, and epigenomics, has revolutionized the molecular understanding of AD. Conventional AI approaches typically require the completion of all omics data at the outset to achieve optimal AD diagnosis, which are inefficient and may be unnecessary. To reduce the clinical cost and improve the accuracy of AD diagnosis using multi-omics data, we propose a novel staged graph convolutional network with uncertainty quantification (SGUQ). SGUQ begins with mRNA and progressively incorporates DNA methylation and miRNA data only when necessary, reducing overall costs and exposure to harmful tests. Experimental results indicate that 46.23% of the samples can be reliably predicted using only single-modal omics data (mRNA), while an additional 16.04% of the samples can achieve reliable predictions when combining two omics data types (mRNA + DNA methylation). In addition, the proposed staged SGUQ achieved an accuracy of 0.858 on ROSMAP dataset, which outperformed existing methods significantly. The proposed SGUQ can not only be applied to AD diagnosis using multi-omics data but also has the potential for clinical decision-making using multi-viewed data. Our implementation is publicly available at https://github.com/chenzhao2023/multiomicsuncertainty.
A Staged Approach using Machine Learning and Uncertainty Quantification to Predict the Risk of Hip Fracture
Shaik, Anjum, Larsen, Kristoffer, Lane, Nancy E., Zhao, Chen, Su, Kuan-Jui, Keyak, Joyce H., Tian, Qing, Sha, Qiuying, Shen, Hui, Deng, Hong-Wen, Zhou, Weihua
Center for Biocomputing and Digital Health, Institute of Computing and Cybersystems, and Health Research Institute, Michigan Technological University, Houghton, MI 49931 # Anjum Shaik and Kristoffer Larsen contribute equally. Abstract Page ABSTRACT Hip fractures present a significant healthcare challenge, especially within aging populations, where they are often caused by falls. These fractures lead to substantial morbidity and mortality, emphasizing the need for timely surgical intervention. Despite advancements in medical care, hip fractures impose a significant burden on individuals and healthcare systems. This paper focuses on the prediction of hip fracture risk in older and middle-aged adults, where falls and compromised bone quality are predominant factors. We propose a novel staged model that combines advanced imaging and clinical data to improve predictive performance. By using convolutional neural networks (CNNs) to extract features from hip DXA images, along with clinical variables, shape measurements, and texture features, our method provides a comprehensive framework for assessing fracture risk. The study cohort included 547 patients, with 94 experiencing hip fracture. A staged machine learning-based model was developed using two ensemble models: Ensemble 1 (clinical variables only) and Ensemble 2 (clinical variables and DXA imaging features). This staged approach used uncertainty quantification from Ensemble 1 to decide if DXA features are necessary for further prediction. Ensemble 2 exhibited the highest performance, achieving an Area Under the Curve (AUC) of 0.9541, an accuracy of 0.9195, a sensitivity of 0.8078, and a specificity of 0.9427.
A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds
Orlichenko, Anton, Qu, Gang, Zhou, Ziyu, Liu, Anqi, Deng, Hong-Wen, Ding, Zhengming, Stephen, Julia M., Wilson, Tony W., Calhoun, Vince D., Wang, Yu-Ping
Objective: fMRI and derived measures such as functional connectivity (FC) have been used to predict brain age, general fluid intelligence, psychiatric disease status, and preclinical neurodegenerative disease. However, it is not always clear that all demographic confounds, such as age, sex, and race, have been removed from fMRI data. Additionally, many fMRI datasets are restricted to authorized researchers, making dissemination of these valuable data sources challenging. Methods: We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics and generate high-quality synthetic fMRI data based on user-supplied demographics. We train and validate our model using two large, widely used datasets, the Philadelphia Neurodevelopmental Cohort (PNC) and Bipolar and Schizophrenia Network for Intermediate Phenotypes (BSNIP). Results: We find that DemoVAE recapitulates group differences in fMRI data while capturing the full breadth of individual variations. Significantly, we also find that most clinical and computerized battery fields that are correlated with fMRI data are not correlated with DemoVAE latents. An exception are several fields related to schizophrenia medication and symptom severity. Conclusion: Our model generates fMRI data that captures the full distribution of FC better than traditional VAE or GAN models. We also find that most prediction using fMRI data is dependent on correlation with, and prediction of, demographics. Significance: Our DemoVAE model allows for generation of high quality synthetic data conditioned on subject demographics as well as the removal of the confounding effects of demographics. We identify that FC-based prediction tasks are highly influenced by demographic confounds.
Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics
Zhao, Chen, Su, Kuan-Jui, Wu, Chong, Cao, Xuewei, Sha, Qiuying, Li, Wu, Luo, Zhe, Qin, Tian, Qiu, Chuan, Zhao, Lan Juan, Liu, Anqi, Jiang, Lindong, Zhang, Xiao, Shen, Hui, Zhou, Weihua, Deng, Hong-Wen
Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenetic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolites derived burden scores, PGS and LD-pruned SNPs, the proposed methods achieved r2-scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.
CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data
Zhao, Chen, Liu, Anqi, Zhang, Xiao, Cao, Xuewei, Ding, Zhengming, Sha, Qiuying, Shen, Hui, Deng, Hong-Wen, Zhou, Weihua
Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data. Each omics technique only provides a limited view of the underlying biological process and integrating heterogeneous omics layers simultaneously would lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost. Studies may fail if certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Utilizing complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn the feature representation across different types of biological data. The multi-omics contrastive learning, which is used to maximize the mutual information between different types of omics, is employed before latent feature concatenation. In addition, the feature-level self-attention and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The experimental results indicated that the proposed CLCLSA outperformed the state-of-the-art approaches for multi-omics data classification using incomplete multi-omics data.
Multi-view information fusion using multi-view variational autoencoders to predict proximal femoral strength
Zhao, Chen, Keyak, Joyce H, Cao, Xuewei, Sha, Qiuying, Wu, Li, Luo, Zhe, Zhao, Lanjuan, Tian, Qing, Qiu, Chuan, Su, Ray, Shen, Hui, Deng, Hong-Wen, Zhou, Weihua
The aim of this paper is to design a deep learning-based model to predict proximal femoral strength using multi-view information fusion. Method: We developed new models using multi-view variational autoencoder (MVAE) for feature representation learning and a product of expert (PoE) model for multi-view information fusion. We applied the proposed models to an in-house Louisiana Osteoporosis Study (LOS) cohort with 931 male subjects, including 345 African Americans and 586 Caucasians. With an analytical solution of the product of Gaussian distribution, we adopted variational inference to train the designed MVAE-PoE model to perform common latent feature extraction. We performed genome-wide association studies (GWAS) to select 256 genetic variants with the lowest p-values for each proximal femoral strength and integrated whole genome sequence (WGS) features and DXA-derived imaging features to predict proximal femoral strength. Results: The best prediction model for fall fracture load was acquired by integrating WGS features and DXA-derived imaging features. The designed models achieved the mean absolute percentage error of 18.04%, 6.84% and 7.95% for predicting proximal femoral fracture loads using linear models of fall loading, nonlinear models of fall loading, and nonlinear models of stance loading, respectively. Compared to existing multi-view information fusion methods, the proposed MVAE-PoE achieved the best performance. Conclusion: The proposed models are capable of predicting proximal femoral strength using WGS features and DXA-derived imaging features. Though this tool is not a substitute for FEA using QCT images, it would make improved assessment of hip fracture risk more widely available while avoiding the increased radiation dosage and clinical costs from QCT.
A New Hip Fracture Risk Index Derived from FEA-Computed Proximal Femur Fracture Loads and Energies-to-Failure
Cao, Xuewei, Keyak, Joyce H, Sigurdsson, Sigurdur, Zhao, Chen, Zhou, Weihua, Liu, Anqi, Lang, Thomas, Deng, Hong-Wen, Gudnason, Vilmundur, Sha, Qiuying
Hip fracture risk assessment is an important but challenging task. Quantitative CT-based patient specific finite element analysis (FEA) computes the force (fracture load) to break the proximal femur in a particular loading condition. It provides different structural information about the proximal femur that can influence a subject overall fracture risk. To obtain a more robust measure of fracture risk, we used principal component analysis (PCA) to develop a global FEA computed fracture risk index that incorporates the FEA-computed yield and ultimate failure loads and energies to failure in four loading conditions (single-limb stance and impact from a fall onto the posterior, posterolateral, and lateral aspects of the greater trochanter) of 110 hip fracture subjects and 235 age and sex matched control subjects from the AGES-Reykjavik study. We found that the first PC (PC1) of the FE parameters was the only significant predictor of hip fracture. Using a logistic regression model, we determined if prediction performance for hip fracture using PC1 differed from that using FE parameters combined by stratified random resampling with respect to hip fracture status. The results showed that the average of the area under the receive operating characteristic curve (AUC) using PC1 was always higher than that using all FE parameters combined in the male subjects. The AUC of PC1 and AUC of the FE parameters combined were not significantly different than that in the female subjects or in all subjects
A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery
Alam, Md Ashad, Shen, Hui, Deng, Hong-Wen
Many statistical machine approaches could ultimately highlight novel features of the etiology of complex diseases by analyzing multi-omics data. However, they are sensitive to some deviations in distribution when the observed samples are potentially contaminated with adversarial corrupted outliers (e.g., a fictional data distribution). Likewise, statistical advances lag in supporting comprehensive data-driven analyses of complex multi-omics data integration. We propose a novel non-linear M-estimator-based approach, "robust kernel machine regression (RobKMR)," to improve the robustness of statistical machine regression and the diversity of fictional data to examine the higher-order composite effect of multi-omics datasets. We address a robust kernel-centered Gram matrix to estimate the model parameters accurately. We also propose a robust score test to assess the marginal and joint Hadamard product of features from multi-omics data. We apply our proposed approach to a multi-omics dataset of osteoporosis (OP) from Caucasian females. Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of OP. With solid evidence (p-value = 0.00001), biological validations, network-based analysis, causal inference, and drug repurposing, the selected three triplets ((DKK1, SMTN, DRGX), (MTND5, FASTKD2, CSMD3), (MTND5, COG3, CSMD3)) are significant biomarkers and directly relate to BMD. Overall, the top three selected genes (DKK1, MTND5, FASTKD2) and one gene (SIDT1 at p-value= 0.001) significantly bond with four drugs- Tacrolimus, Ibandronate, Alendronate, and Bazedoxifene out of 30 candidates for drug repurposing in OP. Further, the proposed approach can be applied to any disease model where multi-omics datasets are available.