eQTL mapping using allele-specific gene expression


Using information from allele-specific gene expression (ASE) can substantially improve the power to map gene expression quantitative trait loci (eQTLs). However, this practice has been limited, partly due to high computational cost and the need to access raw data that can take a large amount of storage space. To address these challenges, we have developed a computational framework that uses a statistical method named TReCASE as its engine and is feasible for large-scale analysis. We applied it to map eQTLs in 28 human tissues using data from the Genotype-Tissue Expression (GTEx) project. Compared with a popular linear regression method that does not use ASE data, TReCASE can double the number of eGenes (i.e., genes with at least one significant eQTL) when the sample size is relatively small, e.g., n = 200.

Machine Learning in Enzyme Engineering


Enzyme engineering is the process of customizing new biocatalysts with improved properties by altering their constituting sequences of amino acids. Despite the immensity of possible alterations, this procedure has already yielded remarkable results in new designs and optimization of enzymes for chemical and pharmaceutical biosynthesis, regenerative medicine, food production, waste biodegradation and biosensing.(1 The two established and widely used enzyme engineering strategies are rational design(5,6) and directed evolution.(7,8) The former approach is based on the structural analysis and in-depth computational modeling of enzymes by accounting for the physicochemical properties of amino acids and simulating their interactions with the environment. The latter approach takes after the natural evolution in using mutagenesis for iterative production of mutant libraries, which are then screened for enzyme variants with the desired properties. These two strategies may naturally complement each other: e.g., site-directed or saturation mutagenesis may be applied on the rationally chosen hotspots.(9)

Common genetic variation influencing human white matter microstructure


The white matter of the brain, which is composed of axonal tracts connecting different brain regions, plays key roles in both normal brain function and a variety of neurological disorders. Zhao et al. combined detailed magnetic resonance imaging–based assessment of brain structures with genetic data on nearly 44,000 individuals (see the Perspective by Filley). On the basis of this comprehensive analysis, the authors identified structural and genetic abnormalities associated with neurological and psychiatric disorders, as well as some nondisease traits, thus creating a valuable resource and providing some insights into the underlying neurobiology.

Science, this issue p. eabf3736 (doi: 10.1126/science.abf3736); see also p. 1265 (doi: 10.1126/science.abj1881)

### INTRODUCTION

White matter in the human brain serves a critical role in organizing distributed neural networks. Diffusion magnetic resonance imaging (dMRI) has enabled the study of white matter in vivo, showing that interindividual variations in white matter microstructure are associated with a wide variety of clinical outcomes. Although white matter differences in general population cohorts are known to be heritable, few common genetic variants influencing white matter microstructure have been identified.

### RATIONALE

To identify genetic variants influencing white matter microstructure, we conducted a genome-wide association study (GWAS) of dMRI data from 43,802 individuals across five data resources. We analyzed five major diffusion tensor imaging (DTI) model–derived parameters along 21 cerebral white matter tracts.

### RESULTS

In the discovery GWAS with 34,024 individuals of British ancestry, we replicated 42 of the 44 genomic regions discovered in the largest previous GWAS and identified 109 additional regions associated with white matter microstructure ($P < 2.3 \times 10^{-10}$, adjusted for the number of phenotypes studied). These results indicate strong polygenic influences on white matter microstructure. Of the 151 regions, 52 passed the Bonferroni significance level ($P < 5 \times 10^{-5}$) in our analysis of nine independent validation datasets, including four with subjects of non-European ancestry. On average, common genetic variants explained 41% (standard error = 2%) of the variation in white matter microstructure. The 151 identified genomic regions can explain 32.3% of heritability for white matter microstructure, whereas the 44 previously identified genomic regions can only explain 11.7% of heritability.

As a biological validation of our GWAS findings, we observed heritability enrichment within regulatory elements active in oligodendrocytes and other glia, whereas no enrichment was observed in neurons. These results are expected and suggest that genetic variation leads to changes in white matter microstructure by affecting gene regulation in glia. We observed genetic correlations and colocalizations of white matter microstructure with a wide range of brain-related complex traits and diseases, such as cognitive functions, cardiovascular risk factors, as well as various neurological and psychiatric diseases. For example, of the 25 reported genetic risk regions of glioma, 11 were also associated with white matter microstructure, which illustrates the close genetic relationship between glioma and white matter integrity. Additionally, we found that 14 white matter microstructure–associated genes ($P < 1.2 \times 10^{-8}$) were targets for 79 commonly used nervous system drugs, such as antipsychotics, antidepressants, anticonvulsants, and drugs for Parkinson’s disease and dementia.

### CONCLUSION

This large-scale study of dMRI scans from 43,802 subjects improved our understanding of the highly polygenic genetic architecture of human brain white matter tracts. We identified 151 genomic regions associated with white matter microstructure. The GWAS findings were supported by enrichments within cell types that make up white matter microstructure. Moreover, we uncovered genetic relationships between white matter and various clinical endpoints, such as stroke, major depressive disorder, schizophrenia, and attention deficit hyperactivity disorder. The targets of many drugs commonly used for disabling cognitive disorders have genetic associations with white matter, which suggests that the neuropharmacology of many disorders can potentially be improved by studying how these medications work in the brain white matter.

Figure: Identifying genetic variants influencing human brain white matter microstructure. (Top left) Quantifying the microstructure in white matter tracts using DTI models. (Bottom left) Genomic locations of common genetic variants associated with white matter microstructure. (Top right) Selected genetic correlations between white matter microstructure and brain disorders (stroke and major depressive disorder). (Bottom right) Partitioned heritability enrichment analysis in brain cell types. FDR, false discovery rate.

Brain regions communicate with each other through tracts of myelinated axons, commonly referred to as white matter. We identified common genetic variants influencing white matter microstructure using diffusion magnetic resonance imaging of 43,802 individuals. Genome-wide association analysis identified 109 associated loci, 30 of which were detected by tract-specific functional principal components analysis. A number of loci colocalized with brain diseases, such as glioma and stroke. Genetic correlations were observed between white matter microstructure and 57 complex traits and diseases. Common variants associated with white matter microstructure altered the function of regulatory elements in glial cells, particularly oligodendrocytes. This large-scale tract-specific study advances the understanding of the genetic architecture of white matter and its genetic links to a wide spectrum of clinical outcomes.

Importance measures derived from random forests: characterisation and extension Machine Learning

New technologies, and especially artificial intelligence, are increasingly established in our society. Big data analysis and machine learning, two sub-fields of artificial intelligence, are at the core of recent breakthroughs in many application fields (e.g., medicine, communication, finance, ...), including some that are closely tied to our day-to-day life (e.g., social networks, computers, smartphones, ...). In machine learning, significant improvements are usually achieved at the price of increasing computational complexity and thanks to bigger datasets. Cutting-edge models built by the most advanced machine learning algorithms have typically become very efficient and profitable but also extremely complex. They are so complex that they are commonly seen as black boxes providing a prediction or a decision that cannot be interpreted or justified. Nevertheless, whether these models are used autonomously or as a simple decision-support tool, they are already being used in applications where health and human life are at stake. It is therefore necessary not to blindly trust the output of those models without a detailed understanding of their predictions or decisions. Accordingly, this thesis aims at improving the interpretability of models built by a specific family of machine learning algorithms, the so-called tree-based methods. Several mechanisms have been proposed to interpret these models, and throughout this thesis we aim to improve their understanding, study their properties, and define their limitations.

PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python Artificial Intelligence

Machine learning is a general-purpose technology holding promise for many interdisciplinary research problems. However, significant barriers exist in crossing disciplinary boundaries when most machine learning tools are developed separately in different areas. We present PyKale, a Python library for knowledge-aware machine learning on graphs, images, texts, and videos, to enable and accelerate interdisciplinary research. We formulate new green machine learning guidelines based on standard software engineering practices and propose a novel pipeline-based application programming interface (API). PyKale focuses on leveraging knowledge from multiple sources for accurate and interpretable prediction, thus supporting multimodal learning and transfer learning (particularly domain adaptation) with the latest deep learning and dimensionality reduction models. We build PyKale on PyTorch and leverage the rich PyTorch ecosystem. Our pipeline-based API design enforces standardization and minimalism, embracing green machine learning concepts by reducing repetition and redundancy, reusing existing resources, and recycling learning models across areas. We demonstrate its interdisciplinary nature via examples in bioinformatics, knowledge graphs, image/video recognition, and medical imaging.

Prototypical Graph Contrastive Learning Artificial Intelligence

Graph-level representations are critical in various real-world applications, such as predicting the properties of molecules. In practice, however, precise graph annotations are generally very expensive and time-consuming to obtain. To address this issue, graph contrastive learning constructs an instance discrimination task that pulls together positive pairs (augmentation pairs of the same graph) and pushes away negative pairs (augmentation pairs of different graphs) for unsupervised representation learning. However, since the negatives for a query are uniformly sampled from all graphs, existing methods suffer from a critical sampling bias issue: the negatives are likely to have the same semantic structure as the query, which degrades performance. To mitigate this sampling bias, we propose a Prototypical Graph Contrastive Learning (PGCL) approach. Specifically, PGCL models the underlying semantic structure of the graph data by clustering semantically similar graphs into the same group, and simultaneously encourages clustering consistency across different augmentations of the same graph. Given a query, it then performs negative sampling by drawing graphs from clusters that differ from the cluster of the query, which ensures a semantic difference between the query and its negative samples. Moreover, for a query, PGCL further reweights its negative samples based on the distance between their prototypes (cluster centroids) and the query prototype, such that negatives with a moderate prototype distance receive relatively large weights. This reweighting strategy is proven to be more effective than uniform sampling. Experimental results on various graph benchmarks demonstrate the advantages of PGCL over state-of-the-art methods.
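The negative-sampling step described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the `sigma` parameter, and the Gaussian-shaped weighting centred on the mean prototype distance are assumptions chosen only to show "negatives from other clusters, with moderate prototype distances weighted most".

```python
import numpy as np

def sample_weighted_negatives(query_idx, assignments, prototypes, sigma=1.0):
    """Return (negative_indices, weights) for one query graph.

    assignments: (n,) cluster id of each graph
    prototypes:  (k, d) cluster centroids
    The weighting function below is a hypothetical choice, not the paper's.
    """
    assignments = np.asarray(assignments)
    prototypes = np.asarray(prototypes, dtype=float)
    q_cluster = assignments[query_idx]

    # Negatives come only from clusters other than the query's cluster.
    neg_idx = np.flatnonzero(assignments != q_cluster)
    if neg_idx.size == 0:
        return neg_idx, np.empty(0)

    # Distance between each negative's prototype and the query's prototype.
    diffs = prototypes[assignments[neg_idx]] - prototypes[q_cluster]
    dists = np.linalg.norm(diffs, axis=1)

    # Assumed weighting: a Gaussian bump centred on the mean distance, so
    # negatives at a moderate prototype distance receive the largest weights.
    weights = np.exp(-((dists - dists.mean()) ** 2) / (2.0 * sigma ** 2))
    return neg_idx, weights / weights.sum()

if __name__ == "__main__":
    assignments = [0, 0, 1, 2, 3]
    prototypes = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
    idx, w = sample_weighted_negatives(0, assignments, prototypes)
    print(idx, w)
```

In the usage example, graph 1 shares the query's cluster and is therefore never returned as a negative, which is the property that addresses the sampling bias described in the abstract.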

Pre-processing with Orthogonal Decompositions for High-dimensional Explanatory Variables Machine Learning

Strong correlations between explanatory variables are problematic for high-dimensional regularized regression methods. Due to the violation of the Irrepresentable Condition, the popular LASSO method may suffer from false inclusions of inactive variables. In this paper, we propose pre-processing with orthogonal decompositions (PROD) for the explanatory variables in high-dimensional regressions. The PROD procedure is constructed based upon a generic orthogonal decomposition of the design matrix. We demonstrate by two concrete cases that the PROD approach can be effectively constructed for improving the performance of high-dimensional penalized regression. Our theoretical analysis reveals their properties and benefits for high-dimensional penalized linear regression with LASSO.
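As a rough illustration of what an orthogonal decomposition of a design matrix can look like, the sketch below splits `X` into its projection onto the leading `k` left singular vectors and the orthogonal remainder. The abstract does not specify the decomposition used by PROD, so the SVD-based projection, the function name, and the choice of `k` are assumptions made for illustration only.

```python
import numpy as np

def orthogonal_split(X, k):
    """Split X into two mutually orthogonal parts, X = common + residual.

    common is the projection of X onto its k leading left singular vectors;
    residual is what remains. This is one possible decomposition (assumed).
    """
    X = np.asarray(X, dtype=float)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Uk = U[:, :k]                 # orthonormal basis of the leading subspace
    common = Uk @ (Uk.T @ X)      # P X with P = Uk Uk^T
    residual = X - common         # (I - P) X
    return common, residual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    factor = rng.normal(size=(50, 1))
    # Columns share one latent factor, so they are strongly correlated.
    X = factor @ np.ones((1, 10)) + 0.1 * rng.normal(size=(50, 10))
    common, residual = orthogonal_split(X, k=1)
    print(np.abs(np.corrcoef(residual, rowvar=False)).mean())
```

A penalized regression could then be fitted on the residual part, whose columns are typically much less correlated than those of `X` when the correlation is driven by a few shared factors.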

Directed Graph Embeddings in Pseudo-Riemannian Manifolds Machine Learning

The inductive biases of graph representation learning algorithms are often encoded in the background geometry of their embedding space. In this paper, we show that general directed graphs can be effectively represented by an embedding model that combines three components: a pseudo-Riemannian metric structure, a non-trivial global topology, and a unique likelihood function that explicitly incorporates a preferred direction in embedding space. We demonstrate the representational capabilities of this method by applying it to the task of link prediction on a series of synthetic and real directed graphs from natural language applications and biology. In particular, we show that low-dimensional cylindrical Minkowski and anti-de Sitter spacetimes can produce equal or better graph representations than curved Riemannian manifolds of higher dimensions.
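To make the role of a pseudo-Riemannian metric concrete, the following sketch computes the squared interval in flat Minkowski space with a single time coordinate. It does not reproduce the paper's likelihood function or its cylindrical topology; the function names and the (-, +, ..., +) sign convention are assumptions.

```python
import numpy as np

def minkowski_interval(x, y):
    """Squared interval between points x and y.

    Coordinate 0 is treated as time; the signature (-, +, ..., +) is an
    assumed convention.
    """
    d = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    return float(-d[0] ** 2 + np.sum(d[1:] ** 2))

def separation(x, y, tol=1e-12):
    """Classify the separation of two points by the sign of the interval."""
    s2 = minkowski_interval(x, y)
    if s2 < -tol:
        return "timelike"
    if s2 > tol:
        return "spacelike"
    return "lightlike"

if __name__ == "__main__":
    origin = [0.0, 0.0]
    print(minkowski_interval(origin, [2.0, 1.0]), separation(origin, [2.0, 1.0]))
    print(minkowski_interval(origin, [1.0, 2.0]), separation(origin, [1.0, 2.0]))
```

Unlike a Riemannian distance, the squared interval can be negative, zero, or positive, and the sign of the time difference distinguishes "before" from "after". This is the kind of structure that lets an embedding encode a preferred direction for directed edges.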

Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness Machine Learning

Learning meaningful representations of data that can address challenges such as batch effect correction, data integration and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we identify the mathematical principle that unites these challenges: learning a representation that is marginally independent of a condition variable. We therefore propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty to enforce this independence. This penalty is defined in terms of mixtures of the variational posteriors themselves, unlike prior work which uses external discrepancy measures such as MMD to ensure independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches, especially when there is complex global structure in latent space. We further demonstrate state-of-the-art performance on a number of real-world problems, including the challenging tasks of aligning human tumour samples with cancer cell lines and performing counterfactual inference on single-cell RNA sequencing data. Incidentally, we find parallels with the fair representation learning literature, and demonstrate that CoMP has competitive performance in learning fair yet expressive latent representations.

Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data Machine Learning

We present a new non-negative matrix factorization model for $(0,1)$ bounded-support data based on the doubly non-central beta (DNCB) distribution, a generalization of the beta distribution. The expressiveness of the DNCB distribution is particularly useful for modeling DNA methylation datasets, which are typically highly dispersed and multi-modal; however, the model structure is sufficiently general that it can be adapted to many other domains where latent representations of $(0,1)$ bounded-support data are of interest. Although the DNCB distribution lacks a closed-form conjugate prior, several augmentations let us derive an efficient posterior inference algorithm composed entirely of analytic updates. Our model improves out-of-sample predictive performance on both real and synthetic DNA methylation datasets over state-of-the-art methods in bioinformatics. In addition, our model yields meaningful latent representations that accord with existing biological knowledge.
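The DNCB distribution admits a Poisson-mixture representation: conditional on two Poisson counts, the variable is an ordinary beta. The sketch below uses that representation to draw samples. The function name and the parameter names are hypothetical, and the Poisson rates are taken to be the non-centrality parameters themselves, which is an assumed convention (some references use half the non-centrality as the rate). It illustrates the distribution only, not the paper's factorization model or inference algorithm.

```python
import numpy as np

def sample_dncb(eps1, eps2, lam1, lam2, size, rng=None):
    """Draw samples from a doubly non-central beta distribution.

    eps1, eps2: positive shape parameters
    lam1, lam2: non-negative non-centrality parameters, used directly as
                Poisson rates (assumed convention)
    rng:        None, an integer seed, or a numpy Generator
    """
    rng = np.random.default_rng(rng)
    y1 = rng.poisson(lam1, size=size)      # latent count shifting the first shape
    y2 = rng.poisson(lam2, size=size)      # latent count shifting the second shape
    return rng.beta(eps1 + y1, eps2 + y2)  # beta draw given the latent counts

if __name__ == "__main__":
    x = sample_dncb(0.5, 0.5, 3.0, 1.0, size=5, rng=0)
    print(x)
```

Setting both non-centrality parameters to zero recovers the standard beta distribution, while unequal values shift mass toward 0 or 1.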