Goto

Collaborating Authors

 Schwab, Patrick


Multi-megabase scale genome interpretation with genetic language models

arXiv.org Artificial Intelligence

Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because its consequences manifest across a wide range of cells, tissues and scales -- spanning from molecular to whole organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.


Efficient Differentiable Discovery of Causal Order

arXiv.org Artificial Intelligence

In the algorithm Intersort, Chevalley et al. (2024) proposed a score-based method to discover the causal order of variables in a Directed Acyclic Graph (DAG) model, leveraging interventional data to outperform existing methods. However, as a score-based method over the permutahedron, Intersort is computationally expensive and non-differentiable, limiting its ability to be utilised in problems involving large-scale datasets, such as those in genomics and climate models, or to be integrated into end-to-end gradient-based learning frameworks. We address this limitation by reformulating Intersort using differentiable sorting and ranking techniques. Our approach enables scalable and differentiable optimization of causal orderings, allowing the continuous score function to be incorporated as a regularizer in downstream tasks. Empirical results demonstrate that causal discovery algorithms benefit significantly from regularizing on the causal order, underscoring the effectiveness of our method. Our work opens the door to efficiently incorporating regularization for causal order into the training of differentiable models and thereby addresses a long-standing limitation of purely associational supervised learning.


Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm

arXiv.org Artificial Intelligence

Targeted and uniform interventions to a system are crucial for unveiling causal relationships. While several methods have been developed to leverage interventional data for causal structure learning, their practical application in real-world scenarios often remains challenging. Recent benchmark studies have highlighted these difficulties, even when large numbers of single-variable intervention samples are available. In this work, we demonstrate, both theoretically and empirically, that such datasets contain a wealth of causal information that can be effectively extracted under realistic assumptions about the data distribution. More specifically, we introduce the notion of interventional faithfulness, which relies on comparisons between the marginal distributions of each variable across observational and interventional settings, and we introduce a score on causal orders. Under this assumption, we are able to prove strong theoretical guarantees on the optimum of our score that also hold for large-scale settings. To empirically verify our theory, we introduce Intersort, an algorithm designed to infer the causal order from datasets containing large numbers of single-variable interventions by approximately optimizing our score. Intersort outperforms baselines (GIES, PC and EASE) on almost all simulated data settings replicating common benchmarks in the field. Our proposed novel approach to modeling interventional datasets thus offers a promising avenue for advancing causal inference, highlighting significant potential for further enhancements under realistic assumptions.


Sample Selection Bias in Machine Learning for Healthcare

arXiv.org Artificial Intelligence

While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited. One critical factor contributing to this restraint is sample selection bias (SSB) which refers to the study population being less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing techniques try to correct the bias by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB's impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming the existing bias correction techniques.


DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design

arXiv.org Artificial Intelligence

The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in future stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing for a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic as well as real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.


CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data

arXiv.org Artificial Intelligence

Causal inference is a vital aspect of multiple scientific disciplines and is routinely applied to high-impact applications such as medicine. However, evaluating the performance of causal inference methods in real-world environments is challenging due to the need for observations under both interventional and control conditions. Traditional evaluations conducted on synthetic datasets do not reflect the performance in real-world systems. To address this, we introduce CausalBench, a benchmark suite for evaluating network inference methods on real-world interventional data from large-scale single-cell perturbation experiments. CausalBench incorporates biologically-motivated performance metrics, including new distribution-based interventional metrics. A systematic evaluation of state-of-the-art causal inference methods using our CausalBench suite highlights how poor scalability of current methods limits performance. Moreover, methods that use interventional information do not outperform those that only use observational data, contrary to what is observed on synthetic benchmarks. Thus, CausalBench opens new avenues in causal network inference research and provides a principled and reliable way to track progress in leveraging real-world interventional data.


FED-CD: Federated Causal Discovery from Interventional and Observational Data

arXiv.org Artificial Intelligence

Existing causal discovery methods typically require the data to be available in a centralized location. However, many practical domains, such as healthcare, limit access to the data gathered by local entities, primarily for privacy and regulatory constraints. To address this, we propose FED-CD, a federated framework for inferring causal structures from distributed datasets containing observational and interventional data. By exchanging updates instead of data samples, FED-CD ensures privacy while enabling decentralized discovery of the underlying directed acyclic graph (DAG). We accommodate scenarios with shared or disjoint intervened covariates, and mitigate the adverse effects of interventional data heterogeneity. We provide empirical evidence for the performance and scalability of FED-CD for decentralized causal discovery using synthetic and real-world DAGs.


Multi-omics Prediction from High-content Cellular Imaging with Deep Learning

arXiv.org Artificial Intelligence

High-content cellular imaging, transcriptomics, and proteomics data provide rich and complementary views on the molecular layers of biology that influence cellular states and function. However, the biological determinants through which changes in multi-omics measurements influence cellular morphology have not yet been systematically explored, and the degree to which cell imaging could potentially enable the prediction of multi-omics directly from cell imaging data is therefore currently unclear. Here, we address the question of whether it is possible to predict bulk multi-omics measurements directly from cell images using Image2Omics -- a deep learning approach that predicts multi-omics in a cell population directly from high-content images stained with multiplexed fluorescent dyes. We perform an experimental evaluation in gene-edited macrophages derived from human induced pluripotent stem cell (hiPSC) under multiple stimulation conditions and demonstrate that Image2Omics achieves significantly better performance in predicting transcriptomics and proteomics measurements directly from cell images than predictors based on the mean observed training set abundance. We observed significant predictability of abundances for 5903 (22.43%; 95% CI: 8.77%, 38.88%) and 5819 (22.11%; 95% CI: 10.40%, 38.08%) transcripts out of 26137 in M1 and M2-stimulated macrophages respectively and for 1933 (38.77%; 95% CI: 36.94%, 39.85%) and 2055 (41.22%; 95% CI: 39.31%, 42.42%) proteins out of 4986 in M1 and M2-stimulated macrophages respectively. Our results show that some transcript and protein abundances are predictable from cell imaging and that cell imaging may potentially, in some settings and depending on the mechanisms of interest and desired performance threshold, even be a scalable and resource-efficient substitute for multi-omics measurements.


Learning Neural Causal Models with Active Interventions

arXiv.org Machine Learning

Discovering causal structures from data is a challenging inference problem of fundamental importance in all areas of science. The appealing scaling properties of neural networks have recently led to a surge of interest in differentiable neural network-based methods for learning causal structures from data. So far differentiable causal discovery has focused on static datasets of observational or interventional origin. In this work, we introduce an active intervention-targeting mechanism which enables a quick identification of the underlying causal structure of the data-generating process. Our method significantly reduces the required number of interactions compared with random intervention targeting and is applicable for both discrete and continuous optimization formulations of learning the underlying directed acyclic graph (DAG) from data. We examine the proposed method across a wide range of settings and demonstrate superior performance on multiple benchmarks from simulated to real-world data. Learning causal structure from data is a challenging but important task that lies at the heart of scientific reasoning and accompanying progress in many disciplines (Sachs et al., 2005; Hill et al., 2016; Lauritzen & Spiegelhalter, 1988; Korb & Nicholson, 2010). While there exists a plethora of methods for the task, computationally and statistically more efficient algorithms are highly desired (Heinze-Deml et al., 2018). As a result, there has been a surge in interest in differentiable structure learning and the combination of deep learning and causal inference (Schölkopf et al., 2021). However, the improvement critically depends on the experiments and interventions available. Despite advances in high-throughput methods for interventional data in specific fields (Dixit et al., 2016), the acquisition of interventional samples in the general settings tends to be costly, technically impossible or even unethical for specific interventions.


NCoRE: Neural Counterfactual Representation Learning for Combinations of Treatments

arXiv.org Machine Learning

Estimating an individual's potential response to interventions from observational data is of high practical relevance for many domains, such as healthcare, public policy or economics. In this setting, it is often the case that combinations of interventions may be applied simultaneously, for example, multiple prescriptions in healthcare or different fiscal and monetary measures in economics. However, existing methods for counterfactual inference are limited to settings in which actions are not used simultaneously. Here, we present Neural Counterfactual Relation Estimation (NCoRE), a new method for learning counterfactual representations in the combination treatment setting that explicitly models cross-treatment interactions. NCoRE is based on a novel branched conditional neural representation that includes learnt treatment interaction modulators to infer the potential causal generative process underlying the combination of multiple treatments. Our experiments show that NCoRE significantly outperforms existing state-of-the-art methods for counterfactual treatment effect estimation that do not account for the effects of combining multiple treatments across several synthetic, semi-synthetic and real-world benchmarks.