Jaakkola, Tommi
Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms
Jing, Bowen, Jaakkola, Tommi, Berger, Bonnie
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely used Vina and Gnina scoring functions, and is more robust on computationally predicted structures.

Proteins are the macromolecular machines that drive almost all biological processes, and much of early-stage drug discovery focuses on finding molecules which bind to and modulate their activity. Molecular docking--the computational task of predicting the binding pose of a small molecule to a protein target--is an important step in this pipeline. Traditionally, molecular docking has been formulated as an optimization problem over a scoring function designed to be a computational proxy for the free energy (Torres et al., 2019; Fan et al., 2019). Such scoring functions are typically a sum of pairwise interaction terms between atoms with physically inspired functional forms and empirically tuned weights (Quiroga & Villarreal, 2016). While these terms are simple and hence fast to evaluate, exhaustive sampling or optimization over the space of ligand poses is difficult and is responsible for the significant runtime of docking software.
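As a toy illustration of why this functional form helps, the cross-correlation of two fields over every rigid translation can be evaluated at once with FFTs. The sketch below uses a single-channel scalar field on a periodic grid (the paper uses multi-channel fields produced by equivariant networks, and also optimizes over rotations); the function name and grid sizes are illustrative.

```python
import numpy as np

def fft_translation_scores(protein_field, ligand_field):
    """Score every rigid translation of the ligand field against the
    protein field at once via the correlation theorem:
    scores[t] = sum_x protein(x) * ligand(x + t), circular boundary."""
    F = np.fft.fftn(protein_field)
    G = np.fft.fftn(ligand_field, s=protein_field.shape)
    return np.real(np.fft.ifftn(np.conj(F) * G))

# Toy example: the ligand field is a circularly shifted copy of the
# protein field, so the known shift should be the top-scoring translation.
rng = np.random.default_rng(0)
protein = rng.normal(size=(16, 16, 16))
shift = (3, 5, 2)
ligand = np.roll(protein, shift, axis=(0, 1, 2))
scores = fft_translation_scores(protein, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)
```

The point of the sketch is the cost profile: all 16^3 translational scores come from three FFTs instead of 16^3 separate pose evaluations.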
Fast non-autoregressive inverse folding with discrete diffusion
Yang, John J., Yim, Jason, Barzilay, Regina, Jaakkola, Tommi
Generating protein sequences that fold into an intended 3D structure is a fundamental step in de novo protein design. The de facto standard methods use autoregressive generation, but this eschews higher-order interactions that could be exploited to improve inference speed. We describe a non-autoregressive alternative that performs inference using a constant number of model calls, resulting in a 23-fold speedup without a loss in performance on the CATH benchmark. Conditioned on the 3D structure, we fine-tune ProteinMPNN to perform discrete diffusion with a purity prior over the index sampling order. Our approach offers the flexibility to trade off inference speed and accuracy by modulating the diffusion speed.
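A minimal sketch of the decoding pattern this describes: sequence positions are committed in order of model confidence rather than left to right, with a number of rounds that is fixed regardless of sequence length. This stand-in reuses one fixed matrix of per-position probabilities instead of re-running a network each round, and all names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def purity_order_decode(logits, num_rounds=4):
    """Non-autoregressive decoding sketch: in each round, commit the most
    confident still-masked positions (highest max-probability, a stand-in
    for the purity-ordered index sampling). The number of rounds, and
    hence model calls, is independent of sequence length."""
    L, V = logits.shape
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    seq = [None] * L
    masked = set(range(L))
    per_round = -(-L // num_rounds)  # ceil(L / num_rounds)
    for _ in range(num_rounds):
        if not masked:
            break
        # A real model would be re-run here on the partially decoded
        # sequence; this sketch reuses the fixed probabilities.
        conf = {i: probs[i].max() for i in masked}
        chosen = sorted(conf, key=conf.get, reverse=True)[:per_round]
        for i in chosen:
            seq[i] = AMINO_ACIDS[int(probs[i].argmax())]
            masked.discard(i)
    return "".join(seq)

rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 20))
decoded = purity_order_decode(logits)
```

With fixed probabilities the order does not change the outcome; in the actual method, re-running the network on partially committed sequences is what lets high-confidence positions inform the rest.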
Risk-Controlling Model Selection via Guided Bayesian Optimization
Laufer-Goldshtein, Bracha, Fisch, Adam, Barzilay, Regina, Jaakkola, Tommi
Our goal in this paper is to find a configuration that adheres to user-specified limits on certain risks while being useful with respect to other conflicting metrics. We solve this by combining Bayesian Optimization (BO) with rigorous risk-controlling procedures, where our core idea is to steer BO towards an efficient testing strategy. Our BO method identifies a set of Pareto optimal configurations residing in a designated region of interest. The resulting candidates are statistically verified, and the best-performing configuration is selected with guaranteed risk levels. We demonstrate the effectiveness of our approach on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational costs.

Deploying machine learning models in the real world requires balancing different performance aspects such as low error rate, equality in predictive decisions (Hardt et al., 2016; Pessach & Shmueli, 2022), robustness to spurious correlations (Sagawa et al., 2019; Yang et al., 2023), and model efficiency (Laskaridis et al., 2021; Menghani, 2023). In many cases, we can influence the model's behavior favorably via sets of hyperparameters that determine the model configuration. However, selecting a configuration that exactly meets user-defined requirements on test data is typically non-trivial, especially when considering a large number of objectives and configurations that are costly to assess (e.g., that require retraining large neural networks for new settings). Bayesian Optimization (BO) is widely used for efficiently selecting configurations of functions that require expensive evaluation, such as hyperparameters that govern the model architecture or influence the training procedure (Shahriari et al., 2015; Wang et al., 2022; Bischl et al., 2023). The basic concept is to substitute the costly function of interest with a cheap, easily optimized probabilistic surrogate model. This surrogate is used to select promising candidate configurations while balancing exploration and exploitation.
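The surrogate-plus-acquisition loop just described can be sketched in a few dozen lines: a numpy-only Gaussian-process surrogate with an RBF kernel and an expected-improvement acquisition, minimizing a cheap 1D stand-in for the expensive objective. This is a generic BO sketch under assumed hyperparameters (length scale, grid, budget), not the paper's risk-controlling procedure.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    """RBF kernel matrix between 1D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and stddev at query points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 1e-12, None))

def expected_improvement(mu, sigma, best):
    """EI for minimization: expected amount by which a query beats `best`."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1 + np.array([erf(v / sqrt(2)) for v in z]))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mu) * cdf + sigma * pdf

f = lambda x: np.sin(3 * x) + 0.5 * x      # stand-in "expensive" objective
X = np.array([0.1, 0.5, 0.9]); y = f(X)    # initial evaluations
grid = np.linspace(0, 2, 200)
for _ in range(10):                        # BO loop: fit, acquire, evaluate
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.append(X, x_next); y = np.append(y, f(x_next))
```

The acquisition step is where the exploration/exploitation balance lives: EI is high both where the mean is low and where the surrogate is uncertain.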
Removing Biases from Molecular Representations via Information Maximization
Wang, Chenyu, Gupta, Sharut, Uhler, Caroline, Jaakkola, Tommi
High-throughput drug screening - using cell imaging or gene expression measurements as readouts of drug effect - is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier, and adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks, including molecular property prediction and molecule-phenotype retrieval. Additionally, we show how InfoCORE offers a versatile framework for resolving general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes.

Representation learning (Bengio et al., 2013) has become pivotal in drug discovery (Wu et al., 2018) and understanding biological systems (Yang et al., 2021b). It serves as a pillar for recognizing drug mechanisms, predicting a drug's activity and toxicity, and identifying disease-associated chemical structures. A central challenge in this context is to accurately capture the nuanced relationship between the chemical structure of a small molecule and its biological or physical attributes. Most molecular representation learning methods only encode a molecule's chemical identity and hence provide unimodal representations (Wang et al., 2022; Xu et al., 2021b). A limitation of such techniques is that molecules with similar structures can have very different effects in the cellular context.
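To make the reweighting ingredient concrete, here is a hedged sketch: given a classifier's posterior over batch labels computed from each sample's representation, samples whose representations give away their batch are down-weighted so that the implied batch distribution is equalized. This illustrates only the importance-weighting idea, not the full variational InfoCORE objective; the function name is illustrative.

```python
import numpy as np

def batch_equalizing_weights(batch_posteriors, batch_ids):
    """Weigh each sample by p(batch) / p(batch | z). Samples whose
    representation z strongly reveals their true batch get small weight,
    pushing the weighted batch distribution toward the prior."""
    n = len(batch_ids)
    prior = np.bincount(batch_ids) / n                  # empirical p(batch)
    post = batch_posteriors[np.arange(n), batch_ids]    # p(batch_i | z_i)
    w = prior[batch_ids] / np.clip(post, 1e-8, None)
    return w / w.sum()

# Toy example with two batches: rows are p(batch | z) for each sample.
post = np.array([[0.9, 0.1],   # batch 0, representation leaks the batch
                 [0.5, 0.5],   # batch 0, representation uninformative
                 [0.2, 0.8],   # batch 1, leaks the batch
                 [0.6, 0.4]])  # batch 1, points the wrong way
ids = np.array([0, 0, 1, 1])
w = batch_equalizing_weights(post, ids)
```

Under this scheme the second and fourth samples, whose representations carry little or misleading batch signal, receive the largest weights.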
Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models
Corso, Gabriele, Xu, Yilun, de Bortoli, Valentin, Barzilay, Regina, Jaakkola, Tommi
In light of the widespread success of generative models, a significant amount of research has gone into speeding up their sampling time. However, generative models are often sampled multiple times to obtain a diverse set of samples, incurring a cost that is orthogonal to sampling time. We tackle the question of how to improve diversity and sample efficiency by moving beyond the common assumption of independent samples. We propose particle guidance, an extension of diffusion-based generative sampling in which a joint-particle time-evolving potential enforces diversity. We analyze theoretically the joint distribution that particle guidance generates, how to learn a potential that achieves optimal diversity, and the connections with methods in other disciplines. Empirically, we test the framework both in conditional image generation, where we are able to increase diversity without affecting quality, and in molecular conformer generation, where we reduce the state-of-the-art median error by 13% on average.

Deep generative modeling has become pervasive in many computational tasks across computer vision, natural language processing, the physical sciences, and beyond. In many applications, these models are used to take a number of representative samples of some distribution of interest, such as paintings in Van Gogh's style or the 3D conformers of a small molecule. Although independent samples drawn from a distribution will perfectly represent it in the limit of infinite samples, for a finite number this may not be the optimal strategy. Therefore, while deep learning methods have so far largely focused on taking independent, identically distributed (I.I.D.) samples from some distribution, this paper examines how one can use deep generative models to take a finite number of samples that better represent the distribution of interest. In other fields where finite-sample approximations are critical, researchers have developed various techniques to tackle this challenge.
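The core mechanism can be sketched in a toy setting: a repulsive joint potential over the whole set of particles is differentiated, and its gradient is added to each particle's drift during sampling, pushing the set apart relative to independent sampling. The example below uses Langevin dynamics on a 2D Gaussian with an RBF repulsion term; it illustrates the idea only, not the paper's diffusion-time-dependent potential, and the kernel bandwidth and guidance strength are arbitrary choices.

```python
import numpy as np

def rbf_repulsion_grad(x, bandwidth=1.0):
    """Per-particle gradient of log Phi(x_1..x_n) = -sum_{i<j} k(x_i, x_j)
    with an RBF kernel k: nearby particles repel each other."""
    diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j
    k = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))
    return (k[:, :, None] * diff).sum(1) / bandwidth ** 2

def langevin_sample(n, steps=500, step=0.01, guidance=0.0, seed=0):
    """Langevin sampling of N(0, I) in 2D; with guidance > 0, a joint
    repulsive term is added to every particle's drift."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2)) * 3
    for _ in range(steps):
        drift = -x + guidance * rbf_repulsion_grad(x)  # score of N(0,I) is -x
        x = x + step * drift + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

def mean_pairwise_dist(x):
    d = np.sqrt(((x[:, None] - x[None, :]) ** 2).sum(-1))
    return d[np.triu_indices(len(x), 1)].mean()

iid = np.mean([mean_pairwise_dist(langevin_sample(8, seed=s))
               for s in range(5)])
guided = np.mean([mean_pairwise_dist(langevin_sample(8, guidance=2.0, seed=s))
                  for s in range(5)])
```

Because the potential couples the particles, the guided set is no longer I.I.D.: its mean pairwise distance is larger than that of independently drawn samples.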
Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems
Zhang, Xuan, Wang, Limei, Helwig, Jacob, Luo, Youzhi, Fu, Cong, Xie, Yaochen, Liu, Meng, Lin, Yuchao, Xu, Zhao, Yan, Keqiang, Adams, Keir, Weiler, Maurice, Li, Xiner, Fu, Tianfan, Wang, Yucheng, Yu, Haiyang, Xie, YuQing, Fu, Xiang, Strasser, Alex, Xu, Shenglong, Liu, Yi, Du, Yuanqi, Saxton, Alexandra, Ling, Hongyi, Lawrence, Hannah, Stärk, Hannes, Gui, Shurui, Edwards, Carl, Gao, Nicholas, Ladera, Adriana, Wu, Tailin, Hofgard, Elyssa F., Tehrani, Aria Mansouri, Wang, Rui, Daigavane, Ameya, Bohde, Montgomery, Kurtin, Jerry, Huang, Qian, Phung, Tuong, Xu, Minkai, Joshi, Chaitanya K., Mathis, Simon V., Azizzadenesheli, Kamyar, Fang, Ada, Aspuru-Guzik, Alán, Bekkers, Erik, Bronstein, Michael, Zitnik, Marinka, Anandkumar, Anima, Ermon, Stefano, Liò, Pietro, Yu, Rose, Günnemann, Stephan, Leskovec, Jure, Ji, Heng, Sun, Jimeng, Barzilay, Regina, Jaakkola, Tommi, Coley, Connor W., Qian, Xiaoning, Qian, Xiaofeng, Smidt, Tess, Ji, Shuiwang
Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural sciences. Today, AI has started to advance natural sciences by improving, accelerating, and enabling our understanding of natural phenomena at a wide range of spatial and temporal scales, giving rise to a new area of research known as AI for science (AI4Science). Being an emerging research paradigm, AI4Science is unique in that it is an enormous and highly interdisciplinary area. Thus, a unified and technical treatment of this field is needed yet challenging. This work aims to provide a technically thorough account of a subarea of AI4Science; namely, AI for quantum, atomistic, and continuum systems. These areas aim at understanding the physical world from the subatomic (wavefunctions and electron density), atomic (molecules, proteins, materials, and interactions), to macro (fluids, climate, and subsurface) scales and form an important subarea of AI4Science. A unique advantage of focusing on these areas is that they largely share a common set of challenges, thereby allowing a unified and foundational treatment. A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods. We provide an in-depth yet intuitive account of techniques to achieve equivariance to symmetry transformations. We also discuss other common technical challenges, including explainability, out-of-distribution generalization, knowledge transfer with foundation and large language models, and uncertainty quantification. To facilitate learning and education, we provide categorized lists of resources that we found to be useful. We strive to be thorough and unified and hope this initial effort may trigger more community interests and efforts to further advance AI4Science.
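A standard sanity check for the symmetry ideas discussed in this survey is numerical: apply a random rotation to the input and compare outputs. The sketch below does this for two simple maps on a point cloud (pairwise distances are rotation-invariant; the centroid is rotation-equivariant). It is a generic illustration, not tied to any particular architecture covered in the survey.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random d-dimensional rotation via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    q *= np.sign(np.diag(r))       # fix column signs
    if np.linalg.det(q) < 0:       # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

def pairwise_distances(x):
    """A rotation-invariant featurization of a point cloud: f(Rx) == f(x)."""
    return np.sqrt(((x[:, None] - x[None, :]) ** 2).sum(-1))

def centroid(x):
    """A rotation-equivariant map: g(Rx) == R g(x)."""
    return x.mean(0)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))   # 5 points in 3D
R = random_rotation(3)
```

In practice the same check, run on a trained network with many random rotations, is how equivariance claims about a model are verified empirically.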
Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design
Stärk, Hannes, Jing, Bowen, Barzilay, Regina, Jaakkola, Tommi
A significant amount of protein function, including enzymatic catalysis, requires binding small molecules. As such, designing proteins that can bind small molecules has many impactful applications, ranging from drug synthesis to energy storage and gene editing. Indeed, a key part of any protein's function derives from its ability to bind and interact with other molecular species. For example, we may design proteins that act as antidotes that sequester toxins, or design enzymes that enable chemical reactions through catalysis, which plays a major role in most biological processes. Specifically, we aim to design protein pockets to bind a given small molecule (called a ligand). We assume that we are given a protein pocket via the 3D backbone atom locations of its residues, as well as the 2D chemical graph of the ligand. We do not assume any knowledge of the 3D structure or binding pose of the ligand. Based on this information, our goal is to predict the amino acid identities for the given backbone locations (see Figure 1). We also consider the more challenging task of designing pockets that simultaneously bind multiple molecules and ions (which we call a multi-ligand). Such multi-ligand binding proteins are important, for example, in enzyme design, where the ligands correspond to reactants.
Restart Sampling for Improving Generative Processes
Xu, Yilun, Deng, Mingyang, Cheng, Xiang, Tian, Yonglong, Liu, Ziming, Jaakkola, Tommi
Generative processes that involve solving differential equations, such as diffusion models, frequently necessitate balancing speed and quality. ODE-based samplers are fast but plateau in performance, while SDE-based samplers deliver higher sample quality at the cost of increased sampling time. We attribute this difference to sampling errors: ODE samplers involve smaller discretization errors, while stochasticity in SDEs contracts accumulated errors. Based on these findings, we propose a novel sampling algorithm called Restart that better balances discretization errors and contraction. The sampling method alternates between adding substantial noise in additional forward steps and strictly following a backward ODE. Empirically, the Restart sampler surpasses previous SDE and ODE samplers in both speed and accuracy. Restart not only outperforms the previous best SDE results, but also accelerates sampling by 10-fold on CIFAR-10 and 2-fold on ImageNet $64 \times 64$. In addition, it attains significantly better sample quality than ODE samplers within comparable sampling times. Moreover, Restart better balances text-image alignment/visual quality versus diversity than previous samplers in the large-scale text-to-image Stable Diffusion model pre-trained on LAION $512 \times 512$. Code is available at https://github.com/Newbeeer/diffusion_restart_sampling
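The forward-noise/backward-ODE alternation can be sketched on a toy 1D problem where the score is known in closed form: with data ~ N(0, 1) and noise scale sigma(t) = t, the marginal at time t is N(0, 1 + t^2), so the probability-flow ODE is dx/dt = t*x/(1 + t^2). The restart interval, step counts, and the plain Euler integrator (standing in for the paper's Heun steps) are all illustrative choices.

```python
import numpy as np

def euler_ode(x, t_hi, t_lo, n_steps):
    """Integrate the backward probability-flow ODE dx/dt = t*x/(1 + t^2)
    from t_hi down to t_lo with explicit Euler steps."""
    ts = np.linspace(t_hi, t_lo, n_steps + 1)
    for a, b in zip(ts[:-1], ts[1:]):
        x = x + (b - a) * a * x / (1 + a ** 2)
    return x

def restart_sample(n, t_max=5.0, restarts=4, t_lo=0.1, t_hi=1.0, seed=0):
    """Restart sketch: deterministic ODE overall, but inside [t_lo, t_hi]
    alternate between a forward noise jump and the backward ODE."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n) * np.sqrt(1 + t_max ** 2)   # exact sample at t_max
    x = euler_ode(x, t_max, t_hi, 100)
    x = euler_ode(x, t_hi, t_lo, 50)
    for _ in range(restarts):
        # forward jump t_lo -> t_hi: add noise with the variance gap
        x = x + np.sqrt(t_hi ** 2 - t_lo ** 2) * rng.normal(size=n)
        x = euler_ode(x, t_hi, t_lo, 50)               # backward: ODE only
    return euler_ode(x, t_lo, 0.0, 50)

samples = restart_sample(20000)
```

In this linear toy case the target N(0, 1) is recovered up to discretization error; the contraction benefit of the injected noise only shows up when earlier steps have accumulated errors, which is the regime the paper analyzes.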
Learning Interatomic Potentials at Multiple Scales
Fu, Xiang, Musaelian, Albert, Johansson, Anders, Jaakkola, Tommi, Kozinsky, Boris
The need to use a short time step is a key limit on the speed of molecular dynamics (MD) simulations. Simulations governed by classical potentials are often accelerated with a multiple-time-step (MTS) integrator, which evaluates slowly varying potential energy terms less frequently than rapidly varying ones. This approach is enabled by the simple but limiting analytic forms of classical potentials. Machine learning interatomic potentials (MLIPs), in particular recent equivariant neural networks, are much more broadly applicable than classical potentials and can faithfully reproduce the expensive but accurate reference electronic structure calculations used to train them. They still, however, require a single short time step, as they lack the inherent term-by-term scale separation of classical potentials. This work introduces a method to learn a scale separation in complex interatomic interactions by co-training two MLIPs. Initially, a small and efficient model is trained to reproduce short-time-scale interactions. Subsequently, a large and expressive model is trained jointly to capture the remaining interactions not captured by the small model. When running MD, the MTS integrator then evaluates the smaller model at every time step and the larger model less frequently, accelerating the simulation. Compared to a conventionally trained MLIP, our approach can achieve a significant speedup (~3x in our experiments) without a loss of accuracy on the potential energy or simulation-derived quantities.
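The integration pattern described here follows the classic RESPA multiple-time-step scheme, which can be sketched with two toy force terms: a cheap, stiff force evaluated at every inner substep (playing the role of the small model) and an expensive, smooth force applied as impulses only at the outer step boundaries (the large model). The specific forces and constants are illustrative, not from the paper.

```python
import numpy as np

def fast_force(x):
    """Cheap, stiff term: evaluated at every inner substep."""
    return -50.0 * x                 # harmonic, U_fast = 25 * x^2

def slow_force(x):
    """Stand-in for the expensive large model; call count is tracked."""
    slow_force.calls += 1
    return -0.5 * x ** 3             # anharmonic, U_slow = 0.125 * x^4
slow_force.calls = 0

def respa_step(x, v, dt, n_inner):
    """One RESPA update: slow-force half-kicks at the outer boundaries,
    velocity-Verlet substeps of the fast force in between."""
    v += 0.5 * dt * slow_force(x)
    h = dt / n_inner
    for _ in range(n_inner):
        v += 0.5 * h * fast_force(x)
        x += h * v
        v += 0.5 * h * fast_force(x)
    v += 0.5 * dt * slow_force(x)
    return x, v

x, v = 1.0, 0.0
for _ in range(1000):
    x, v = respa_step(x, v, dt=0.05, n_inner=8)
```

Over 1000 outer steps the expensive force is called 2000 times versus 8000 inner substeps of the cheap force, while the symplectic update keeps the total energy bounded.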
MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design
Fu, Xiang, Xie, Tian, Rosen, Andrew S., Jaakkola, Tommi, Smith, Jake
Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture due to their exceptional porosity and tunable chemistry. Their modular nature has enabled the use of template-based methods to generate hypothetical MOFs by combining molecular building blocks in accordance with known network topologies. However, the ability of these methods to identify top-performing MOFs is often hindered by the limited diversity of the resulting chemical space. In this work, we propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures through a denoising diffusion process over the coordinates and identities of the building blocks. The all-atom MOF structure is then determined through a novel assembly algorithm. Equivariant graph neural networks are used for the diffusion model to respect the permutational and roto-translational symmetries. We comprehensively evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials for carbon capture applications with molecular simulations.