Stärk, Hannes
Training on test proteins improves fitness, structure, and function prediction
Bushuiev, Anton, Bushuiev, Roman, Zadorozhny, Nikola, Samusevich, Raman, Stärk, Hannes, Sedlar, Jiri, Pluskal, Tomáš, Sivic, Josef
Data scarcity and distribution shifts often hinder the ability of machine learning models to generalize when applied to proteins and other biological data. Self-supervised pre-training on large datasets is a common method to enhance generalization. However, striving to perform well on all possible proteins can limit a model's capacity to excel on any specific one, even though practitioners are often most interested in accurate predictions for the individual protein they study. To address this limitation, we propose an orthogonal approach to achieving generalization. Building on the prevalence of self-supervised pre-training, we introduce a method for self-supervised fine-tuning at test time, allowing models to adapt to the test protein of interest on the fly and without requiring any additional data. We study our test-time training (TTT) method through the lens of perplexity minimization and show that it consistently enhances generalization across different models, their scales, and datasets. Notably, our method leads to new state-of-the-art results on the standard benchmark for protein fitness prediction, improves protein structure prediction for challenging targets, and enhances function prediction accuracy.
Think While You Generate: Discrete Diffusion with Planned Denoising
Liu, Sulin, Nam, Juno, Campbell, Andrew, Stärk, Hannes, Xu, Yilun, Jaakkola, Tommi, Gómez-Bombarelli, Rafael
Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8, OpenWebText, and token-based generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at https://github.com/liusulin/DDPD.
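The plan-and-denoise loop can be illustrated with a toy sketch. Everything here is a stand-in: the planner is replaced by a heuristic that flags masked positions as certainly corrupted, and the denoiser by an oracle that knows the clean sequence; in DDPD both are learned models.

```python
import random

MASK = "?"

def planner(seq):
    """Score each position's probability of being corrupted. Toy stand-in:
    masked positions are certainly corrupted, clean ones get a small score
    so already-denoised tokens could in principle be revisited."""
    return [1.0 if t == MASK else 0.1 for t in seq]

def denoiser(seq, i, target):
    """Toy oracle denoiser that knows the clean sequence; in DDPD this is
    a learned model predicting the clean token from the noisy context."""
    return target[i]

def generate(target, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(target)                 # start fully corrupted
    while MASK in seq:
        scores = planner(seq)                  # plan: where to denoise next
        i = rng.choices(range(len(seq)), weights=scores)[0]
        seq[i] = denoiser(seq, i, target)      # denoise the chosen position
    return "".join(seq)

print(generate("abcdabcd"))
```

The separation matters because the planner's position scores, not a fixed schedule, decide the denoising order at inference time.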
Generative Modeling of Molecular Dynamics Trajectories
Jing, Bowen, Stärk, Hannes, Jaakkola, Tommi, Berger, Bonnie
Molecular dynamics (MD) is a powerful technique for studying microscopic phenomena, but its computational cost has driven significant interest in the development of deep learning-based surrogate models. We introduce generative modeling of molecular trajectories as a paradigm for learning flexible multi-task surrogate models of MD from data. By conditioning on appropriately chosen frames of the trajectory, we show such generative models can be adapted to diverse tasks such as forward simulation, transition path sampling, and trajectory upsampling. By alternatively conditioning on part of the molecular system and inpainting the rest, we also demonstrate the first steps towards dynamics-conditioned molecular design. We validate the full set of these capabilities on tetrapeptide simulations and show that our model can produce reasonable ensembles of protein monomers. Altogether, our work illustrates how generative modeling can unlock value from MD data towards diverse downstream tasks that are not straightforward to address with existing methods or even MD itself. Code is available at https://github.com/bjing2016/mdgen.
Transition Path Sampling with Boltzmann Generator-based MCMC Moves
Plainer, Michael, Stärk, Hannes, Bunne, Charlotte, Günnemann, Stephan
Sampling all possible transition paths between two 3D states of a molecular system has various applications ranging from catalyst design to drug discovery. Current approaches to sample transition paths use Markov chain Monte Carlo and rely on time-intensive molecular dynamics simulations to find new paths. Our approach operates in the latent space of a normalizing flow that maps from the molecule's Boltzmann distribution to a Gaussian, where we propose new paths without requiring molecular simulations. Using alanine dipeptide, we explore Metropolis-Hastings acceptance criteria in the latent space for exact sampling and investigate different latent proposal mechanisms.
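A minimal 1D sketch of Metropolis-Hastings in a flow's latent space, with made-up ingredients: a double-well potential stands in for the molecule's Boltzmann distribution, and a fixed affine map stands in for a trained normalizing flow. For an affine flow the Jacobian terms in the acceptance ratio are constant and cancel; a general flow would contribute its log-determinants.

```python
import math, random

def log_p(x):
    """Unnormalized log-density of a 1D double well, standing in for a
    molecular Boltzmann distribution with two metastable states."""
    return -((x * x - 1.0) ** 2) / 0.5

# Hypothetical "trained flow": an affine map z = x / 0.8 roughly whitening p.
def f_inv(z):
    return 0.8 * z

def latent_mh(n_steps=20000, step=0.5, seed=0):
    rng = random.Random(seed)
    z = 1.25                                 # f(1.0): start in the right well
    samples = []
    for _ in range(n_steps):
        z_new = z + rng.gauss(0.0, step)     # symmetric proposal in latent space
        # Metropolis-Hastings ratio of the pushforward density; the constant
        # affine log-Jacobians cancel, leaving only log_p at the mapped points.
        if math.log(rng.random()) < log_p(f_inv(z_new)) - log_p(f_inv(z)):
            z = z_new
        samples.append(f_inv(z))
    return samples

samples = latent_mh()
print(min(samples), max(samples))
```

The exactness of the sampler rests on the acceptance criterion, not on the flow being perfect; a better flow only makes the latent-space proposals more efficient.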
Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems
Zhang, Xuan, Wang, Limei, Helwig, Jacob, Luo, Youzhi, Fu, Cong, Xie, Yaochen, Liu, Meng, Lin, Yuchao, Xu, Zhao, Yan, Keqiang, Adams, Keir, Weiler, Maurice, Li, Xiner, Fu, Tianfan, Wang, Yucheng, Yu, Haiyang, Xie, YuQing, Fu, Xiang, Strasser, Alex, Xu, Shenglong, Liu, Yi, Du, Yuanqi, Saxton, Alexandra, Ling, Hongyi, Lawrence, Hannah, Stärk, Hannes, Gui, Shurui, Edwards, Carl, Gao, Nicholas, Ladera, Adriana, Wu, Tailin, Hofgard, Elyssa F., Tehrani, Aria Mansouri, Wang, Rui, Daigavane, Ameya, Bohde, Montgomery, Kurtin, Jerry, Huang, Qian, Phung, Tuong, Xu, Minkai, Joshi, Chaitanya K., Mathis, Simon V., Azizzadenesheli, Kamyar, Fang, Ada, Aspuru-Guzik, Alán, Bekkers, Erik, Bronstein, Michael, Zitnik, Marinka, Anandkumar, Anima, Ermon, Stefano, Liò, Pietro, Yu, Rose, Günnemann, Stephan, Leskovec, Jure, Ji, Heng, Sun, Jimeng, Barzilay, Regina, Jaakkola, Tommi, Coley, Connor W., Qian, Xiaoning, Qian, Xiaofeng, Smidt, Tess, Ji, Shuiwang
Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural sciences. Today, AI has started to advance natural sciences by improving, accelerating, and enabling our understanding of natural phenomena at a wide range of spatial and temporal scales, giving rise to a new area of research known as AI for science (AI4Science). Being an emerging research paradigm, AI4Science is unique in that it is an enormous and highly interdisciplinary area. Thus, a unified and technical treatment of this field is needed yet challenging. This work aims to provide a technically thorough account of a subarea of AI4Science: AI for quantum, atomistic, and continuum systems. These areas aim at understanding the physical world from the subatomic (wavefunctions and electron density) and atomic (molecules, proteins, materials, and interactions) to macro (fluids, climate, and subsurface) scales, and they form an important subarea of AI4Science. A unique advantage of focusing on these areas is that they largely share a common set of challenges, thereby allowing a unified and foundational treatment. A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods. We provide an in-depth yet intuitive account of techniques to achieve equivariance to symmetry transformations. We also discuss other common technical challenges, including explainability, out-of-distribution generalization, knowledge transfer with foundation and large language models, and uncertainty quantification. To facilitate learning and education, we provide categorized lists of resources that we found to be useful. We strive to be thorough and unified, and we hope this initial effort may spark broader community interest and effort to further advance AI4Science.
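One of the simplest routes to the symmetry capture discussed above is to feed a model only invariant features. A small numpy sketch, with made-up random coordinates, checks that pairwise distances of a point cloud are unchanged by a random rotation and translation:

```python
import numpy as np

def pairwise_distances(x):
    """Pairwise-distance features of a 3D point cloud: invariant to
    rotations and translations, one of the simplest ways to bake
    Euclidean symmetry into a learned model's inputs."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                    # five atoms in 3D
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
print(np.allclose(pairwise_distances(x @ q.T + 1.0), pairwise_distances(x)))
```

Equivariant architectures go further than such invariant inputs, producing outputs that transform along with the input, but the invariance check above is the basic sanity test behind both.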
Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design
Stärk, Hannes, Jing, Bowen, Barzilay, Regina, Jaakkola, Tommi
Much of protein function, including enzymatic catalysis, requires binding small molecules. Designing proteins that bind small molecules therefore has many impactful applications, ranging from drug synthesis to energy storage and gene editing: we may, for example, design antidotes that sequester toxins, or enzymes that enable chemical reactions through catalysis, which plays a major role in most biological processes. Specifically, we aim to design protein pockets that bind a given small molecule (the ligand). We assume that we are given a protein pocket via the 3D backbone atom locations of its residues, as well as the 2D chemical graph of the ligand; we assume no knowledge of the ligand's 3D structure or binding pose. Based on this information, our goal is to predict the amino acid identities for the given backbone locations (see Figure 1). We also consider the more challenging task of designing pockets that simultaneously bind multiple molecules and ions (which we call a multi-ligand). Such multi-ligand binding proteins are important, for example, in enzyme design, where the ligands correspond to reactants.
DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models
Ketata, Mohamed Amine, Laue, Cedrik, Mammadov, Ruslan, Stärk, Hannes, Wu, Menghua, Corso, Gabriele, Marquet, Céline, Barzilay, Regina, Jaakkola, Tommi S.
Understanding how proteins structurally interact is crucial to modern biology, with applications in drug discovery and protein design. Recent machine learning methods have formulated protein-small molecule docking as a generative problem, with significant performance boosts over both traditional and deep learning baselines; we extend this generative formulation to rigid protein-protein docking with DiffDock-PP, a diffusion model over the relative rotation and translation of the two binding partners. We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines. Proteins realize their myriad biological functions through interactions with biomolecules such as other proteins, nucleic acids, or small molecules. The presence or absence of such interactions is dictated in part by the geometric and chemical complementarity of the participating bodies. Thus, learning how individual proteins form complexes is crucial to understanding protein activity.
Task-Agnostic Graph Neural Network Evaluation via Adversarial Collaboration
Zhao, Xiangyu, Stärk, Hannes, Beaini, Dominique, Zhao, Yiren, Liò, Pietro
It has become increasingly important to develop reliable methods for evaluating the progress of Graph Neural Network (GNN) research in molecular representation learning. Existing GNN benchmarks for molecular representation learning compare GNNs' performance on node- or graph-level classification and regression tasks on particular datasets, but there is no principled, task-agnostic method to directly compare two GNNs. Additionally, most existing self-supervised learning works rely on handcrafted data augmentations, which are difficult to apply to graphs due to their unique characteristics. To address these issues, we propose GraphAC (Graph Adversarial Collaboration) -- a conceptually novel, principled, task-agnostic, and stable framework for evaluating GNNs through contrastive self-supervision. We introduce a novel objective function, the Competitive Barlow Twins, that allows two GNNs to jointly update themselves through direct competition against each other. GraphAC succeeds in distinguishing GNNs of different expressiveness across various aspects and has been demonstrated to be a principled and reliable GNN evaluation method, without requiring any augmentations.
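The Competitive Barlow Twins builds on the Barlow Twins cross-correlation objective. How the competition between the two GNNs is wired into that objective follows the paper, but the shared ingredient can be sketched in numpy (random embeddings here stand in for the two GNNs' outputs):

```python
import numpy as np

def cross_correlation(z_a, z_b):
    """Cross-correlation matrix between two batches of embeddings
    (batch_size x dim), each standardized along the batch dimension."""
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    return (z_a.T @ z_b) / z_a.shape[0]

def barlow_twins_loss(c, lam=0.005):
    """Pull the diagonal toward 1 (feature agreement) and the off-diagonal
    toward 0 (feature decorrelation)."""
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z1 = rng.normal(size=(64, 8))              # embeddings from "GNN 1"
z2 = z1 + 0.1 * rng.normal(size=(64, 8))   # a strongly correlated "GNN 2"
print(barlow_twins_loss(cross_correlation(z1, z2)))
```

A more expressive GNN can better predict (correlate with) its rival's features than vice versa, which is the asymmetry the competitive variant exploits for evaluation.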
DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
Corso, Gabriele, Stärk, Hannes, Jing, Bowen, Barzilay, Regina, Jaakkola, Tommi
Predicting the binding structure of a small molecule ligand to a protein--a task known as molecular docking--is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem over the manifold of ligand poses: we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision. The biological functions of proteins can be modulated by small molecule ligands (such as drugs) binding to them. Thus, a crucial task in computational drug design is molecular docking--predicting the position, orientation, and conformation of a ligand when bound to a target protein--from which the effect of the ligand (if any) might be inferred.
3D Infomax improves GNNs for Molecular Property Prediction
Stärk, Hannes, Beaini, Dominique, Corso, Gabriele, Tossou, Prudencio, Dallago, Christian, Günnemann, Stephan, Liò, Pietro
Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impacts. Including 3D molecular structure as input to learned models improves their performance for many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a Graph Neural Network (GNN) such that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still produces implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.

The understanding of molecular and quantum chemistry is a rapidly growing area for deep learning, with models having direct real-world impacts in quantum chemistry (Dral, 2020), protein structure prediction (Jumper et al., 2021), materials science (Schmidt et al., 2019), and drug discovery (Stokes et al., 2020). In particular, for the task of molecular property prediction, GNNs have had great success (Yang et al., 2019). GNNs operate on the molecular graph by updating each atom's representation based on the atoms connected to it via covalent bonds. However, these models reason poorly about other important interatomic forces that depend on the atoms' relative positions in space. Previous works showed that using the atoms' 3D coordinates in space improves the accuracy of molecular property prediction (Schütt et al., 2017; Klicpera et al., 2020b; Liu et al., 2021; Klicpera et al., 2021).
However, using classical molecular dynamics simulations to explicitly compute a molecule's geometry before predicting its properties is computationally intractable for many real-world applications. Even recent Machine Learning (ML) methods for conformation generation (Xu et al., 2021b; Shi et al., 2021; Ganea et al., 2021) are still too slow for large-scale applications. A GNN is pre-trained by maximizing the mutual information (MI) between its embedding of a 2D molecular graph and a representation capturing the 3D information that is produced by a separate network.
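At a high level, such mutual-information maximization can be done with a contrastive (InfoNCE-style) objective, where a molecule's 2D-GNN embedding and its own 3D summary vector form the positive pair and all other molecules in the batch serve as negatives. A numpy sketch under that assumption, with random vectors standing in for the two networks' outputs:

```python
import numpy as np

def info_nce(z_2d, z_3d, temperature=0.1):
    """Contrastive lower bound on the mutual information between 2D graph
    embeddings and 3D summary vectors: row i of each batch is a positive
    pair, every other row a negative."""
    z_2d = z_2d / np.linalg.norm(z_2d, axis=1, keepdims=True)
    z_3d = z_3d / np.linalg.norm(z_3d, axis=1, keepdims=True)
    logits = (z_2d @ z_3d.T) / temperature        # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))         # cross-entropy of positives

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))
print(info_nce(z, z), info_nce(z, z[::-1].copy()))  # aligned vs. mismatched pairs
```

Minimizing this loss pushes each 2D embedding toward its own molecule's 3D summary and away from the others', which is what lets the GNN carry implicit 3D information after pre-training.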