 mldd workshop


Multiparameter Persistent Homology for Molecular Property Prediction

Demir, Andac, Kiziltan, Bulent

arXiv.org Artificial Intelligence

In this study, we present a novel molecular fingerprint generation method based on multiparameter persistent homology. This approach reveals the latent structures and relationships within molecular geometry, and detects topological features that exhibit persistence across multiple scales along multiple parameters, such as atomic mass, partial charge, and bond type, and can be further enhanced by incorporating additional parameters like ionization energy, electron affinity, chirality and orbital hybridization. The proposed fingerprinting method provides fresh perspectives on molecular structure that are not easily discernible from single-parameter or single-scale analysis. Besides, in comparison with traditional graph neural networks, multiparameter persistent homology has the advantage of providing a more comprehensive and interpretable characterization of the topology of the molecular data. We have established theoretical stability guarantees for multiparameter persistent homology, and have conducted extensive experiments on the Lipophilicity, FreeSolv, and ESOL datasets to demonstrate its effectiveness in predicting molecular properties.


Multi-scale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data

Voronov, Gennady, Lightheart, Rose, Davison, Joe, Krettler, Christoph A., Healey, David, Butler, Thomas

arXiv.org Artificial Intelligence

Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product drug discovery, and many other applications. The primary window into the composition of small molecule mixtures is tandem mass spectrometry (MS2), which produces high-sensitivity, part-per-million resolution data. We adopt multi-scale sinusoidal embeddings of the mass data in MS2 designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We vary the resolution of the input spectra directly by using different floating point representations of the MS2 data, and show that the resulting sinusoidal embeddings are able to learn from the high-resolution portion of the input MS2 data. We apply dimensionality reduction to the embeddings that result from different resolution input masses to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data. Metabolomics is the study of the small molecule (<1,000 Daltons) contents of complex biological samples. Tandem Mass Spectrometry (MS/MS), in conjunction with chromatography, is one of the most commonly used tools in metabolomics.
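The idea of a multi-scale sinusoidal embedding of mass values can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the embedding dimension, and the wavelength range (1e-3 to 1e3 Da, geometrically spaced) are all assumptions chosen so that both coarse and sub-ppm differences in m/z change some feature.

```python
import numpy as np

def sinusoidal_mass_embedding(mz, dim=32, min_wavelength=1e-3, max_wavelength=1e3):
    """Embed m/z values with sin/cos features at geometrically spaced wavelengths.

    All parameter defaults are illustrative assumptions, not values from the paper.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even (half sin, half cos features)")
    mz = np.asarray(mz, dtype=np.float64).reshape(-1, 1)
    # One wavelength per sin/cos pair, from fine (1e-3 Da) to coarse (1e3 Da).
    wavelengths = np.geomspace(min_wavelength, max_wavelength, dim // 2)
    angles = 2.0 * np.pi * mz / wavelengths  # shape: (n_peaks, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Two hypothetical peaks that differ by only 0.001 Da.
peaks = [180.0634, 180.0644]
emb = sinusoidal_mass_embedding(peaks)
distance = np.linalg.norm(emb[0] - emb[1])
```

The short wavelengths separate the two nearby peaks, while the long wavelengths keep them close, so a downstream model can use whichever scale is informative.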


An Exploration of Conditioning Methods in Graph Neural Networks

Koishekenov, Yeskendir, Bekkers, Erik J.

arXiv.org Artificial Intelligence

The flexibility and effectiveness of message passing based graph neural networks (GNNs) induced considerable advances in deep learning on graph-structured data. In such approaches, GNNs recursively update node representations based on their neighbors, and they gain expressivity through the use of node and edge attribute vectors. For example, in computational physics and chemistry tasks, the use of edge attributes such as relative position or distance has proved to be essential. In this work, we address not what kind of attributes to use, but how to condition on this information to improve model performance. We consider three types of conditioning: weak, strong, and pure, which respectively relate to concatenation-based conditioning, gating, and transformations that are causally dependent on the attributes. This categorization provides a unifying viewpoint on different classes of GNNs, from separable convolutions to various forms of message passing networks. We provide an empirical study on the effect of conditioning methods in several tasks in computational chemistry.
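The three conditioning types named in the abstract above (concatenation, gating, and attribute-dependent transformations) can be illustrated with a small NumPy sketch. The abstract gives only the categories, so the specific parameterizations below (a single linear layer, a sigmoid gate, and a linear hypernetwork that generates the weight matrix) as well as all names and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_edges, d_h, d_a, d_out = 5, 8, 3, 8  # hypothetical sizes

h = rng.normal(size=(n_edges, d_h))  # neighbor node features, one row per edge
a = rng.normal(size=(n_edges, d_a))  # edge attributes (e.g. distances)

W_cat = rng.normal(size=(d_h + d_a, d_out))
W_h = rng.normal(size=(d_h, d_out))
W_gate = rng.normal(size=(d_a, d_out))
W_hyper = rng.normal(size=(d_a, d_h * d_out))

def weak_conditioning(h, a, W_cat):
    # Weak: attributes are concatenated to the features before a shared linear map.
    return np.concatenate([h, a], axis=1) @ W_cat

def strong_conditioning(h, a, W_h, W_gate):
    # Strong: attributes produce a sigmoid gate that rescales the transformed features.
    gate = 1.0 / (1.0 + np.exp(-(a @ W_gate)))
    return (h @ W_h) * gate

def pure_conditioning(h, a, W_hyper, d_out):
    # Pure: the weight matrix itself is a function of the attributes.
    W_a = (a @ W_hyper).reshape(-1, h.shape[1], d_out)  # (n_edges, d_h, d_out)
    return np.einsum("ei,eio->eo", h, W_a)

m_weak = weak_conditioning(h, a, W_cat)
m_strong = strong_conditioning(h, a, W_h, W_gate)
m_pure = pure_conditioning(h, a, W_hyper, d_out)
```

In this sketch the attributes enter additively in the weak case, multiplicatively per channel in the strong case, and through the full transformation in the pure case.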


DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models

Ketata, Mohamed Amine, Laue, Cedrik, Mammadov, Ruslan, Stärk, Hannes, Wu, Menghua, Corso, Gabriele, Marquet, Céline, Barzilay, Regina, Jaakkola, Tommi S.

arXiv.org Artificial Intelligence

Understanding how proteins structurally interact is crucial to modern biology, with applications in drug discovery and protein design. Recent machine learning methods have formulated protein-small molecule docking as a generative problem with significant performance boosts over both traditional and deep learning baselines. We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines. Proteins realize their myriad biological functions through interactions with biomolecules, such as other proteins, nucleic acids, or small molecules. The presence or absence of such interactions is dictated in part by the geometric and chemical complementarity of participating bodies. Thus, learning how individual proteins form complexes is crucial to understanding protein activity.


The power of motifs as inductive bias for learning molecular distributions

Sommer, Johanna, Hetzel, Leon, Lüdke, David, Theis, Fabian, Günnemann, Stephan

arXiv.org Artificial Intelligence

Machine learning for molecules holds great potential for efficiently exploring the vast chemical space and thus streamlining the drug discovery process by facilitating the design of new therapeutic molecules. Deep generative models have shown promising results for molecule generation, but the benefits of specific inductive biases for learning distributions over small graphs are unclear. Our study aims to investigate the impact of subgraph structures and vocabulary design on distribution learning, using small drug molecules as a case study. To this end, we introduce Subcover, a new subgraph-based fragmentation scheme, and evaluate it through a two-step variational auto-encoder. Our results show that Subcover's improved identification of chemically meaningful subgraphs leads to a relative improvement of the FCD score by 30%, outperforming previous methods. Our findings highlight the potential of Subcover to enhance the performance and scalability of existing methods, contributing to the advancement of drug discovery. Generative models for molecules offer a way to create new compounds with specific properties, which can be useful in various fields, including drug discovery, material science, and chemistry (Bian & Xie, 2021; Choudhary et al., 2022; Hetzel et al., 2022; Zhu et al., 2022; Du et al., 2022).


EigenFold: Generative Protein Structure Prediction with Diffusion Models

Jing, Bowen, Erives, Ezra, Pao-Huang, Peter, Corso, Gabriele, Berger, Bonnie, Jaakkola, Tommi

arXiv.org Artificial Intelligence

Protein structure prediction has reached revolutionary levels of accuracy on single structures, yet distributional modeling paradigms are needed to capture the conformational ensembles and flexibility that underlie biological function. We define a diffusion process that models the structure as a system of harmonic oscillators and which naturally induces a cascading-resolution generative process along the eigenmodes of the system. We assess EigenFold's ability to model and predict conformational heterogeneity for fold-switching proteins and ligand-induced conformational change. The development of accurate methods for protein structure prediction such as AlphaFold2 (Jumper et al., 2021) has revolutionized in silico understanding of protein structure and function. However, while such methods are designed to model static experimental structures from crystallography or cryo-EM, proteins in vivo adopt dynamic structural ensembles featuring conformational flexibility, change, and even disorder to effect their biological functions (Teague, 2003; Wright & Dyson, 2015).
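To see why a harmonic-oscillator model induces a cascading-resolution process along eigenmodes, consider a toy case. The sketch below assumes a chain of beads joined by unit springs between consecutive residues (the abstract does not specify the potential, so this is an assumption), and a per-mode process dz = -(lam/2) z dt + dW. Under these assumptions, each eigenmode loses its signal at a rate set by its eigenvalue, so fine-scale (high-eigenvalue) modes are noised first and coarse modes last.

```python
import numpy as np

def chain_laplacian(n):
    """Hessian of E(x) = 1/2 * sum_i (x_i - x_{i+1})^2 for a chain of n beads."""
    H = np.zeros((n, n))
    for i in range(n - 1):
        H[i, i] += 1.0
        H[i + 1, i + 1] += 1.0
        H[i, i + 1] -= 1.0
        H[i + 1, i] -= 1.0
    return H

n = 16  # hypothetical chain length
H = chain_laplacian(n)
eigvals, eigvecs = np.linalg.eigh(H)  # eigenvalues in ascending order

def signal_retained(lam, t):
    # For dz = -(lam / 2) z dt + dW, the mean of z decays as exp(-lam * t / 2).
    return np.exp(-0.5 * lam * t)

# Fraction of the initial signal left in each mode at time t = 1.
retained = signal_retained(eigvals, t=1.0)
```

The zero eigenvalue corresponds to rigid translation of the whole chain, which this toy process never noises; reversing the process would restore coarse modes first and fine detail last.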


Improving Small Molecule Generation using Mutual Information Machine

Reidenbach, Danny, Livne, Micha, Ilango, Rajesh K., Gill, Michelle, Israeli, Johnny

arXiv.org Artificial Intelligence

We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under certain constraints (e.g., similarity to a reference molecule). Here we introduce MolMIM, a probabilistic auto-encoder for small molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning, and provides a fixed length representation of variable length SMILES strings. Since encoder-decoder models can learn representations with "holes" of invalid samples, here we propose a novel extension to the training procedure which promotes a dense latent space, and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured in terms of validity, uniqueness, and novelty. We then utilize CMA-ES, a naive black-box and gradient free search algorithm, over MolMIM's latent space for the task of property guided molecule optimization. We achieve state-of-the-art results in several constrained single property optimization tasks as well as in the challenging task of multi-objective optimization, improving over previous success rate SOTA by more than 5%. We attribute the strong results to MolMIM's latent representation which clusters similar molecules in the latent space, whereas CMA-ES is often used as a baseline optimization method. We also demonstrate MolMIM to be favourable in a compute limited regime, making it an attractive model for such cases.


Task-Agnostic Graph Neural Network Evaluation via Adversarial Collaboration

Zhao, Xiangyu, Stärk, Hannes, Beaini, Dominique, Zhao, Yiren, Liò, Pietro

arXiv.org Artificial Intelligence

There is a growing need for reliable methods to evaluate the progress of Graph Neural Network (GNN) research for molecular representation learning. Existing GNN benchmarking methods for molecular representation learning focus on comparing the GNNs' performance on node/graph classification/regression tasks on certain datasets. However, a principled, task-agnostic method to directly compare two GNNs is lacking. Additionally, most existing self-supervised learning works incorporate handcrafted augmentations to the data, which are difficult to apply to graphs due to their unique characteristics. To address the aforementioned issues, we propose GraphAC (Graph Adversarial Collaboration) -- a conceptually novel, principled, task-agnostic, and stable framework for evaluating GNNs through contrastive self-supervision. We introduce a novel objective function, the Competitive Barlow Twins, that allows two GNNs to jointly update themselves from direct competitions against each other. GraphAC succeeds in distinguishing GNNs of different expressiveness across various aspects, and has been demonstrated to be a principled and reliable GNN evaluation method, without necessitating any augmentations.
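The abstract above does not define the Competitive Barlow Twins objective, so the sketch below shows only the standard Barlow Twins loss that it builds on: the cross-correlation matrix between two sets of embeddings is pushed toward the identity. Treating `z_a` and `z_b` as the outputs of two different GNNs on the same batch of graphs is an assumed reading, and the function name and the off-diagonal weight are illustrative choices.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-8):
    """Standard Barlow Twins loss between two (n_samples, dim) embedding batches.

    This is the original objective, not the competitive variant from the paper.
    """
    n = z_a.shape[0]
    # Standardize each feature over the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n  # (dim, dim) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy-reduction term
    return on_diag + lambda_offdiag * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 4))
loss_same = barlow_twins_loss(z, z)                            # identical embeddings
loss_indep = barlow_twins_loss(z, rng.normal(size=(256, 4)))   # unrelated embeddings
```

Identical embeddings give a loss near zero, while unrelated embeddings are penalized mainly through the diagonal term.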


FlexVDW: A machine learning approach to account for protein flexibility in ligand docking

Suriana, Patricia, Paggi, Joseph M., Dror, Ron O.

arXiv.org Artificial Intelligence

Most widely used ligand docking methods assume a rigid protein structure. This leads to problems when the structure of the target protein deforms upon ligand binding. In particular, the ligand's true binding pose is often scored very unfavorably due to apparent clashes between ligand and protein atoms, which lead to extremely high values of the calculated van der Waals energy term. Traditionally, this problem has been addressed by explicitly searching for receptor conformations to account for the flexibility of the receptor in ligand binding. Here we present a deep learning model trained to take receptor flexibility into account implicitly when predicting van der Waals energy. We show that incorporating this machine-learned energy term into a state-of-the-art physics-based scoring function improves small molecule ligand pose prediction results in cases with substantial protein deformation, without degrading performance in cases with minimal protein deformation. This work demonstrates the feasibility of learning effects of protein flexibility on ligand binding without explicitly modeling changes in protein structure. A critical problem in rational drug discovery is prediction of the position, orientation, and conformation of a ligand (e.g., a drug candidate) when bound to a target protein--i.e., the ligand's "binding pose." Protein-ligand docking methods, which are used to predict ligand binding poses, are key tools in drug discovery and molecular modeling applications (Kitchen et al., 2004; Ferreira et al., 2015).


SupSiam: Non-contrastive Auxiliary Loss for Learning from Molecular Conformers

Maser, Michael, Park, Ji Won, Lin, Joshua Yao-Yu, Lee, Jae Hyeon, Frey, Nathan C., Watkins, Andrew

arXiv.org Artificial Intelligence

We investigate Siamese networks for learning related embeddings for augmented samples of molecular conformers. We find that a non-contrastive (positive-pair only) auxiliary task aids in supervised training of Euclidean neural networks (E3NNs) and increases manifold smoothness (MS) around point-cloud geometries. We demonstrate this property for multiple drug-activity prediction tasks while maintaining relevant performance metrics, and propose an extension of MS to probabilistic and regression settings. We provide an analysis of representation collapse, finding substantial effects of task-weighting, latent dimension, and regularization. We expect the presented protocol to aid in the development of reliable E3NNs from molecular conformers, even for small-data drug discovery programs. Modeling conformational shape is of critical importance in many molecular machine learning (MolML) tasks (Zheng et al., 2017).