Flam-Shepherd, Daniel
Atom-by-atom protein generation and beyond with language models
Flam-Shepherd, Daniel, Zhu, Kevin, Aspuru-Guzik, Alán
Protein language models learn powerful representations directly from sequences of amino acids. However, they are constrained to generate proteins using only the set of amino acids represented in their vocabulary. In contrast, chemical language models learn atom-level representations of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level representations of proteins, enabling protein generation unconstrained by the standard genetic code and extending far beyond it. In doing so, we show that language models can generate entire proteins atom by atom -- effectively learning the multiple hierarchical layers of molecular information that define proteins, from their primary sequence to their secondary and tertiary structure. We demonstrate that language models are able to explore beyond protein space -- generating proteins with modified sidechains that form unnatural amino acids. Even further, we find that language models can explore chemical space and protein space simultaneously and generate novel examples of protein-drug conjugates. These results demonstrate the potential for biomolecular design at the atom level using language models.
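To make the training setup concrete, the sketch below is a minimal character-level language model trained by next-token prediction on atom-level SMILES strings of peptides. The two dipeptide strings, the GRU architecture, and all hyperparameters are illustrative assumptions, not the paper's actual model or data.

# Minimal sketch (not the authors' code): a character-level language model
# trained by next-token prediction on atom-level SMILES strings of peptides.
# The tiny corpus and hyperparameters below are illustrative only.
import torch
import torch.nn as nn

corpus = [
    "N[C@@H](C)C(=O)N[C@@H](CC1=CC=CC=C1)C(=O)O",  # Ala-Phe dipeptide (example)
    "N[C@@H](CO)C(=O)N[C@@H](C)C(=O)O",            # Ser-Ala dipeptide (example)
]
chars = sorted({c for s in corpus for c in s} | {"^", "$"})  # ^/$ = start/end
stoi = {c: i for i, c in enumerate(chars)}

class CharLM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = CharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    for s in corpus:
        ids = torch.tensor([[stoi[c] for c in "^" + s + "$"]])
        logits = model(ids[:, :-1])              # predict each next character
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, len(chars)), ids[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

Sampling from such a model character by character yields complete atom-level strings, which is the sense in which proteins can be generated atom by atom.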
Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files
Flam-Shepherd, Daniel, Aspuru-Guzik, Alán
Language models are powerful tools for molecular design. Currently, the dominant paradigm is to parse molecular graphs into linear string representations that can easily be trained on. This approach has been very successful; however, it is limited to chemical structures that can be completely represented by a graph -- like organic molecules -- while materials and biomolecular structures like protein binding sites require a more complete representation that includes the relative positioning of their atoms in space. In this work, we show how language models, without any architecture modifications and trained using next-token prediction, can generate novel and valid structures in three dimensions from substantially different distributions of chemical structures. In particular, we demonstrate that language models trained on sequences derived directly from chemical file formats like XYZ files, Crystallographic Information Files (CIFs), or Protein Data Bank files (PDBs) can directly generate molecules, crystals, and protein binding sites in three dimensions. Furthermore, despite being trained on chemical file sequences, language models still achieve performance comparable to state-of-the-art models that use graph and graph-derived string representations, as well as other domain-specific 3D generative models. In doing so, we demonstrate that it is not necessary to use simplified molecular representations to train chemical language models -- they are powerful generative models capable of directly exploring chemical space in three dimensions for very different structures.
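To illustrate the file-as-sequence idea, here is one plausible way to tokenize an XYZ file into a flat token stream for next-token prediction. The water geometry and the digit-level tokenization scheme are assumptions for illustration, not the paper's exact vocabulary.

# Minimal sketch (an assumption, not the paper's pipeline): turning an XYZ
# file into a flat token sequence suitable for next-token prediction.
xyz = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.467
H 0.000 -0.757 -0.467"""

def tokenize_xyz(text):
    tokens = []
    for line in text.splitlines()[2:]:           # skip atom count + comment
        symbol, *coords = line.split()
        tokens.append(symbol)                    # element token, e.g. "O"
        for c in coords:                         # one token per coordinate digit
            tokens.extend(list(c)); tokens.append(" ")
    tokens.append("<eos>")
    return tokens

print(tokenize_xyz(xyz)[:12])

A model trained on such sequences emits element symbols and coordinates in order, so each sampled sequence decodes back into a complete 3D structure.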
Learning quantum dynamics with latent neural ODEs
Choi, Matthew, Flam-Shepherd, Daniel, Kyaw, Thi Ha, Aspuru-Guzik, Alán
Deep learning and neural networks have recently become the powerhouse in machine learning (ML) and have successfully been used to tackle complex problems in classical [1-3] and quantum mechanics [4-7] (see Refs. [8-12] for reviews). Machine-assisted scientific discovery is still in its infancy, but progress has been made, mostly by building the correct inductive bias -- or structure -- into the model or loss function. For example, physical conservation laws can be learned [1, 2]. Other work has made progress in a purely data-driven approach, learning relationships between quantum experiments and entanglement using generative models [13]. Recently, neural ordinary differential equations (ODEs) were introduced [14, 15], a neural network layer defined by differential equations. Neural ODEs provide the perfect model for physics, since many physical laws are governed by ODEs, and thus every neural ODE has the correct inductive bias built into the model itself. In general, the study of open quantum systems is important for quantum computing as well as many other areas of physics, from many-body phenomena [27, 28] and light-matter interaction [29-31] to non-equilibrium physics [32, 33]. Here, we demonstrate that latent ODEs can be trained to generate and extrapolate measurement data from dynamical quantum evolution in both closed and open quantum systems using only physical observations, without specifying the physics a priori. This is in line with treating the quantum system as a black box and the "shut up and calculate" philosophy [34].
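A latent ODE of this kind can be sketched as an encoder that maps observed measurements to an initial latent state, a neural ODE that evolves that state in time, and a decoder that maps it back to predicted observables. The sketch below assumes the torchdiffeq package; the dimensions and random data are illustrative stand-ins, not the paper's experiments.

# Minimal latent-ODE sketch (assumes the torchdiffeq package; illustrative only).
import torch
import torch.nn as nn
from torchdiffeq import odeint

class LatentODE(nn.Module):
    def __init__(self, obs_dim=4, latent_dim=8):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, latent_dim, batch_first=True)
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))
        self.decoder = nn.Linear(latent_dim, obs_dim)

    def forward(self, observed, times):
        _, h = self.encoder(observed)            # summarize trajectory -> z0
        z0 = h[-1]
        f = lambda t, z: self.dynamics(z)        # learned dz/dt = f(z)
        zt = odeint(f, z0, times)                # latent state at query times
        return self.decoder(zt)                  # predicted measurements

model = LatentODE()
obs = torch.randn(2, 10, 4)                      # batch of measurement records
times = torch.linspace(0.0, 2.0, 20)             # includes extrapolation range
pred = model(obs, times)                         # shape (20, 2, 4)

Because the query times can extend past the observation window, the same model both reconstructs and extrapolates the dynamics, with no Hamiltonian or Lindbladian specified a priori.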
Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning
Flam-Shepherd, Daniel, Zhigalin, Alexander, Aspuru-Guzik, Alán
Machine learning has the potential to automate molecular design and drastically accelerate the discovery of new functional compounds. Towards this goal, generative models and reinforcement learning (RL) using string and graph representations have been successfully used to search for novel molecules. However, these approaches are limited since their representations ignore the three-dimensional (3D) structure of molecules. In fact, geometry plays an important role in many applications of inverse molecular design, especially in drug discovery. Thus, it is important to build models that can generate molecular structures in 3D space based on property-oriented geometric constraints. To address this, one approach is to generate molecules as 3D point clouds by sequentially placing atoms at locations in space -- this allows the process to be guided by physical quantities such as energy or other properties. However, this approach is inefficient, as placing individual atoms makes the exploration unnecessarily deep, limiting the complexity of molecules that can be generated. Moreover, when optimizing a molecule, organic and medicinal chemists use known fragments and functional groups, not single atoms. We introduce a novel RL framework for scalable 3D design that uses a hierarchical agent to build molecules by placing molecular substructures sequentially in 3D space, thus building on the existing human knowledge in the field of molecular design. In a variety of experiments with different substructures, we show that our agent, guided only by energy considerations, can efficiently learn to produce molecules with over 100 atoms from many distributions, including drug-like molecules, organic LED molecules, and biomolecules.
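The hierarchical action structure can be sketched schematically: a high-level choice of which fragment to add, a low-level choice of where to place it, and a reward from an energy evaluation. Everything below (the fragment names, the random policy, the placeholder energy function) is a hypothetical skeleton for illustration, not the paper's agent.

# Schematic sketch (not the paper's method): a hierarchical rollout that first
# picks a fragment, then a placement, with a stand-in energy-based reward.
import random

FRAGMENTS = ["benzene", "amide", "methyl"]       # illustrative fragment library

def energy(molecule):
    # Placeholder: a real implementation would evaluate the 3D structure
    # with a force field or a quantum-chemistry surrogate.
    return -len(molecule)

def rollout(max_steps=5):
    molecule, total_reward = [], 0.0
    for _ in range(max_steps):
        fragment = random.choice(FRAGMENTS)      # high-level action: what to add
        position = (random.uniform(-1, 1),       # low-level action: where to add
                    random.uniform(-1, 1),
                    random.uniform(-1, 1))
        molecule.append((fragment, position))
        total_reward += -energy(molecule)        # reward favors lower energy
    return molecule, total_reward

print(rollout())

Acting at the fragment level shortens each episode relative to atom-by-atom placement, which is the efficiency argument the paper makes.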
Keeping it Simple: Language Models can learn Complex Molecular Distributions
Flam-Shepherd, Daniel, Zhu, Kevin, Aspuru-Guzik, Alán
Deep generative models of molecules have grown immensely in popularity; trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. More sophisticated are graph generative models, which sequentially construct molecular graphs and typically achieve state-of-the-art results. However, recent work has shown that language models are more capable than once thought, particularly in the low data regime. In this work, we investigate the capacity of simple language models to learn distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules. On each task, we evaluate the ability of language models compared with two widely used graph generative models. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions -- and yield better performance than the graph models. Language models can accurately generate distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, and the largest molecules in PubChem.
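Distribution-learning benchmarks of this kind are typically scored with metrics such as the validity and uniqueness of sampled strings. The sketch below assumes the rdkit package and uses a made-up sample list to show how such checks are commonly computed; it is not the paper's evaluation code.

# Minimal evaluation sketch (assumes the rdkit package; illustrative only):
# validity and uniqueness of sampled SMILES strings.
from rdkit import Chem

samples = ["CCO", "c1ccccc1", "CC(=O)O", "not_a_smiles", "CCO"]  # example output

valid = [s for s in samples if Chem.MolFromSmiles(s) is not None]
unique = set(valid)
print(f"validity:   {len(valid) / len(samples):.2f}")
print(f"uniqueness: {len(unique) / max(len(valid), 1):.2f}")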
Bayesian Variational Optimization for Combinatorial Spaces
Wu, Tony C., Flam-Shepherd, Daniel, Aspuru-Guzik, Alán
This paper focuses on Bayesian optimization in combinatorial spaces. In many applications in the natural sciences, including the study of molecules, proteins, DNA, device structures, and quantum circuit designs, optimization over combinatorial, categorical spaces is needed to find optimal or Pareto-optimal solutions. However, only a limited number of methods have been proposed to tackle this problem, and many of them depend on Gaussian processes for combinatorial Bayesian optimization. Gaussian processes suffer from scalability issues for large datasets, as their cost scales cubically with the number of data points; this is often impractical for optimizing large search spaces. Here, we introduce a variational Bayesian optimization method that combines variational optimization and continuous relaxations for the optimization of the acquisition function in Bayesian optimization. Critically, this method allows for gradient-based optimization and is capable of handling problems with large data sizes and high data dimensions. We show that the performance of our method is comparable to state-of-the-art methods while maintaining its scalability advantages, and we apply it to molecular optimization.
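The core trick, continuously relaxing a categorical decision so the acquisition function can be maximized by gradient ascent, can be sketched as follows. The softmax relaxation over a fixed score vector is an illustrative stand-in, not the paper's exact formulation.

# Minimal sketch (illustrative, not the paper's method): variational
# optimization of a categorical choice via a continuous softmax relaxation.
import torch

n_choices = 10
scores = torch.randn(n_choices)                  # stand-in acquisition values

logits = torch.zeros(n_choices, requires_grad=True)  # variational parameters
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    probs = torch.softmax(logits, dim=0)         # relaxed categorical distribution
    expected_acq = (probs * scores).sum()        # E_q[acquisition]
    loss = -expected_acq                         # maximize the expectation
    opt.zero_grad(); loss.backward(); opt.step()

best = torch.argmax(logits).item()               # discrete decision after training
print(best, torch.argmax(scores).item())         # the two indices should agree

Because the objective is differentiable in the variational parameters, the search avoids enumerating the combinatorial space and sidesteps the cubic scaling of Gaussian-process surrogates.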