A Two-Step Graph Convolutional Decoder for Molecule Generation Machine Learning

We propose a simple auto-encoder framework for molecule generation. The molecular graph is first encoded into a continuous latent representation $z$, which is then decoded back to a molecule. The encoding process is easy, but the decoding process remains challenging. In this work, we introduce a simple two-step decoding process. In the first step, a fully connected neural network uses the latent vector $z$ to produce a molecular formula, for example CO$_2$ (one carbon and two oxygen atoms). In the second step, a graph convolutional neural network uses the same latent vector $z$ to place bonds between the atoms that were produced in the first step (for example, a double bond will be placed between the carbon and each of the oxygens). This two-step process, in which a bag of atoms is first generated and then assembled, provides a simple framework that allows us to develop an efficient molecule auto-encoder. Numerical experiments on basic tasks such as novelty, uniqueness, validity and optimized chemical property for the 250k ZINC molecules demonstrate the performance of the proposed system. In particular, we achieve the highest reconstruction rate of 90.5\%, improving on the previous rate of 76.7\%. We also report the best property improvement results when optimization is constrained by the molecular distance between the original and generated molecules.
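
The two-step idea (a bag of atoms first, bonds second) can be illustrated with a toy sketch. It is not the authors' decoder: the atom counts and bond orders below are hand-written stand-ins for what the two networks would output, and the names `VALENCE`, `formula_from_counts` and `valences_satisfied` are hypothetical helpers introduced only for this illustration.

```python
# Toy illustration of the two-step idea: atoms first, bonds second.
# All numbers are hand-written stand-ins for network outputs (assumption).

# Typical valences used for a simple sanity check (a simplification).
VALENCE = {"C": 4, "O": 2, "N": 3, "H": 1}


def formula_from_counts(counts):
    """Step 1: turn predicted atom counts into a bag (list) of atoms."""
    atoms = []
    for element, n in counts.items():
        atoms.extend([element] * n)
    return atoms


def valences_satisfied(atoms, bonds):
    """Check that each atom's total bond order equals its typical valence."""
    totals = [0] * len(atoms)
    for (i, j), order in bonds.items():
        totals[i] += order
        totals[j] += order
    return all(totals[k] == VALENCE[a] for k, a in enumerate(atoms))


# Step 1: the bag of atoms for CO2 (one carbon, two oxygens).
atoms = formula_from_counts({"C": 1, "O": 2})

# Step 2: bond orders between atom pairs (a double bond from C to each O).
bonds = {(0, 1): 2, (0, 2): 2}

is_valid = valences_satisfied(atoms, bonds)
```

Separating the steps this way means the bond-placement stage works on a fixed set of atoms, so a validity check like the one above only has to inspect bond orders.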

Utilizing Edge Features in Graph Neural Networks via Variational Information Maximization Machine Learning

Graph Neural Networks (GNNs) achieve an impressive performance on structured graphs by recursively updating the representation vector of each node based on its neighbors, during which parameterized transformation matrices should be learned for the node feature updating. However, existing propagation schemes are far from being optimal since they do not fully utilize the relational information between nodes. We propose the information maximizing graph neural networks (IGNN), which maximizes the mutual information between edge states and transform parameters. We reformulate the mutual information as a differentiable objective via a variational approach. We compare our model against several recent variants of GNNs and show that our model achieves the state-of-the-art performance on multiple tasks including quantum chemistry regression on QM9 dataset, generalization capability from QM9 to larger molecular graphs, and prediction of molecular bioactivities relevant for drug discovery. The IGNN model is based on an elegant and fundamental idea in information theory as explained in the main text, and it could be easily generalized beyond the contexts of molecular graphs considered in this work. To encourage more future work in this area, all datasets and codes used in this paper will be released for public access.
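
The abstract does not state the objective itself. As a hedged illustration, one standard variational lower bound on mutual information that fits the description is shown below, where $X$ and $Y$ are generic random variables (here they would stand for edge states and transform parameters, which is an assumption about the paper's setup) and $q$ is any variational distribution:

$$
I(X;Y) = H(X) - H(X \mid Y) \;\geq\; H(X) + \mathbb{E}_{p(x,y)}\left[\log q(x \mid y)\right]
$$

The gap in the inequality is $\mathbb{E}_{p(y)}\left[\mathrm{KL}\left(p(x \mid y)\,\|\,q(x \mid y)\right)\right] \geq 0$, so the bound is tight when $q(x \mid y) = p(x \mid y)$. Because the right-hand side is an expectation of a log-likelihood, it can be maximized with gradient-based training.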

A Model to Search for Synthesizable Molecules Machine Learning

Deep generative models are able to suggest new organic molecules by generating strings, trees, and graphs representing their structure. While such models allow one to generate molecules with desirable properties, they give no guarantees that the molecules can actually be synthesized in practice. We propose a new molecule generation model, mirroring a more realistic real-world process, where (a) reactants are selected, and (b) combined to form more complex molecules. More specifically, our generative model proposes a bag of initial reactants (selected from a pool of commercially-available molecules) and uses a reaction model to predict how they react together to generate new molecules. We first show that the model can generate diverse, valid and unique molecules due to the useful inductive biases of modeling reactions. Furthermore, our model allows chemists to interrogate not only the properties of the generated molecules but also the feasibility of the synthesis routes. We conclude by using our model to solve retrosynthesis problems, predicting a set of reactants that can produce a target product.

Engineers use graph networks to accurately predict properties of molecules and crystals


[Image: schematic illustration of MEGNet models.]

Nanoengineers at the University of California San Diego have developed new deep learning models that can accurately predict the properties of molecules and crystals. By enabling almost instantaneous property predictions, these deep learning models provide researchers the means to rapidly scan the nearly-infinite universe of compounds to discover potentially transformative materials for various technological applications, such as high-energy-density Li-ion batteries, warm-white LEDs, and better photovoltaics. To construct their models, a team led by nanoengineering professor Shyue Ping Ong at the UC San Diego Jacobs School of Engineering used a new deep learning framework called graph networks, developed by Google DeepMind, the brains behind AlphaGo and AlphaZero. Graph networks have the potential to expand the capabilities of existing AI technology to perform complicated learning and reasoning tasks with limited experience and knowledge--something that humans are good at.

The Math Trick Behind MP3s, JPEGs, and Homer Simpson's Face - Facts So Romantic


Over a decade ago, I was sitting in a college math physics course and my professor spelt out an idea that kind of blew my mind. I think it isn't a stretch to say that this is one of the most widely applicable mathematical discoveries, with applications ranging from optics to quantum physics, radio astronomy, MP3 and JPEG compression, X-ray crystallography, voice recognition, and PET or MRI scans. This mathematical tool--named the Fourier transform, after 18th-century French physicist and mathematician Joseph Fourier--was even used by James Watson and Francis Crick to decode the double helix structure of DNA from the X-ray patterns produced by Rosalind Franklin. You probably use a descendant of Fourier's idea every day, whether you're playing an MP3, viewing an image on the web, asking Siri a question, or tuning in to a radio station. (In addition to his work in theoretical physics and math, Fourier was also the first to discover the greenhouse effect.)

Machine Learning Prediction of Accurate Atomization Energies of Organic Molecules from Low-Fidelity Quantum Chemical Calculations Machine Learning

Recent studies illustrate how machine learning (ML) can be used to bypass a core challenge of molecular modeling: the tradeoff between accuracy and computational cost. Here, we assess multiple ML approaches for predicting the atomization energy of organic molecules. Our resulting models learn the difference between low-fidelity (B3LYP) and high-accuracy (G4MP2) atomization energies, and predict the G4MP2 atomization energy to 0.005 eV (mean absolute error) for molecules with fewer than 9 heavy atoms and 0.012 eV for a small set of molecules with between 10 and 14 heavy atoms. Our two best models, which have different accuracy/speed tradeoffs, enable the efficient prediction of G4MP2-level energies for large molecules and are available through a simple web interface.
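
Learning the difference between a cheap and an expensive calculation, rather than the expensive value itself, can be sketched in a few lines. The example below is a minimal illustration, not the models assessed in the paper: it fits a one-feature least-squares line to the correction, and every number in it is made up rather than a real B3LYP or G4MP2 energy. The helper names `fit_line` and `predict_high` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Made-up training data (NOT real B3LYP or G4MP2 values).
n_heavy = [2, 3, 4, 5, 6]                          # one descriptor per molecule
low_fidelity = [-10.0, -15.2, -20.1, -25.3, -30.2]
high_accuracy = [-10.3, -15.6, -20.6, -25.9, -30.9]

# Learn the correction (high minus low), not the total energy.
deltas = [h - l for h, l in zip(high_accuracy, low_fidelity)]
a, b = fit_line(n_heavy, deltas)


def predict_high(low_energy, heavy_atoms):
    """Cheap calculation plus the learned correction."""
    return low_energy + a * heavy_atoms + b
```

The correction is usually a much smaller and smoother quantity than the total energy, which is why a simple model of the difference can already recover most of the accuracy gap in a setup like this one.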

Cormorant: Covariant Molecular Neural Networks Machine Learning

We propose Cormorant, a rotationally covariant neural network architecture for learning the behavior and properties of complex many-body physical systems. We apply these networks to molecular systems with two goals: learning atomic potential energy surfaces for use in Molecular Dynamics simulations, and learning ground state properties of molecules calculated by Density Functional Theory. Some of the key features of our network are that (a) each neuron explicitly corresponds to a subset of atoms; (b) the activation of each neuron is covariant to rotations, ensuring that overall the network is fully rotationally invariant. Furthermore, the non-linearity in our network is based upon tensor products and the Clebsch-Gordan decomposition, allowing the network to operate entirely in Fourier space. Cormorant significantly outperforms competing algorithms in learning molecular Potential Energy Surfaces from conformational geometries in the MD-17 dataset, and is competitive with other methods at learning geometric, energetic, electronic, and thermodynamic properties of molecules on the GDB-9 dataset.

Probabilistic hypergraph grammars for efficient molecular optimization Machine Learning

We present an approach to make molecular optimization more efficient. We infer a hypergraph replacement grammar from the ChEMBL database, count the frequencies of particular rules being used to expand particular nonterminals in other rules, and use these as conditional priors for the policy model. Simulating random molecules from the resulting probabilistic grammar, we show that conditional priors result in a molecular distribution closer to the training set than using equal rule probabilities or unconditional priors. We then treat molecular optimization as a reinforcement learning problem, using a novel modification of the policy gradient algorithm, batch-advantage, which uses individual rewards minus the batch average reward to weight the log probability loss. The reinforcement learning agent is tasked with building molecules using this grammar, with the goal of maximizing benchmark scores available from the literature. To do so, the agent has policies both to choose the next node in the graph to expand and to select the next grammar rule to apply. The policies are implemented using the Transformer architecture with the partially expanded graph as the input at each step. We show that using the empirical priors as the starting point for a policy eliminates the need for pre-training, and allows us to reach optima faster. We achieve competitive performance on common benchmarks from the literature, such as penalized logP and QED, with only hundreds of training steps on a budget GPU instance.
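
The batch-advantage weighting described above (individual reward minus the batch average reward, applied to the log probability loss) can be written out as a small sketch. This is an illustration under assumptions, not the authors' code: the function name `batch_advantage_loss` is hypothetical, the loss is averaged over the batch by choice, and plain floats stand in for the differentiable log probabilities a real policy network would produce.

```python
def batch_advantage_loss(log_probs, rewards):
    """Policy-gradient loss weighted by reward minus the batch mean reward."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # Minimizing this raises the log probability of above-average samples
    # and lowers it for below-average ones.
    weighted = sum(a * lp for a, lp in zip(advantages, log_probs))
    return -weighted / len(rewards)


# Hypothetical batch: log probabilities of three generated molecules
# and the benchmark score each one received.
log_probs = [-1.0, -2.0, -3.0]
rewards = [1.0, 2.0, 6.0]
loss = batch_advantage_loss(log_probs, rewards)
```

Using the batch mean as the baseline needs no separate value network, and when every sample in a batch earns the same reward the loss is zero.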

All SMILES Variational Autoencoder Machine Learning

Variational autoencoders (VAEs) defined over SMILES string and graph-based representations of molecules promise to improve the optimization of molecular properties, thereby revolutionizing the pharmaceuticals and materials industries. However, these VAEs are hindered by the non-unique nature of SMILES strings and the computational cost of graph convolutions. To efficiently pass messages along all paths through the molecular graph, we encode multiple SMILES strings of a single molecule using a set of stacked recurrent neural networks, pooling hidden representations of each atom between SMILES representations, and use attentional pooling to build a final fixed-length latent representation. By then decoding to a disjoint set of SMILES strings of the molecule, our All SMILES VAE learns an almost bijective mapping between molecules and latent representations near the high-probability-mass subspace of the prior. Our SMILES-derived but molecule-based latent representations significantly surpass the state-of-the-art in a variety of fully- and semi-supervised property regression and molecular property optimization tasks.

Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules Machine Learning

Deep learning has proven to yield fast and accurate predictions of quantum-chemical properties to accelerate the discovery of novel molecules and materials. As an exhaustive exploration of the vast chemical space is still infeasible, we require generative models that guide our search towards systems with desired properties. While graph-based models have previously been proposed, they are restricted by a lack of spatial information such that they are unable to recognize spatial isomerism and non-bonded interactions. Here, we introduce a generative neural network for 3d point sets that respects the rotational invariance of the targeted structures. We apply it to the generation of molecules and demonstrate its ability to approximate the distribution of equilibrium structures using spatial metrics as well as established measures from chemoinformatics. As our model is able to capture the complex relationship between 3d geometry and electronic properties, we bias the distribution of the generator towards molecules with a small HOMO-LUMO gap, an important property for the design of organic solar cells.
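
Rotational invariance of a 3d point set can be checked numerically. The abstract does not say how the network achieves it; the sketch below assumes one common choice, describing the structure by pairwise distances, and confirms that those distances are unchanged by a rotation. The coordinates are made up and do not represent a real molecular geometry, and `rotate_z` and `pairwise_distances` are hypothetical helper names.

```python
import math


def rotate_z(points, angle):
    """Rotate 3d points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def pairwise_distances(points):
    """Distances between every pair of points, in a fixed pair order."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


# Made-up coordinates for a bent three-atom arrangement (not a real geometry).
points = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
rotated = rotate_z(points, math.radians(37.0))

before = pairwise_distances(points)
after = pairwise_distances(rotated)

# Largest change in any pairwise distance; only rounding error is expected.
max_change = max(abs(a - b) for a, b in zip(before, after))
```

A model that consumes only such distances gives the same output for every orientation of the same structure, so it never has to learn that rotated copies are equivalent.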