Plotting

 Lubbers, Nicholas


Optimal Invariant Bases for Atomistic Machine Learning

arXiv.org Machine Learning

The representation of atomic configurations for machine learning models has led to the development of numerous descriptors, often to describe the local environment of atoms. However, many of these representations are incomplete and/or functionally dependent. Incomplete descriptor sets are unable to represent all meaningful changes in the atomic environment. Complete constructions of atomic environment descriptors, on the other hand, often suffer from a high degree of functional dependence, where some descriptors can be written as functions of the others. These redundant descriptors do not provide additional power to discriminate between different atomic environments and increase the computational burden. By employing techniques from the pattern recognition literature to existing atomistic representations, we remove descriptors that are functions of other descriptors to produce the smallest possible set that satisfies completeness. We apply this in two ways: first we refine an existing description, the Atomistic Cluster Expansion. We show that this yields a more efficient subset of descriptors. Second, we augment an incomplete construction based on a scalar neural network, yielding a new message-passing network architecture that can recognize up to 5-body patterns in each neuron by taking advantage of an optimal set of Cartesian tensor invariants. This architecture shows strong accuracy on state-of-the-art benchmarks while retaining low computational cost. Our results not only yield improved models, but point the way to classes of invariant bases that minimize cost while maximizing expressivity for a host of applications.


Ensemble Knowledge Distillation for Machine Learning Interatomic Potentials

arXiv.org Artificial Intelligence

Machine learning interatomic potentials (MLIPs) are a promising tool to accelerate atomistic simulations and molecular property prediction. The quality of MLIPs strongly depends on the quantity of available training data as well as the quantum chemistry (QC) level of theory used to generate that data. Datasets generated with high-fidelity QC methods, such as coupled cluster, are typically restricted to small molecules and may be missing energy gradients. With this limited quantity of data, it is often difficult to train good MLIP models. We present an ensemble knowledge distillation (EKD) method to improve MLIP accuracy when trained to energy-only datasets. In our EKD approach, first, multiple teacher models are trained to QC energies and then used to generate atomic forces for all configurations in the dataset. Next, a student MLIP is trained to both QC energies and to ensemble-averaged forces generated by the teacher models. We apply this workflow on the ANI-1ccx dataset which consists of organic molecules with configuration energies computed at the coupled cluster level of theory. The resulting student MLIPs achieve new state-of-the-art accuracy on the out-of-sample COMP6 benchmark and improved stability for molecular dynamics simulations. The EKD approach for MLIP is broadly applicable for chemical, biomolecular and materials science simulations.


Model-Agnostic Knowledge Guided Correction for Improved Neural Surrogate Rollout

arXiv.org Artificial Intelligence

Modeling the evolution of physical systems is critical to many applications in science and engineering. As the evolution of these systems is governed by partial differential equations (PDEs), there are a number of computational simulations which resolve these systems with high accuracy. However, as these simulations incur high computational costs, they are infeasible to be employed for large-scale analysis. A popular alternative to simulators are neural network surrogates which are trained in a data-driven manner and are much more computationally efficient. However, these surrogate models suffer from high rollout error when used autoregressively, especially when confronted with training data paucity. Existing work proposes to improve surrogate rollout error by either including physical loss terms directly in the optimization of the model or incorporating computational simulators as'differentiable layers' in the neural network. Both of these approaches have their challenges, with physical loss functions suffering from slow convergence for stiff PDEs and simulator layers requiring gradients which are not always available, especially in legacy simulators. We propose the Hybrid PDE Predictor with Reinforcement Learning (HyPER) model: a modelagnostic, RL based, cost-aware model which combines a neural surrogate, RL decision model, and a physics simulator (with or without gradients) to reduce surrogate rollout error significantly. In addition to reducing in-distribution rollout error by 47%-78%, HyPER learns an intelligent policy that is adaptable to changing physical conditions and resistant to noise corruption. Scientific simulations have been the workhorse enabling novel discoveries across many scientific disciplines. However, executing fine-grained simulations of a scientific process of interest is a costly undertaking requiring large computational resources and long execution times.


Teacher-student training improves accuracy and efficiency of machine learning inter-atomic potentials

arXiv.org Machine Learning

Machine learning inter-atomic potentials (MLIPs) are revolutionizing the field of molecular dynamics (MD) simulations. Recent MLIPs have tended towards more complex architectures trained on larger datasets. The resulting increase in computational and memory costs may prohibit the application of these MLIPs to perform large-scale MD simulations. Here, we present a teacher-student training framework in which the latent knowledge from the teacher (atomic energies) is used to augment the students' training. We show that the light-weight student MLIPs have faster MD speeds at a fraction of the memory footprint compared to the teacher models. Remarkably, the student models can even surpass the accuracy of the teachers, even though both are trained on the same quantum chemistry dataset. Our work highlights a practical method for MLIPs to reduce the resources required for large-scale MD simulations.


Thermodynamic Transferability in Coarse-Grained Force Fields using Graph Neural Networks

arXiv.org Artificial Intelligence

Coarse-graining is a molecular modeling technique in which an atomistic system is represented in a simplified fashion that retains the most significant system features that contribute to a target output, while removing the degrees of freedom that are less relevant. This reduction in model complexity allows coarse-grained molecular simulations to reach increased spatial and temporal scales compared to corresponding all-atom models. A core challenge in coarse-graining is to construct a force field that represents the interactions in the new representation in a way that preserves the atomistic-level properties. Many approaches to building coarse-grained force fields have limited transferability between different thermodynamic conditions as a result of averaging over internal fluctuations at a specific thermodynamic state point. Here, we use a graph-convolutional neural network architecture, the Hierarchically Interacting Particle Neural Network with Tensor Sensitivity (HIP-NN-TS), to develop a highly automated training pipeline for coarse grained force fields which allows for studying the transferability of coarse-grained models based on the force-matching approach. We show that this approach not only yields highly accurate force fields, but also that these force fields are more transferable through a variety of thermodynamic conditions. These results illustrate the potential of machine learning techniques such as graph neural networks to improve the construction of transferable coarse-grained force fields.


Blackout Diffusion: Generative Diffusion Models in Discrete-State Spaces

arXiv.org Artificial Intelligence

Typical generative diffusion models rely on a Gaussian diffusion process for training the backward transformations, which can then be used to generate samples from Gaussian noise. However, real world data often takes place in discrete-state spaces, including many scientific applications. Here, we develop a theoretical formulation for arbitrary discrete-state Markov processes in the forward diffusion process using exact (as opposed to variational) analysis. We relate the theory to the existing continuous-state Gaussian diffusion as well as other approaches to discrete diffusion, and identify the corresponding reverse-time stochastic process and score function in the continuous-time setting, and the reverse-time mapping in the discrete-time setting. As an example of this framework, we introduce ``Blackout Diffusion'', which learns to produce samples from an empty image instead of from noise. Numerical experiments on the CIFAR-10, Binarized MNIST, and CelebA datasets confirm the feasibility of our approach. Generalizing from specific (Gaussian) forward processes to discrete-state processes without a variational approximation sheds light on how to interpret diffusion models, which we discuss.


Simple and efficient algorithms for training machine learning potentials to force data

arXiv.org Machine Learning

Abstract Machine learning models, trained on data from ab initio quantum simulations, are yielding molecular dynamics potentials with unprecedented accuracy. One limiting factor is the quantity of available training data, which can be expensive to obtain. A quantum simulation often provides all atomic forces, in addition to the total energy of the system. These forces provide much more information than the energy alone. It may appear that training a model to this large quantity of force data would introduce significant computational costs. Actually, training to all available force data should only be a few times more expensive than training to energies alone. Here, we present a new algorithm for efficient force training, and benchmark its accuracy by training to forces from real-world datasets for organic chemistry and bulk aluminum.


Less is more: sampling chemical space with active learning

arXiv.org Machine Learning

The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows our AL algorithm to automatically sample regions of chemical space where the machine learned potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach we develop the COMP6 benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. We show the use of our proposed AL technique develops a universal ANI potential (ANI-1x), which provides very accurate energy and force predictions on the entire COMP6 benchmark. This universal potential achieves a level of accuracy on par with the best ML potentials for single molecule or materials while remaining applicable to the general class of organic molecules comprised of the elements CHNO.


Hierarchical modeling of molecular energies using a deep neural network

arXiv.org Machine Learning

We introduce the Hierarchically Interacting Particle Neural Network (HIP-NN) to model molecular properties from datasets of quantum calculations. Inspired by a many-body expansion, HIP-NN decomposes properties, such as energy, as a sum over hierarchical terms. These terms are generated from a neural network--a composition of many nonlinear transformations--acting on a representation of the molecule. HIP-NN achieves state-of-the-art performance on a dataset of 131k ground state organic molecules, and predicts energies with 0.26 kcal/mol mean absolute error. With minimal tuning, our model is also competitive on a dataset of molecular dynamics trajectories. In addition to enabling accurate energy predictions, the hierarchical structure of HIP-NN helps to identify regions of model uncertainty.