Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
Babuji, Yadu, Blaiszik, Ben, Brettin, Tom, Chard, Kyle, Chard, Ryan, Clyde, Austin, Foster, Ian, Hong, Zhi, Jha, Shantenu, Li, Zhuozhao, Liu, Xuefeng, Ramanathan, Arvind, Ren, Yi, Saint, Nicholaus, Schwarting, Marcus, Stevens, Rick, van Dam, Hubertus, Wagner, Rick
Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources, representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
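As an illustration of the similarity searches these pre-computed fingerprints are meant to support, the minimal sketch below computes Morgan (ECFP-like) fingerprints and Tanimoto similarities with RDKit. The library choice, fingerprint parameters, and example molecules are illustrative assumptions, not details of the data release itself.

    # Minimal sketch: fingerprint-based similarity search (assumes RDKit is installed).
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example query
    library = [Chem.MolFromSmiles(s) for s in ("c1ccccc1O", "CC(=O)Nc1ccc(O)cc1")]

    # Morgan fingerprints, radius 2, 2048 bits (parameters are illustrative)
    fp_query = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
    for mol in library:
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        print(Chem.MolToSmiles(mol), DataStructs.TanimotoSimilarity(fp_query, fp))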
IRNet: A General Purpose Deep Residual Regression Framework for Materials Discovery
Jha, Dipendra, Ward, Logan, Yang, Zijiang, Wolverton, Christopher, Foster, Ian, Liao, Wei-keng, Choudhary, Alok, Agrawal, Ankit
Materials discovery is crucial for making scientific advances in many domains. Collections of data from experiments and first-principles computations have spurred interest in applying machine learning methods to create predictive models capable of mapping from composition and crystal structures to materials properties. Generally, these are regression problems with the input being a 1D vector composed of numerical attributes representing the material composition and/or crystal structure. While neural networks consisting of fully connected layers have been applied to such problems, their performance often suffers from the vanishing gradient problem when network depth is increased. In this paper, we study and propose design principles for building deep regression networks composed of fully connected layers with numerical vectors as input. We introduce a novel deep regression network with individual residual learning, IRNet, that places shortcut connections after each layer so that each layer learns the residual mapping between its output and input. We use the problem of learning properties of inorganic materials from numerical attributes derived from material composition and/or crystal structure to compare IRNet's performance against that of other machine learning techniques. Using multiple datasets from the Open Quantum Materials Database (OQMD) and Materials Project for training and evaluation, we show that IRNet provides significantly better prediction performance than the state-of-the-art machine learning approaches currently used by domain scientists. We also show that IRNet's use of individual residual learning leads to better convergence during the training phase than when shortcut connections are placed between multi-layer stacks, while maintaining the same number of parameters.
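The abstract's key idea, a shortcut connection after every fully connected layer rather than between multi-layer stacks, can be sketched as follows. This is not the authors' implementation; the layer widths, depth, and the batch normalization and ReLU choices are illustrative assumptions, written here in PyTorch.

    import torch
    import torch.nn as nn

    class IndividualResidualBlock(nn.Module):
        """One fully connected layer with its own shortcut: out = x + F(x)."""
        def __init__(self, dim):
            super().__init__()
            self.fc = nn.Linear(dim, dim)
            self.bn = nn.BatchNorm1d(dim)
            self.act = nn.ReLU()

        def forward(self, x):
            # Each layer learns only the residual between its input and output.
            return x + self.act(self.bn(self.fc(x)))

    class TinyIRNet(nn.Module):
        def __init__(self, in_dim, depth=8, width=256):
            super().__init__()
            self.stem = nn.Linear(in_dim, width)
            self.blocks = nn.Sequential(*[IndividualResidualBlock(width) for _ in range(depth)])
            self.head = nn.Linear(width, 1)  # scalar regression target, e.g. formation enthalpy

        def forward(self, x):
            return self.head(self.blocks(self.stem(x)))

    model = TinyIRNet(in_dim=145)    # e.g. composition-derived numerical attributes
    y = model(torch.randn(4, 145))   # batch of 4 materials -> 4 predictions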
Machine Learning Prediction of Accurate Atomization Energies of Organic Molecules from Low-Fidelity Quantum Chemical Calculations
Ward, Logan, Blaiszik, Ben, Foster, Ian, Assary, Rajeev S., Narayanan, Badri, Curtiss, Larry
Recent studies illustrate how machine learning (ML) can be used to bypass a core challenge of molecular modeling: the tradeoff between accuracy and computational cost. Here, we assess multiple ML approaches for predicting the atomization energy of organic molecules. Our resulting models learn the difference between low-fidelity (B3LYP) and high-accuracy (G4MP2) atomization energies, and predict the G4MP2 atomization energy to within 0.005 eV (mean absolute error) for molecules with fewer than 9 heavy atoms and 0.012 eV for a small set of molecules with between 10 and 14 heavy atoms. Our two best models, which have different accuracy/speed tradeoffs, enable the efficient prediction of G4MP2-level energies for large molecules and are available through a simple web interface.
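The Δ-learning setup described above can be sketched in a few lines: a regressor is fit to the difference between G4MP2 and B3LYP energies, and the high-accuracy value is recovered as the B3LYP energy plus the predicted correction. The features, data, and kernel ridge model below are random stand-ins and assumptions, not the paper's representations or models.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 32))       # molecular features (stand-in)
    e_b3lyp = rng.normal(size=100)       # low-fidelity energies (stand-in)
    e_g4mp2 = e_b3lyp + 0.01 * X[:, 0]   # high-accuracy energies (stand-in)

    model = KernelRidge(kernel="rbf", alpha=1e-3)
    model.fit(X, e_g4mp2 - e_b3lyp)      # learn only the B3LYP -> G4MP2 correction

    e_pred = e_b3lyp + model.predict(X)  # predicted G4MP2-level energies
    print(np.abs(e_pred - e_g4mp2).mean())  # mean absolute error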
DLHub: Model and Data Serving for Science
Chard, Ryan, Li, Zhuozhao, Chard, Kyle, Ward, Logan, Babuji, Yadu, Woodard, Anna, Tuecke, Steve, Blaiszik, Ben, Franklin, Michael J., Foster, Ian
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems, including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
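A minimal sketch of invoking a published servable through DLHub's serving interface, assuming the dlhub_sdk client package; the servable name and input payload are hypothetical placeholders, and instantiating the client normally triggers a Globus Auth login.

    from dlhub_sdk.client import DLHubClient

    client = DLHubClient()  # prompts for Globus Auth on first use

    # Servable names take the form "owner_username/model_name" (hypothetical here)
    result = client.run("owner_username/example_model", inputs=[[0.1, 0.2, 0.3]])
    print(result)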