Probing the properties of molecules and complex materials using machine learning


The application of machine learning to predicting the properties of small and large discrete (single) molecules and complex materials (polymeric, extended or mixtures of molecules) has been increasing exponentially over the past few decades. Unlike physics-based and rule-based computational systems, machine learning algorithms can learn complex relationships between physicochemical and process parameters and useful properties for an extremely diverse range of molecular entities. Both the breadth of machine learning methods and the range of physical, chemical, materials, biological, medical and many other application areas have increased markedly in the past decade. This Account summarises three decades of research into improved cheminformatics and machine learning methods and their application to drug design, regenerative medicine, biomaterials, porous and 2D materials, catalysts, biomarkers, surface science, physicochemical and phase properties, nanomaterials, electrical and optical properties, corrosion and battery research.

Scientists have always been fascinated by change, uncovering new aspects of Nature and finding useful ways to exploit them to meet global challenges. The rate of change is accelerating, with the average time between innovations decreasing exponentially (Figure 1). Computational molecular design prior to 1990 was focused on the use of computationally expensive physics-based methods like molecular modelling, molecular mechanics, molecular dynamics and quantum chemistry. The quantitative structure–activity relationship (QSAR) methods, developed by Hansch and Fujita in the 1960s, were based on the observation that changes in the constitution of small organic molecules generated a corresponding change in their biological activities. Regression methods were used to find relationships between structure, encoded by mathematical entities called descriptors or features, and biological properties of small organic molecules, also numerically encoded.
QSAR use was initially limited to modelling small data sets of molecules with similar scaffolds, with the primary aim of understanding the molecular basis for drug (or agrochemical) action. Because these models were not mechanism- or physics-based, their empirical nature created doubt as to their efficacy; this doubt, the question of when correlation implies causation (still an important issue) and a lack of data were major barriers to their wider adoption. Since the 1990s, technological developments in automation, computational power, algorithms, synthesis and informatics have maintained this exponential acceleration.
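A Hansch-style QSAR is, at heart, a multiple linear regression from molecular descriptors to activity. A minimal sketch follows; the descriptor values, coefficients and activities are entirely synthetic and purely illustrative, not data from any real QSAR study:

```python
import numpy as np

# Hypothetical Hansch-style QSAR: activity modelled as a linear function of
# two illustrative descriptors (a logP-like and a molar-refractivity-like term).
rng = np.random.default_rng(0)
n_molecules = 20
descriptors = rng.normal(size=(n_molecules, 2))   # columns: "logP", "MR"
true_coeffs = np.array([1.5, -0.8])               # assumed structure-activity relationship
activity = descriptors @ true_coeffs + 0.3        # intercept 0.3, noise-free for clarity

# Fit the regression: activity ~ b0 + b1*logP + b2*MR
X = np.column_stack([np.ones(n_molecules), descriptors])
coeffs, *_ = np.linalg.lstsq(X, activity, rcond=None)
print(coeffs)   # recovers [0.3, 1.5, -0.8] on this noise-free data
```

Real QSAR work differs mainly in scale and noise: hundreds of computed descriptors, measured (noisy) activities, and regularised or nonlinear learners in place of plain least squares.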

Machine-Learning Model Improves Gas Lift Performance and Well Integrity


The main objective of this work is to use machine-learning (ML) algorithms to develop a powerful model to predict well-integrity (WI) risk categories of gas-lifted wells. The model described in the complete paper can predict well-risk level and provides a unique method to convert the associated failure risk of each element in the well envelope into tangible values. The predictive model predicts the risk status of wells and classifies their integrity level into five categories, rather than the three broad-range categories used in qualitative risk classification:

- Category 1: too risky
- Category 2: still too risky, but less so than Category 1
- Category 3: medium risk, but can be elevated if additional barrier failures occur
- Category 4: low risk, but features some impaired barriers
- Category 5: the lowest risk

The failure model identifies whether the well is considered to be in failure mode. In addition, the model can identify wells that require prompt mitigation.
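The paper does not publish its scoring rules, but the idea of converting per-barrier failure risks into a tangible five-category score can be sketched as follows. All feature names, weights and thresholds below are assumptions for demonstration, not the authors' model:

```python
import numpy as np

def risk_category(score: float) -> int:
    """Map an aggregate well-integrity risk score in [0, 1] (1 = worst)
    to one of five categories (1 = too risky ... 5 = lowest risk).
    The category boundaries are assumed for illustration."""
    thresholds = [0.8, 0.6, 0.4, 0.2]
    for category, t in enumerate(thresholds, start=1):
        if score >= t:
            return category
    return 5

# Aggregate score as a weighted sum of per-barrier failure risks
# (barrier names and weights are hypothetical).
barrier_risks = np.array([0.9, 0.7, 0.2])   # e.g. annulus, wellhead seal, SSSV
weights = np.array([0.5, 0.3, 0.2])
score = float(barrier_risks @ weights)      # 0.7
print(risk_category(score))                 # → 2
```

An ML classifier would replace the fixed weights and thresholds with boundaries learned from historical failure records, but the input-to-category mapping has the same shape.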

Computer Vision - Richard Szeliski


As humans, we perceive the three-dimensional structure of the world around us with apparent ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers sitting on the table next to you. You can tell the shape and translucency of each petal through the subtle patterns of light and shading that play across its surface and effortlessly segment each flower from the background of the scene (Figure 1.1). Looking at a framed group portrait, you can easily count (and name) all of the people in the picture and even guess at their emotions from their facial appearance. Perceptual psychologists have spent decades trying to understand how the visual system works and, even though they can devise optical illusions to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains elusive (Marr 1982; Palmer 1999; Livingstone 2008).

Computer Vision and Deep Learning for Electricity - PyImageSearch


Universal access to affordable, reliable, and sustainable modern energy is a Sustainable Development Goal (SDG). However, insufficient power generation, poor transmission and distribution infrastructure, affordability, climate concerns, the diversification and decentralization of energy production, and changing demand patterns are creating complex challenges in power generation. According to the 2019 International Energy Agency (IEA) report, 860 million people lack access to electricity, and three billion people use open fires and simple stoves fueled by kerosene, biomass, or coal for cooking. As a result, over four million people die prematurely from the associated illnesses. Artificial intelligence (AI) offers great potential to lower energy costs, cut energy waste, and facilitate and accelerate the use of renewable and clean energy sources in power grids worldwide. In addition, it can help improve the planning, operation, and control of power systems.

Survey and Evaluation of Causal Discovery Methods for Time Series

Journal of Artificial Intelligence Research

We introduce in this survey the major concepts, models, and algorithms proposed so far to infer causal relations from observational time series, a task usually referred to as causal discovery in time series. To do so, after a description of the underlying concepts and modelling assumptions, we present different methods according to the family of approaches they belong to: Granger causality, constraint-based approaches, noise-based approaches, score-based approaches, logic-based approaches, topology-based approaches, and difference-based approaches. We then evaluate several representative methods to illustrate the behaviour of different families of approaches. This illustration is conducted on both artificial and real datasets with different characteristics. The main conclusions one can draw from this survey are that causal discovery in time series is an active research field in which new methods (in every family of approaches) are regularly proposed, and that no family or method stands out in all situations. Indeed, they all rely on assumptions that may or may not be appropriate for a particular dataset.
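Granger causality, the oldest family discussed, can be illustrated with a minimal two-variable example: x "Granger-causes" y if past values of x improve the prediction of y beyond what y's own past provides. The data-generating process and F-test below are a toy sketch, not the survey's benchmark code:

```python
import numpy as np

# Synthetic pair of series in which x drives y at lag 1.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def rss(target, predictors):
    """Residual sum of squares of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return float(resid @ resid)

# Restricted model: y[t] ~ y[t-1]. Full model adds x[t-1].
rss_restricted = rss(y[1:], [y[:-1]])
rss_full = rss(y[1:], [y[:-1], x[:-1]])

# F-statistic for the single added lag of x; large values reject
# the null hypothesis "x does not Granger-cause y".
f_stat = (rss_restricted - rss_full) / (rss_full / (n - 1 - 3))
print(f_stat > 10.0)   # → True: past x strongly improves prediction of y
```

The other families (constraint-based, noise-based, score-based, and so on) replace this predictive-improvement criterion with independence tests, noise asymmetries, or graph scores, which is why no single family dominates across datasets.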

Data-Driven Modelling of Polyethylene Recycling under High-Temperature Extrusion


Two main problems are studied in this article. The first one is the use of the extrusion process for controlled thermo-mechanical degradation of polyethylene for recycling applications. The second is the data-based modelling of such reactive extrusion processes. Polyethylenes (high density polyethylene (HDPE) and ultra-high molecular weight polyethylene (UHMWPE)) were extruded in a corotating twin-screw extruder under high temperatures (350 °C < T < 420 °C) for various process conditions (flow rate and screw rotation speed). These process conditions involved a decrease in the molecular weight due to degradation reactions. A numerical method based on the Carreau-Yasuda model was developed to predict the rheological behaviour (variation of the viscosity versus shear rate) from the in-line measurement of the die pressure. The results were successfully compared to the viscosity measured from offline measurement assuming the Cox-Merz law. Weight average molecular weights were estimated from the resulting zero-shear rate viscosity. Furthermore, the linear viscoelastic behaviours (frequency dependence of the complex shear modulus) were also used to predict the molecular weight distributions of final products by an inverse rheological method. Size exclusion chromatography (SEC) was performed on five samples, and the resulting molecular weight distributions were compared to the values obtained with the two aforementioned techniques. The values of weight average molecular weights were similar for the three techniques. The complete molecular weight distributions obtained by inverse rheology were similar to the SEC ones for extruded HDPE samples, but some inaccuracies were observed for extruded UHMWPE samples. The Ludovic® (SC-Consultants, Saint-Etienne, France) corotating twin-screw extrusion simulation software was used for classical process simulation.
However, as the rheo-kinetic laws of this process were unknown, the software could not predict all the flow characteristics successfully. Finally, machine learning techniques, able to operate in the low-data limit, were tested to build predictive models of the process outputs and material characteristics. Support Vector Machine Regression (SVR) and sparse Proper Generalized Decomposition (sPGD) techniques were chosen to predict the process outputs successfully. These methods were also applied to material characteristics data, and both were found to be effective in predicting molecular weights. More precisely, the sPGD gave better results than the SVR for the zero-shear viscosity prediction. Stochastic methods were also tested on some of the data and showed promising results.
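The Carreau-Yasuda model used above to describe the shear-thinning viscosity has a closed form that is easy to evaluate directly. A minimal sketch follows, with parameter values chosen for illustration rather than fitted to the HDPE/UHMWPE data:

```python
import numpy as np

# Carreau-Yasuda viscosity model:
#   eta(gdot) = eta_inf + (eta_0 - eta_inf) * [1 + (lam*gdot)^a]^((n-1)/a)
# eta_0: zero-shear viscosity, eta_inf: infinite-shear viscosity,
# lam: relaxation time, a: transition sharpness, n: power-law index.
def carreau_yasuda(gdot, eta_0, eta_inf, lam, a, n):
    return eta_inf + (eta_0 - eta_inf) * (1.0 + (lam * gdot) ** a) ** ((n - 1.0) / a)

shear_rates = np.logspace(-3, 3, 7)   # 1/s
eta = carreau_yasuda(shear_rates, eta_0=1e4, eta_inf=0.0, lam=1.0, a=2.0, n=0.4)

# At vanishing shear rate the model plateaus at the zero-shear viscosity eta_0,
# which is the quantity the paper uses to estimate weight-average molecular weight.
print(eta[0])   # ~1e4 Pa.s (zero-shear plateau); eta falls as shear rate grows
```

Fitting these five parameters to in-line die-pressure data is what turns this algebraic model into the paper's predictive rheological method.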

Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression (Machine Learning)

In traditional logistic regression models, the link function is often assumed to be linear and continuous in the predictors. Here, we consider a threshold model in which all continuous features are discretized into ordinal levels, which in turn determine the binary responses. Both the threshold points and the regression coefficients are unknown and must be estimated. For high-dimensional data, we propose a fusion penalized logistic threshold regression (FILTER) model, in which a fused lasso penalty is employed to control the total variation and shrink the coefficients to zero as a method of variable selection. Under mild conditions on the estimates of the unknown threshold points, we establish a non-asymptotic error bound for coefficient estimation and model selection consistency. With a careful characterization of the error propagation, we also show that tree-based methods, such as CART, fulfill the threshold estimation conditions. We find the FILTER model well suited to the early detection and prediction of chronic diseases such as diabetes, using physical examination data. The finite-sample behavior of our proposed method is also explored through extensive Monte Carlo studies, which support our theoretical discoveries.
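The role of tree-based threshold estimation can be sketched on a single feature: a CART-style stump recovers the unknown cut point through which the feature drives the binary response. All values below are synthetic, and the impurity search is a one-split caricature of CART, not the paper's estimator:

```python
import numpy as np

# Binary response depends on x only through whether x exceeds an unknown cut point.
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0, 1, size=n)
true_cut = 0.6
p = np.where(x > true_cut, 0.9, 0.1)    # response probability per ordinal level
y = rng.random(n) < p

def gini(labels):
    """Gini impurity of a binary label vector."""
    if len(labels) == 0:
        return 0.0
    q = labels.mean()
    return 2.0 * q * (1.0 - q)

def split_cost(c):
    """Weighted Gini impurity of splitting at candidate cut point c."""
    left, right = y[x <= c], y[x > c]
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# CART-style stump: the impurity-minimizing split estimates the threshold point.
candidates = np.linspace(0.05, 0.95, 181)
est_cut = min(candidates, key=split_cost)
print(round(est_cut, 2))   # close to the true cut point 0.6
```

In the full FILTER pipeline this estimated cut point discretizes the feature into ordinal levels, and the fused lasso then penalizes differences between the coefficients of adjacent levels.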

Bernstein Flows for Flexible Posteriors in Variational Bayes (Machine Learning)

Variational inference (VI) is a technique for approximating difficult-to-compute posteriors by optimization. In contrast to MCMC, VI scales to many observations. In the case of complex posteriors, however, state-of-the-art VI approaches often yield unsatisfactory posterior approximations. This paper presents Bernstein flow variational inference (BF-VI), a robust and easy-to-use method flexible enough to approximate complex multivariate posteriors. BF-VI combines ideas from normalizing flows and Bernstein-polynomial-based transformation models. In benchmark experiments, we compare BF-VI solutions with exact posteriors, MCMC solutions, and state-of-the-art VI methods, including normalizing-flow-based VI. We show for low-dimensional models that BF-VI accurately approximates the true posterior; in higher-dimensional models, BF-VI outperforms other VI methods. Further, using BF-VI we develop a Bayesian model for the semi-structured Melanoma challenge data, combining a CNN model part for image data with an interpretable model part for tabular data, and demonstrate for the first time the use of VI in semi-structured models.
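The core building block of BF-VI, a Bernstein-polynomial transformation, can be sketched in a few lines. The coefficients below are illustrative; the key property is that increasing coefficients guarantee a monotone, hence invertible, transform on [0, 1], which is what makes it usable as a normalizing-flow layer:

```python
import numpy as np
from math import comb

def bernstein_transform(z, theta):
    """Evaluate f(z) = sum_k theta_k * B_{k,M}(z), z in [0, 1],
    where B_{k,M}(z) = C(M,k) z^k (1-z)^(M-k) is the Bernstein basis."""
    M = len(theta) - 1
    k = np.arange(M + 1)
    binom = np.array([comb(M, j) for j in k], dtype=float)
    basis = binom * z[:, None] ** k * (1.0 - z[:, None]) ** (M - k)
    return basis @ np.asarray(theta)

theta = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])   # increasing -> monotone transform
z = np.linspace(0.0, 1.0, 101)
f = bernstein_transform(z, theta)

# Endpoints of a Bernstein polynomial equal its first and last coefficients,
# and monotone theta yields a strictly increasing f (checked numerically here).
print(f[0], f[-1])             # → -3.0 4.0
print(np.all(np.diff(f) > 0))  # → True
```

In BF-VI the coefficients theta are the variational parameters: optimizing them reshapes the transform, and with it the approximate posterior, far more flexibly than a Gaussian mean-field family.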

Adjoint-aided inference of Gaussian process driven differential equations (Machine Learning)

Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, after using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and by testing on synthetic data, show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.
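One common choice of truncated basis expansion for a squared-exponential GP kernel is random Fourier features; the paper's exact basis may differ, so treat this as an illustrative stand-in. The finite feature map's inner products approximate the kernel, and the approximation improves with the number of basis functions:

```python
import numpy as np

# Random Fourier feature expansion of a 1D squared-exponential kernel
#   k(x, y) = exp(-(x - y)^2 / (2 l^2)),
# whose spectral density is Gaussian with standard deviation 1/l.
rng = np.random.default_rng(3)
lengthscale, n_features = 0.5, 2000
omega = rng.normal(scale=1.0 / lengthscale, size=n_features)   # spectral frequencies
phase = rng.uniform(0, 2 * np.pi, size=n_features)

def features(x):
    """phi(x) such that phi(x) @ phi(y) approximates k(x, y)."""
    return np.sqrt(2.0 / n_features) * np.cos(np.outer(x, omega) + phase)

x = np.array([0.0, 0.3])
K_approx = features(x) @ features(x).T
K_exact = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * lengthscale ** 2))
print(np.max(np.abs(K_approx - K_exact)))   # small, shrinks as n_features grows
```

Once the GP forcing is written in such a finite basis, inference reduces to a linear-Gaussian problem in the basis weights, which is what enables the exact conjugate treatment described in the abstract.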

A survey of unsupervised learning methods for high-dimensional uncertainty quantification in black-box-type problems (Machine Learning)

Constructing surrogate models for uncertainty quantification (UQ) on complex partial differential equations (PDEs) having inherently high-dimensional $\mathcal{O}(10^{\ge 2})$ stochastic inputs (e.g., forcing terms, boundary conditions, initial conditions) poses tremendous challenges. The curse of dimensionality can be addressed with suitable unsupervised learning techniques used as a pre-processing tool to encode inputs onto lower-dimensional subspaces while retaining their structural information and meaningful properties. In this work, we review and investigate thirteen dimension reduction methods, including linear and nonlinear, spectral, blind source separation, convex and non-convex methods, and utilize the resulting embeddings to construct a mapping to quantities of interest via polynomial chaos expansions (PCE). We refer to the general proposed approach as manifold PCE (m-PCE), where manifold corresponds to the latent space resulting from any of the studied dimension reduction methods. To investigate the capabilities and limitations of these methods, we conduct numerical tests for three physics-based systems (treated as black boxes) having high-dimensional stochastic inputs of varying complexity, modeled as both Gaussian and non-Gaussian random fields, and assess the effect of the intrinsic dimensionality of the input data. We demonstrate both the advantages and limitations of the unsupervised learning methods and conclude that a suitable m-PCE model provides a cost-effective approach compared to alternative algorithms proposed in the literature, including recently proposed expensive deep neural network-based surrogates, and can be readily applied for high-dimensional UQ in stochastic PDEs.
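The m-PCE pipeline can be sketched with PCA as the unsupervised encoder and an ordinary polynomial regression standing in for the PCE; this is an illustrative toy, with synthetic data whose inputs secretly live on a low-dimensional subspace:

```python
import numpy as np

# 1) encode high-dimensional stochastic inputs onto a low-dimensional subspace (PCA),
# 2) map the latent coordinates to a quantity of interest with a polynomial surrogate.
rng = np.random.default_rng(4)
n_samples, ambient_dim, latent_dim = 300, 100, 2

# Inputs in R^100 that actually live on a 2-dimensional subspace.
latent = rng.normal(size=(n_samples, latent_dim))
mixing = rng.normal(size=(latent_dim, ambient_dim))
inputs = latent @ mixing

# Quantity of interest: a polynomial in the hidden latent coordinates.
qoi = 1.0 + latent[:, 0] + 0.5 * latent[:, 1] ** 2

# Step 1: PCA via SVD as the unsupervised dimension-reduction stage.
_, _, Vt = np.linalg.svd(inputs - inputs.mean(0), full_matrices=False)
z = inputs @ Vt[:latent_dim].T    # 2D embedding of the 100D inputs

# Step 2: degree-2 polynomial surrogate on the embedding (PCE-like regression).
design = np.column_stack([np.ones(n_samples), z, z ** 2, z[:, 0] * z[:, 1]])
coeffs, *_ = np.linalg.lstsq(design, qoi, rcond=None)
pred = design @ coeffs
print(np.max(np.abs(pred - qoi)))   # tiny: the surrogate recovers the QoI exactly
```

The real m-PCE replaces PCA with any of the thirteen reviewed encoders and the plain polynomial with an orthogonal polynomial chaos basis, but the two-stage encode-then-regress structure is the same.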