Takeuchi, Ichiro
Automated Materials Discovery Platform Realized: Scanning Probe Microscopy of Combinatorial Libraries
Liu, Yu, Pant, Rohit, Takeuchi, Ichiro, Spurling, R. Jackson, Maria, Jon-Paul, Ziatdinov, Maxim, Kalinin, Sergei V.
These libraries typically contain binary or ternary isothermal cross-sections of multicomponent phase diagrams, and more advanced synthesis methods can generate spatially encoded 4D and 5D compositional spaces [1]. This versatility makes them well-suited both for optimizing materials through direct exploration of compositional spaces and for advancing physics discovery by exploring property and microstructure evolution [2-10]. Additionally, temperature gradients during synthesis can help reveal the effects of synthesis variables, while localized ion- or laser-based annealing enables broader exploration of the processing and chemical spaces within the selected material systems [8, 11, 12]. The first experiments in combinatorial research date back to the 1960s [13, 14], with renewed interest in the 1990s following the discovery of high-temperature superconductors [3, 4, 11, 15-17]. However, it quickly became apparent that successful combinatorial research requires not only synthesis but also detailed characterization, along with the ability to derive insights from characterization results and to use them for subsequent experiment planning or for transitioning toward different fabrication routes.
Reward driven workflows for unsupervised explainable analysis of phases and ferroic variants from atomically resolved imaging data
Barakati, Kamyar, Liu, Yu, Nelson, Chris, Ziatdinov, Maxim A., Zhang, Xiaohang, Takeuchi, Ichiro, Kalinin, Sergei V.
Rapid progress in aberration-corrected electron microscopy necessitates the development of robust methods for the identification of phases, ferroic variants, and other pertinent aspects of materials structure from imaging data. While unsupervised methods for clustering and classification are widely used for these tasks, their performance can be sensitive to hyperparameter selection in the analysis workflow. In this study, we explore the effects of descriptors and hyperparameters on the capability of unsupervised ML methods to distill local structural information, exemplified by the discovery of polarization and lattice distortion in Sm-doped BiFeO3 (BFO) thin films. We demonstrate that a reward-driven approach can be used to optimize these key hyperparameters across the full workflow, where the rewards are designed to reflect domain-wall continuity and straightness, ensuring that the analysis aligns with the material's physical behavior. This approach allows us to discover local descriptors that are best aligned with the specific physical behavior, providing insight into the fundamental physics of materials. We further extend the reward-driven workflows to disentangle structural factors of variation via an optimized variational autoencoder (VAE). Finally, we explore the importance of well-defined rewards as a quantifiable measure of the workflow's success.
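A minimal sketch of such a reward-driven workflow optimization, assuming a 2D image array, sliding-window descriptors, PCA plus k-means as the unsupervised step, and a simplified neighbour-agreement reward standing in for the paper's domain-wall continuity and straightness rewards (all function names and parameter values below are illustrative, not the authors' implementation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def extract_patches(image, window):
    """Slide a (window x window) descriptor over a 2D image and flatten each patch."""
    h, w = image.shape
    return np.array([
        image[i:i + window, j:j + window].ravel()
        for i in range(h - window)
        for j in range(w - window)
    ])

def reward(labels, grid_shape):
    """Placeholder reward: agreement between vertically adjacent labels, a crude proxy
    for domain-wall continuity/straightness (the paper's actual rewards are more elaborate)."""
    lab = labels.reshape(grid_shape)
    return np.mean(lab[1:, :] == lab[:-1, :])

def optimize_workflow(image, windows=(8, 16, 24), cluster_counts=(2, 3, 4)):
    """Reward-driven grid search over descriptor window size and cluster number."""
    best_params, best_reward = None, -np.inf
    for win in windows:
        patches = extract_patches(image, win)
        latent = PCA(n_components=8).fit_transform(patches)
        grid_shape = (image.shape[0] - win, image.shape[1] - win)
        for k in cluster_counts:
            labels = KMeans(n_clusters=k, n_init=5).fit_predict(latent)
            r = reward(labels, grid_shape)
            if r > best_reward:
                best_params, best_reward = {"window": win, "n_clusters": k}, r
    return best_params, best_reward
```

The point of the sketch is only that the reward, rather than the analyst, selects the descriptor and clustering hyperparameters.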
Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design
Boyar, Onur, Hanada, Hiroyuki, Takeuchi, Ichiro
The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise in creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and in finding such molecules efficiently. To address this, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which combines a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to modify molecules strategically while maintaining similarity to the original input. Our LSBO setting improves the sample-efficiency of our optimization, and our modification approach helps us obtain molecules with higher chances of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our experiments demonstrate that CLaSMO efficiently enhances target properties with minimal substructure modifications, achieving state-of-the-art results with a smaller model and dataset compared to existing methods. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a Human-in-the-Loop setting.
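A minimal sketch of one latent-space Bayesian optimization step, assuming a trained (C)VAE whose decoder and property evaluator are supplied by the caller; the `decode` and `score` callables are hypothetical stand-ins, and the CVAE conditioning on atomic environments used by CLaSMO is not reproduced here:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def lsbo_step(Z, y, decode, score, n_candidates=1024, rng=None):
    """One latent-space BO iteration: fit a GP surrogate on (latent, property) pairs,
    pick the candidate latent point with the highest expected improvement, decode it,
    and evaluate the property."""
    rng = rng or np.random.default_rng(0)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(Z, y)
    candidates = rng.normal(size=(n_candidates, Z.shape[1]))     # prior-like sampling
    mu, sigma = gp.predict(candidates, return_std=True)
    imp = mu - y.max()                                           # improvement over incumbent
    u = imp / (sigma + 1e-9)
    ei = imp * norm.cdf(u) + sigma * norm.pdf(u)                 # expected improvement
    z_next = candidates[np.argmax(ei)]
    molecule = decode(z_next)
    return z_next, molecule, score(molecule)
```

Repeating this step and appending the new (latent, property) pair to Z and y gives the usual sample-efficient BO loop in the latent space.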
Real-time experiment-theory closed-loop interaction for autonomous materials science
Liang, Haotong, Wang, Chuangye, Yu, Heshan, Kirsch, Dylan, Pant, Rohit, McDannald, Austin, Kusne, A. Gilad, Zhao, Ji-Cheng, Takeuchi, Ichiro
Iterative cycles of theoretical prediction and experimental validation are the cornerstone of the modern scientific method. However, the proverbial "closing of the loop" in experiment-theory cycles is, in practice, usually ad hoc, often inherently difficult, and impractical to repeat on a systematic basis, beset by the scale or time constraints of the computation or of the phenomena under study. Here, we demonstrate the Autonomous MAterials Search Engine (AMASE), where we enlist robot science to perform self-driving, continuous, cyclical interaction of experiments and computational predictions for materials exploration. In particular, we have applied the AMASE formalism to the rapid mapping of a temperature-composition phase diagram, a fundamental task for the search and discovery of new materials. Thermal processing and experimental determination of compositional phase boundaries in thin films are autonomously interspersed with real-time updating of the phase diagram prediction through the minimization of Gibbs free energies. AMASE was able to accurately determine the eutectic phase diagram of the Sn-Bi binary thin-film system on the fly from a self-guided campaign covering just a small fraction of the entire composition-temperature phase space, translating to a 6-fold reduction in the number of necessary experiments. This study demonstrates for the first time the possibility of real-time, autonomous, and iterative interactions of experiments and theory carried out without any human intervention.
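The sketch below illustrates the general shape of such a closed loop in a deliberately simplified form: a stand-in `measure_phase` callable plays the role of autonomous synthesis and characterization, and a Gaussian-process classifier stands in for the Gibbs-free-energy-based theory update actually used by AMASE (all names, grids, and settings are illustrative assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

def closed_loop_phase_mapping(measure_phase, n_init=5, n_steps=20, rng=None):
    """Toy closed loop over a composition-temperature grid: measure a few points, fit a
    phase classifier, then repeatedly measure wherever the model is least certain.
    Assumes the initial points already cover at least two phases."""
    rng = rng or np.random.default_rng(0)
    grid = np.array([(x, T) for x in np.linspace(0.0, 1.0, 25)
                            for T in np.linspace(300.0, 600.0, 25)])
    measured = list(rng.choice(len(grid), size=n_init, replace=False))
    labels = [measure_phase(*grid[i]) for i in measured]
    clf = None
    for _ in range(n_steps):
        clf = GaussianProcessClassifier().fit(grid[measured], labels)
        uncertainty = 1.0 - clf.predict_proba(grid).max(axis=1)   # high near phase boundaries
        unmeasured = np.setdiff1d(np.arange(len(grid)), measured)
        nxt = int(unmeasured[np.argmax(uncertainty[unmeasured])])
        measured.append(nxt)
        labels.append(measure_phase(*grid[nxt]))
    return clf, grid[measured], labels
```

The acquisition rule simply revisits the most ambiguous unmeasured grid point, which concentrates measurements near phase boundaries; the real system replaces both the classifier and the acquisition rule with thermodynamically grounded counterparts.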
Statistical testing on generative AI anomaly detection tools in Alzheimer's Disease diagnosis
He, Rosemary, Takeuchi, Ichiro
Alzheimer's Disease is challenging to diagnose due to our limited understanding of its mechanism and the large heterogeneity among patients. Neurodegeneration, which can be measured from time-series MRI progression, is widely studied as a biomarker for clinical diagnosis. On the other hand, generative AI has shown promise in anomaly detection in medical imaging and has been used for tasks such as tumor detection. However, testing the reliability of such data-driven methods is non-trivial due to the issue of double-dipping in hypothesis testing. In this work, we propose to solve this issue with selective inference and develop a reliable generative AI method for Alzheimer's prediction. We show that, compared to traditional statistical methods with highly inflated p-values, selective inference successfully controls the false discovery rate under the desired alpha level while retaining statistical power. In practice, our pipeline could assist clinicians in Alzheimer's diagnosis and early intervention.
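The core correction can be illustrated with a one-dimensional toy case: if an anomaly score is tested only because it exceeded a data-driven threshold, a valid p-value must condition on that selection event. The snippet below is a minimal truncated-Gaussian illustration under assumed N(0, 1) null scores, not the paper's construction for generative-model anomaly maps:

```python
from scipy.stats import norm

def naive_and_selective_p(z, threshold=2.0):
    """For a z-score tested only because it exceeded a data-driven threshold, the naive
    p-value ignores the selection, while the selective p-value conditions on the
    selection event, i.e. a truncated-Gaussian test (one-sided illustration)."""
    naive = norm.sf(z)                              # P(Z > z) under N(0, 1)
    selective = norm.sf(z) / norm.sf(threshold)     # P(Z > z | Z > threshold)
    return naive, selective
```

For example, with z = 2.4 and threshold 2.0, the naive p-value is about 0.008 while the selective p-value is about 0.36, showing how conditioning on selection removes the inflation.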
Statistical Test for Auto Feature Engineering by Selective Inference
Matsukawa, Tatsuya, Shiraishi, Tomohiro, Nishino, Shuichi, Katsuoka, Teruyuki, Takeuchi, Ichiro
Auto Feature Engineering (AFE) plays a crucial role in developing practical machine learning pipelines by automating the transformation of raw data into meaningful features that enhance model performance. By generating features in a data-driven manner, AFE enables the discovery of important features that may not be apparent through human experience or intuition. On the other hand, since AFE generates features based on data, there is a risk that these features may be overly adapted to the data, making it essential to assess their reliability appropriately. Unfortunately, because most AFE problems are formulated as combinatorial search problems and solved by heuristic algorithms, it has been challenging to theoretically quantify the reliability of generated features. To address this issue, we propose a new statistical test for features generated by AFE algorithms based on a framework called selective inference. As a proof of concept, we consider a simple class of tree-search-based heuristic AFE algorithms and address the problem of testing the generated features when they are used in a linear model. The proposed test can quantify the statistical significance of the generated features in the form of $p$-values, enabling theoretically guaranteed control of the risk of false findings.
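The problem the test addresses can be demonstrated with a short simulation: under a global null, searching for the engineered feature most correlated with the response and then running an ordinary t-test on that same feature inflates the false positive rate far above the nominal level. The sketch below shows only this naive baseline (the selective-inference correction, which conditions on the search path, is not reproduced here); all sizes and names are illustrative:

```python
import numpy as np
from scipy import stats

def naive_p_after_feature_search(rng, n=100, n_candidates=20):
    """Under a global null, generate candidate engineered features, select the one most
    correlated with the response, and run an ordinary t-test on that same feature.
    The selection step is the 'double dipping' that makes this naive p-value invalid."""
    y = rng.normal(size=n)
    X = rng.normal(size=(n, n_candidates))           # stand-ins for AFE-generated features
    j = int(np.argmax(np.abs(X.T @ y)))              # data-driven feature selection
    _, _, _, p_value, _ = stats.linregress(X[:, j], y)
    return p_value

rng = np.random.default_rng(0)
pvals = np.array([naive_p_after_feature_search(rng) for _ in range(500)])
print("naive false positive rate at alpha=0.05:", np.mean(pvals < 0.05))   # far above 0.05
```

The printed rate is typically far above the nominal 0.05, which is exactly the over-adaptation the proposed selective test is designed to prevent.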
Statistical Test for Data Analysis Pipeline by Selective Inference
Shiraishi, Tomohiro, Matsukawa, Tatsuya, Nishino, Shuichi, Takeuchi, Ichiro
A data analysis pipeline is a structured sequence of processing steps that transforms raw data into meaningful insights by effectively integrating various analysis algorithms. In this paper, we propose a novel statistical test designed to assess the statistical significance of data analysis pipelines. Our approach allows for the systematic development of valid statistical tests applicable to any data analysis pipeline configuration composed of a set of data analysis components. We have developed this framework by adapting selective inference, which has gained recent attention as a new statistical inference technique for data-driven hypotheses. The proposed statistical test is theoretically designed to control the type I error at the desired significance level in finite samples. As examples, we consider a class of pipelines composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We confirm the validity of our statistical test through experiments with both synthetic and real data for this class of data analysis pipelines. Additionally, we present an implementation framework that facilitates testing across any configuration of data analysis pipelines in this class without extra implementation costs.
Distributionally Robust Safe Sample Screening
Hanada, Hiroyuki, Aoyama, Tatsuya, Akahane, Satoshi, Tanaka, Tomonari, Okura, Yoshito, Inatsu, Yu, Hashimoto, Noriaki, Takeno, Shion, Murayama, Taro, Lee, Hanju, Kojima, Shinya, Takeuchi, Ichiro
In this study, we propose a machine learning method called Distributionally Robust Safe Sample Screening (DRSSS). DRSSS aims to identify unnecessary training samples, even when the distribution of the training samples changes in the future. To achieve this, we effectively combine the distributionally robust (DR) paradigm, which aims to enhance model robustness against variations in data distribution, with safe sample screening (SSS), which identifies unnecessary training samples prior to model training. Since an infinite number of scenarios for the change in distribution must be considered, we adopt SSS because it does not require retraining the model after the distribution changes. In this paper, we employ the covariate-shift framework to represent the distribution of training samples and reformulate the DR covariate-shift problem as a weighted empirical risk minimization problem, where the weights are subject to uncertainty within a predetermined range. By extending the existing SSS technique to accommodate this weight uncertainty, the DRSSS method is capable of reliably identifying unnecessary samples under any future distribution within a specified range. We provide a theoretical guarantee for the DRSSS method and validate its performance through numerical experiments on both synthetic and real-world datasets.
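A minimal sketch of the underlying safe-sample-screening certificate, assuming an L2-regularized hinge-loss model and a precomputed ball of radius `radius` around a reference solution `w_ref` that is known to contain the optimum (both are assumptions of this illustration; the DRSSS extension that makes the certificate hold under bounded covariate-shift weights is not shown):

```python
import numpy as np

def safe_sample_screening(X, y, w_ref, radius):
    """Sphere-based safe sample screening for an L2-regularized hinge-loss model: if the
    optimal weights are guaranteed to lie within `radius` of `w_ref`, any sample whose
    worst-case margin  y_i * (x_i @ w)  still exceeds 1 over that ball can never become
    a support vector and can be discarded before training. Returns a boolean mask of
    provably unnecessary samples."""
    margins = y * (X @ w_ref)                                # margins at the reference solution
    worst_case = margins - radius * np.linalg.norm(X, axis=1)
    return worst_case > 1.0
```

Samples flagged True can be removed before training without changing the optimum; DRSSS derives bounds of this kind that remain valid under any reweighting within the prescribed uncertainty set.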
Crystal-LSBO: Automated Design of De Novo Crystals with Latent Space Bayesian Optimization
Boyar, Onur, Gu, Yanheng, Tanaka, Yuji, Tonogai, Shunsuke, Itakura, Tomoya, Takeuchi, Ichiro
Generative modeling of crystal structures is significantly challenged by the complexity of the input data, which constrains the ability of these models to explore and discover novel crystals. This complexity often confines de novo design methodologies to merely small perturbations of known crystals and hampers the effective application of advanced optimization techniques. One such technique, Latent Space Bayesian Optimization (LSBO), has demonstrated promising results in uncovering novel objects across various domains, especially when combined with Variational Autoencoders (VAEs). Recognizing LSBO's potential and the critical need for innovative crystal discovery, we introduce Crystal-LSBO, a de novo design framework for crystals specifically tailored to enhance explorability within LSBO frameworks. Crystal-LSBO employs multiple VAEs, each dedicated to a distinct aspect of crystal structure (lattice, coordinates, and chemical elements), orchestrated by an integrative model that synthesizes these components into a cohesive output. This setup not only streamlines the learning process but also produces explorable latent spaces, thanks to the decreased complexity of the learning task for each model, enabling LSBO approaches to operate. Our study pioneers the use of LSBO for de novo crystal design, demonstrating its efficacy through optimization tasks focused mainly on formation energy values. Our results highlight the effectiveness of our methodology, offering a new perspective for de novo crystal discovery.
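A structural sketch of the multi-VAE idea, assuming PyTorch and purely illustrative layer sizes: one small VAE per structural factor plus an integrative head that maps the concatenated factor latents into a single space where LSBO can run. This is not the authors' architecture or training objective, only the composition pattern described above:

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    """Minimal VAE block reused for each structural factor (lattice parameters,
    fractional coordinates, element composition). Layer sizes are illustrative."""
    def __init__(self, in_dim, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))

    def encode(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return z, mu, logvar

class MultiVAECrystalModel(nn.Module):
    """Three factor-specific VAEs plus an integrative head that maps the concatenated
    factor latents into one joint latent space where LSBO can operate."""
    def __init__(self, lattice_dim=6, coord_dim=30, elem_dim=20, joint_dim=16):
        super().__init__()
        self.lattice_vae = TinyVAE(lattice_dim)
        self.coord_vae = TinyVAE(coord_dim)
        self.elem_vae = TinyVAE(elem_dim)
        self.integrate = nn.Linear(3 * 8, joint_dim)   # 3 factor latents of size 8

    def joint_latent(self, lattice, coords, elems):
        z_lat, _, _ = self.lattice_vae.encode(lattice)
        z_crd, _, _ = self.coord_vae.encode(coords)
        z_elm, _, _ = self.elem_vae.encode(elems)
        return self.integrate(torch.cat([z_lat, z_crd, z_elm], dim=-1))
```

LSBO would then run in the joint latent space, with decoding back to candidate crystals handled by a corresponding integrative decoder (omitted here for brevity).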
Distributionally Robust Safe Screening
Hanada, Hiroyuki, Akahane, Satoshi, Aoyama, Tatsuya, Tanaka, Tomonari, Okura, Yoshito, Inatsu, Yu, Hashimoto, Noriaki, Murayama, Taro, Lee, Hanju, Kojima, Shinya, Takeuchi, Ichiro
In this study, we propose a method, Distributionally Robust Safe Screening (DRSS), for identifying unnecessary samples and features within a distributionally robust (DR) covariate-shift setting. This method effectively combines DR learning, a paradigm aimed at enhancing model robustness against variations in data distribution, with safe screening (SS), a sparse optimization technique designed to identify irrelevant samples and features prior to model training. The core concept of the DRSS method involves reformulating the DR covariate-shift problem as a weighted empirical risk minimization problem, where the weights are subject to uncertainty within a predetermined range. By extending the SS technique to accommodate this weight uncertainty, the DRSS method is capable of reliably identifying unnecessary samples and features under any future distribution within a specified range. We provide a theoretical guarantee for the DRSS method and validate its performance through numerical experiments on both synthetic and real-world datasets.