bandgap
When Active Learning Fails, Uncalibrated Out of Distribution Uncertainty Quantification Might Be the Problem
Dale, Ashley S., Li, Kangming, DeCost, Brian, Wan, Hao, Han, Yuchen, Fehlis, Yao, Hattrick-Simpers, Jason
Efficiently and meaningfully estimating prediction uncertainty is important for exploration in active learning campaigns in materials discovery, where samples with high uncertainty are interpreted as containing information missing from the model. In this work, the effect of different uncertainty estimation and calibration methods are evaluated for active learning when using ensembles of ALIGNN, eXtreme Gradient Boost, Random Forest, and Neural Network model architectures. We compare uncertainty estimates from ALIGNN deep ensembles to loss landscape uncertainty estimates obtained for solubility, bandgap, and formation energy prediction tasks. We then evaluate how the quality of the uncertainty estimate impacts an active learning campaign that seeks model generalization to out-of-distribution data. Uncertainty calibration methods were found to variably generalize from in-domain data to out-of-domain data. Furthermore, calibrated uncertainties were generally unsuccessful in reducing the amount of data required by a model to improve during an active learning campaign on out-of-distribution data when compared to random sampling and uncalibrated uncertainties. The impact of poor-quality uncertainty persists for random forest and eXtreme Gradient Boosting models trained on the same data for the same tasks, indicating that this is at least partially intrinsic to the data and not due to model capacity alone. Analysis of the target, in-distribution uncertainty, out-of-distribution uncertainty, and training residual distributions suggest that future work focus on understanding empirical uncertainties in the feature input space for cases where ensemble prediction variances do not accurately capture the missing information required for the model to generalize.
Guiding Generative Models to Uncover Diverse and Novel Crystals via Reinforcement Learning
Discovering functional crystalline materials entails navigating an immense combinatorial design space. While recent advances in generative artificial intelligence have enabled the sampling of chemically plausible compositions and structures, a fundamental challenge remains: the objective misalignment between likelihood-based sampling in generative modelling and targeted focus on underexplored regions where novel compounds reside. Here, we introduce a reinforcement learning framework that guides latent denoising diffusion models toward diverse and novel, yet thermodynamically viable crystalline compounds. Our approach integrates group relative policy optimisation with verifiable, multi-objective rewards that jointly balance creativity, stability, and diversity. Beyond de novo generation, we demonstrate enhanced property-guided design that preserves chemical validity, while targeting desired functional properties. This approach establishes a modular foundation for controllable AI-driven inverse design that addresses the novelty-validity trade-off across scientific discovery applications of generative models.
Extended Factorization Machine Annealing for Rapid Discovery of Transparent Conducting Materials
Makino, Daisuke, Goto, Tatsuya, Suga, Yoshinori
The development of novel transparent conducting materials (TCMs) is essential for enhancing the performance and reducing the cost of next-generation devices such as solar cells and displays. In this research, we focus on the (Al$_x$Ga$_y$In$_z$)$_2$O$_3$ system and extend the FMA framework, which combines a Factorization Machine (FM) and annealing, to search for optimal compositions and crystal structures with high accuracy and low cost. The proposed method introduces (i) the binarization of continuous variables, (ii) the utilization of good solutions using a Hopfield network, (iii) the activation of global search through adaptive random flips, and (iv) fine-tuning via a bit-string local search. Validation using the (Al$_x$Ga$_y$In$_z$)$_2$O$_3$ data from the Kaggle "Nomad2018 Predicting Transparent Conductors" competition demonstrated that our method achieves faster and more accurate searches than Bayesian optimization and genetic algorithms. Furthermore, its application to multi-objective optimization showed its capability in designing materials by simultaneously considering both the band gap and formation energy. These results suggest that applying our method to larger, more complex search problems and diverse material designs that reflect realistic experimental conditions is expected to contribute to the further advancement of materials informatics.
Quotient Complex Transformer (QCformer) for Perovskite Data Analysis
You, Xinyu, Liu, Xiang, Hu, Chuan-Shen, Xia, Kelin, Sum, Tze Chien
The discovery of novel functional materials is crucial in addressing the challenges of sustainable energy generation and climate change. Hybrid organic-inorganic perovskites (HOIPs) have gained attention for their exceptional optoelectronic properties in photovoltaics. Recently, geometric deep learning, particularly graph neural networks (GNNs), has shown strong potential in predicting material properties and guiding material design. However, traditional GNNs often struggle to capture the periodic structures and higher-order interactions prevalent in such systems. To address these limitations, we propose a novel representation based on quotient complexes (QCs) and introduce the Quotient Complex Transformer (QCformer) for material property prediction. A material structure is modeled as a quotient complex, which encodes both pairwise and many-body interactions via simplices of varying dimensions and captures material periodicity through a quotient operation. Our model leverages higher-order features defined on simplices and processes them using a simplex-based Transformer module. We pretrain QCformer on benchmark datasets such as the Materials Project and JARVIS, and fine-tune it on HOIP datasets. The results show that QCformer outperforms state-of-the-art models in HOIP property prediction, demonstrating its effectiveness. The quotient complex representation and QCformer model together contribute a powerful new tool for predictive modeling of perovskite materials.
MatterChat: A Multi-Modal LLM for Material Science
Tang, Yingheng, Xu, Wenbin, Cao, Jie, Ma, Jianzhu, Gao, Weilu, Farrell, Steve, Erichson, Benjamin, Mahoney, Michael W., Nonaka, Andy, Yao, Zhi
In-silico material discovery and design have traditionally relied on high-fidelity first-principles methods such as density functional theory (DFT) [1] and ab-initio molecular dynamics (AIMD) [2] to accurately model atomic interactions and predict material properties. Despite their effectiveness, these methods face significant challenges due to their prohibitive computational cost, limiting their scalability for highthroughput screening across vast chemical spaces and for simulations over large length and time scales. Moreover, many advanced materials remain beyond the reach of widespread predictive theories due to a fundamental lack of mechanistic understanding. These challenges stem from the inherent complexity of their chemical composition, phase stability, and the intricate interplay of multiple order parameters, compounded by the lack of self-consistent integration between theoretical models and multi-modal experimental findings. As a result, breakthroughs in functional materials, such as new classes of correlated oxides, nitrides, and low-dimensional quantum materials, have largely been serendipitous or guided by phenomenological intuition rather than systematic, theory-driven design. Attempts to predict new materials and functionalities have often led to mixed results, with theoretically proposed systems failing to exhibit the desired properties when synthesized and tested.
Contrastive Language-Structure Pre-training Driven by Materials Science Literature
Suzuki, Yuta, Taniai, Tatsunori, Igarashi, Ryo, Saito, Kotaro, Chiba, Naoya, Ushiku, Yoshitaka, Ono, Kanta
Understanding structure-property relationships is an essential yet challenging aspect of materials discovery and development. To facilitate this process, recent studies in materials informatics have sought latent embedding spaces of crystal structures to capture their similarities based on properties and functionalities. However, abstract feature-based embedding spaces are human-unfriendly and prevent intuitive and efficient exploration of the vast materials space. Here we introduce Contrastive Language--Structure Pre-training (CLaSP), a learning paradigm for constructing crossmodal embedding spaces between crystal structures and texts. CLaSP aims to achieve material embeddings that 1) capture property- and functionality-related similarities between crystal structures and 2) allow intuitive retrieval of materials via user-provided description texts as queries. To compensate for the lack of sufficient datasets linking crystal structures with textual descriptions, CLaSP leverages a dataset of over 400,000 published crystal structures and corresponding publication records, including paper titles and abstracts, for training. We demonstrate the effectiveness of CLaSP through text-based crystal structure screening and embedding space visualization.
Inverse design of potential metastructures inspired from Indian medieval architectural elements
Bhattacharya, Bishakh, Gupta, Tanuj, Sharma, Arun Kumar, Dwivedi, Ankur, Gupta, Vivek, Sahana, Subhadeep, Pathak, Suryansh, Awasthi, Ashish
In this study, we immerse in the intricate world of patterns, examining the structural details of Indian medieval architecture for the discovery of motifs with great application potential from the mechanical metastructure perspective. The motifs that specifically engrossed us are derived from the tomb of I'timad-ud-Daula, situated in the city of Agra, close to the Taj Mahal. In an exploratory study, we designed nine interlaced metastructures inspired from the tomb's motifs. We fabricated the metastructures using additive manufacturing and studied their vibration characteristics experimentally and numerically. We also investigated bandgap modulation with metallic inserts in honeycomb interlaced metastructures. The comprehensive study of these metastructure panels reveals their high performance in controlling elastic wave propagation and generating suitable frequency bandgaps, hence having potential applications as waveguides for noise and vibration control. Finally, we developed a novel AI-based model trained on numerical datasets for the inverse design of metastructures with a desired bandgap.
CrysAtom: Distributed Representation of Atoms for Crystal Property Prediction
Mukherjee, Shrimon, Ghosh, Madhusudan, Basuchowdhuri, Partha
Application of artificial intelligence (AI) has been ubiquitous in the growth of research in the areas of basic sciences. Frequent use of machine learning (ML) and deep learning (DL) based methodologies by researchers has resulted in significant advancements in the last decade. These techniques led to notable performance enhancements in different tasks such as protein structure prediction, drug-target binding affinity prediction, and molecular property prediction. In material science literature, it is well-known that crystalline materials exhibit topological structures. Such topological structures may be represented as graphs and utilization of graph neural network (GNN) based approaches could help encoding them into an augmented representation space. Primarily, such frameworks adopt supervised learning techniques targeted towards downstream property prediction tasks on the basis of electronic properties (formation energy, bandgap, total energy, etc.) and crystalline structures. Generally, such type of frameworks rely highly on the handcrafted atom feature representations along with the structural representations. In this paper, we propose an unsupervised framework namely, CrysAtom, using untagged crystal data to generate dense vector representation of atoms, which can be utilized in existing GNN-based property predictor models to accurately predict important properties of crystals. Empirical results show that our dense representation embeds chemical properties of atoms and enhance the performance of the baseline property predictor models significantly.
Phononic materials with effectively scale-separated hierarchical features using interpretable machine learning
Bastawrous, Mary V., Chen, Zhi, Ogren, Alexander C., Daraio, Chiara, Rudin, Cynthia, Brinson, L. Catherine
Manipulating the dispersive characteristics of vibrational waves is beneficial for many applications, e.g., high-precision instruments. architected hierarchical phononic materials have sparked promise tunability of elastodynamic waves and vibrations over multiple frequency ranges. In this article, hierarchical unit-cells are obtained, where features at each length scale result in a band gap within a targeted frequency range. Our novel approach, the ``hierarchical unit-cell template method,'' is an interpretable machine-learning approach that uncovers global unit-cell shape/topology patterns corresponding to predefined band-gap objectives. A scale-separation effect is observed where the coarse-scale band-gap objective is mostly unaffected by the fine-scale features despite the closeness of their length scales, thus enabling an efficient hierarchical algorithm. Moreover, the hierarchical patterns revealed are not predefined or self-similar hierarchies as common in current hierarchical phononic materials. Thus, our approach offers a flexible and efficient method for the exploration of new regions in the hierarchical design space, extracting minimal effective patterns for inverse design in applications targeting multiple frequency ranges.
How big is Big Data?
Speckhard, Daniel T., Bechtel, Tim, Ghiringhelli, Luca M., Kuban, Martin, Rigamonti, Santiago, Draxl, Claudia
Big data has ushered in a new wave of predictive power using machine learning models. In this work, we assess what {\it big} means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity as much as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogenous sources, (iii) how the feature set and complexity of a model can affect expressivity, and (iv) what infrastructure requirements are needed to create larger datasets and train models on them. In sum, we find that big data present unique challenges along very different aspects that should serve to motivate further work.