cdvae
Establishing baselines for generative discovery of inorganic crystals
Szymanski, Nathan J., Bartel, Christopher J.
Generative artificial intelligence offers a promising avenue for materials discovery, yet its advantages over traditional methods remain unclear. In this work, we introduce and benchmark two baseline approaches - random enumeration of charge-balanced prototypes and data-driven ion exchange of known compounds - against three generative models: a variational autoencoder, a large language model, and a diffusion model. Our results show that established methods such as ion exchange perform comparably well in generating stable materials, although many of these materials tend to closely resemble known compounds. In contrast, generative models excel at proposing novel structural frameworks and, when sufficient training data is available, can more effectively target properties such as electronic band gap and bulk modulus while maintaining a high stability rate. To enhance the performance of both the baseline and generative approaches, we implement a post-generation screening step in which all proposed structures are passed through stability and property filters from pre-trained machine learning models including universal interatomic potentials. This low-cost filtering step leads to substantial improvement in the success rates of all methods, remains computationally efficient, and ultimately provides a practical pathway toward more effective generative strategies for materials discovery.
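As a rough illustration of the first baseline named in this abstract, enumerating charge-balanced compositions can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name, the single fixed oxidation state per element, and the `max_count` bound are all assumptions made for the example.

```python
from itertools import product


def charge_balanced_stoichiometries(oxidation_states, max_count=4):
    """Yield integer stoichiometries whose total formal charge is zero.

    oxidation_states: dict mapping element symbol -> one assumed oxidation
    state (a simplification; real elements can take several).
    max_count: hypothetical upper bound on each element's count.
    """
    elements = list(oxidation_states)
    for counts in product(range(1, max_count + 1), repeat=len(elements)):
        total_charge = sum(
            n * oxidation_states[el] for n, el in zip(counts, elements)
        )
        if total_charge == 0:
            yield dict(zip(elements, counts))


# Illustrative input only: Li(+1), Co(+3), O(-2), with counts limited to 1 or 2.
results = list(
    charge_balanced_stoichiometries({"Li": 1, "Co": 3, "O": -2}, max_count=2)
)
print(results)
```

The sketch does not reduce multiples of the same formula (for example Li2Co2O4 versus LiCoO2) and does not assign structural prototypes; both steps would be needed in a real enumeration.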
Fine-Tuned Language Models Generate Stable Inorganic Materials as Text
Gruver, Nate, Sriram, Anuroop, Madotto, Andrea, Wilson, Andrew Gordon, Zitnick, C. Lawrence, Ulissi, Zachary
We propose fine-tuning large language models for generation of stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49% vs 28%) of CD-VAE, a competing diffusion model. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable materials, infilling of partial structures and text-conditional generation. Finally, we show that language models' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data. Large language models (LLMs) are trained to compress large text datasets, but can also act as strong foundations for non-text data (Delétang et al., 2023). As compressors, LLMs extract common patterns and find simple programs that can produce them (Goldblum et al., 2023; Sutskever, 2023), regardless of the data's origin. Alongside generality, LLM pre-training also gives rise to sample efficiency, as in-context learning and fine-tuning require far fewer training examples to identify salient patterns than training a model from scratch (Brown et al., 2020). The generality and sample efficiency of LLMs make them particularly promising for scientific problems, where data are often limited, collected from diverse sources, or challenging for non-experts to interpret. In materials science, for example, the number of known stable materials is relatively small, and the data describing each material are diverse, including composition, structure, and complex properties.
LLMs can learn generalizable rules from a small number of examples (Zhu et al., 2023), combine modalities into a single model (Moon et al., 2023), and provide users with a text-based interface. A text interface, in particular, has the potential to improve access to scientific discovery (White, 2023); LLMs can use text to describe new observations, or, in design applications (e.g. [...]). In this work, we show that fine-tuned LLMs can generate the three-dimensional structure of stable crystals as text (Figure 1).
Scalable Diffusion for Materials Generation
Yang, Mengjiao, Cho, KwangHwan, Merchant, Amil, Abbeel, Pieter, Schuurmans, Dale, Mordatch, Igor, Cubuk, Ekin Dogus
Generative models trained on internet-scale data are capable of generating novel and realistic texts, images, and videos. A natural next question is whether these models can advance science, for example by generating novel stable materials. Traditionally, models with explicit structures (e.g., graphs) have been used in modeling structural relationships in scientific data (e.g., atoms and bonds in crystals), but generating structures can be difficult to scale to large and complex systems. Another challenge in generating materials is the mismatch between standard generative modeling metrics and downstream applications. For instance, common metrics such as the reconstruction error do not correlate well with the downstream goal of discovering stable materials. In this work, we tackle the scalability challenge by developing a unified crystal representation that can represent any crystal structure (UniMat), followed by training a diffusion probabilistic model on these UniMat representations. Our empirical results suggest that despite the lack of explicit structure modeling, UniMat can generate high fidelity crystal structures from larger and more complex chemical systems, outperforming previous graph-based approaches under various generative modeling metrics. To better connect the generation quality of materials to downstream applications, such as discovering novel stable materials, we propose additional metrics for evaluating generative models of materials, including per-composition formation energy and stability with respect to convex hulls through decomposition energy from Density Functional Theory (DFT). Lastly, we show that conditional generation with UniMat can scale to previously established crystal datasets with up to millions of crystal structures, outperforming random structure search (the current leading method for structure discovery) in discovering new stable materials.
Causal Dynamic Variational Autoencoder for Counterfactual Regression in Longitudinal Data
Bouchattaoui, Mouad El, Tami, Myriam, Lepetit, Benoit, Cournède, Paul-Henry
Estimating treatment effects over time is relevant in many real-world applications, such as precision medicine, epidemiology, economy, and marketing. Many state-of-the-art methods either assume the observations of all confounders or seek to infer the unobserved ones. We take a different perspective by assuming unobserved risk factors, i.e., adjustment variables that affect only the sequence of outcomes. Under unconfoundedness, we target the Individual Treatment Effect (ITE) estimation with unobserved heterogeneity in the treatment response due to missing risk factors. We address the challenges posed by time-varying effects and unobserved adjustment variables. Led by theoretical results over the validity of the learned adjustment variables and generalization bounds over the treatment effect, we devise Causal DVAE (CDVAE). This model combines a Dynamic Variational Autoencoder (DVAE) framework with a weighting strategy using propensity scores to estimate counterfactual responses. The CDVAE model allows for accurate estimation of ITE and captures the underlying heterogeneity in longitudinal data.
Data-driven discovery of novel 2D materials by deep generative models
Lyngby, Peder, Thygesen, Kristian Sommer
Efficient algorithms to generate candidate crystal structures with good stability properties can play a key role in data-driven materials discovery. Here we show that a crystal diffusion variational autoencoder (CDVAE) is capable of generating two-dimensional (2D) materials of high chemical and structural diversity and formation energies mirroring the training structures. Specifically, we train the CDVAE on 2615 2D materials with energy above the convex hull $\Delta H_{\mathrm{hull}}< 0.3$ eV/atom, and generate 5003 materials that we relax using density functional theory (DFT). We also generate 14192 new crystals by systematic element substitution of the training structures. We find that the generative model and lattice decoration approach are complementary and yield materials with similar stability properties but very different crystal structures and chemical compositions. In total we find 11630 predicted new 2D materials, where 8599 of these have $\Delta H_{\mathrm{hull}}< 0.3$ eV/atom as the seed structures, while 2004 are within 50 meV of the convex hull and could potentially be synthesized. The relaxed atomic structures of all the materials are available in the open Computational 2D Materials Database (C2DB). Our work establishes the CDVAE as an efficient and reliable crystal generation machine, and significantly expands the space of 2D materials.
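The two stability thresholds quoted in this abstract (0.3 eV/atom for the seed window, 50 meV for candidates close to the convex hull) can be applied as a simple screening step. The following is a minimal sketch under stated assumptions: the function name, the label strings, and the sample energies are hypothetical, and the 50 meV figure is assumed to be per atom.

```python
def classify_by_hull_energy(e_hull, seed_cutoff=0.3, near_hull_cutoff=0.05):
    """Label a structure by its energy above the convex hull (eV/atom).

    seed_cutoff: the 0.3 eV/atom window used for the training structures.
    near_hull_cutoff: 50 meV, assumed here to be per atom.
    """
    if e_hull <= near_hull_cutoff:
        return "near_hull"
    if e_hull < seed_cutoff:
        return "within_seed_window"
    return "rejected"


# Hypothetical energies above hull in eV/atom, for illustration only.
sample_energies = [0.01, 0.12, 0.45]
labels = [classify_by_hull_energy(e) for e in sample_energies]
print(labels)
```

In practice the energies would come from DFT relaxations referenced to a convex hull of competing phases; that step is outside the scope of this sketch.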
CDVAE: Co-embedding Deep Variational Auto Encoder for Conditional Variational Generation
Lu, Jiajun, Deshpande, Aditya, Forsyth, David
Problems such as predicting a new shading field (Y) for an image (X) are ambiguous: many very distinct solutions are good. Representing this ambiguity requires building a conditional model P(Y|X) of the prediction, conditioned on the image. Such a model is difficult to train, because we do not usually have training data containing many different shadings for the same image. As a result, we need different training examples to share data to produce good models. This presents a danger we call "code space collapse" -- the training procedure produces a model that has a very good loss score, but which represents the conditional distribution poorly. We demonstrate an improved method for building conditional models by exploiting a metric constraint on training data that prevents code space collapse. We demonstrate our model on two example tasks using real data: image saturation adjustment and image relighting. We describe quantitative metrics to evaluate ambiguous generation results. Our results quantitatively and qualitatively outperform different strong baselines.