Not enough data to create a plot.
Try a different view from the menu above.
Schwaller, Philippe
Molecular Hypergraph Neural Networks
Chen, Junwu, Schwaller, Philippe
Graph neural networks (GNNs) have demonstrated promising performance across various chemistry-related tasks. However, conventional graphs only model the pairwise connectivity in molecules, failing to adequately represent higher-order connections like multi-center bonds and conjugated structures. To tackle this challenge, we introduce molecular hypergraphs and propose Molecular Hypergraph Neural Networks (MHNN) to predict the optoelectronic properties of organic semiconductors, where hyperedges represent conjugated structures. A general algorithm is designed for irregular high-order connections, which can efficiently operate on molecular hypergraphs with hyperedges of various orders. The results show that MHNN outperforms all baseline models on most tasks of OPV, OCELOTv1 and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D geometric information, surpassing the baseline model that utilizes atom positions. Moreover, MHNN achieves better performance than pretrained GNNs under limited training data, underscoring its excellent data efficiency. This work provides a new strategy for more general molecular representations and property prediction tasks related to high-order connections.
FSscore: A Machine Learning-based Synthetic Feasibility Score Leveraging Human Expertise
Neeser, Rebecca M., Correia, Bruno, Schwaller, Philippe
Determining whether a molecule can be synthesized is crucial for many aspects of chemistry and drug discovery, allowing prioritization of experimental work and ranking molecules in de novo design tasks. Existing scoring approaches to assess synthetic feasibility struggle to extrapolate to out-of-distribution chemical spaces or fail to discriminate based on minor differences such as chirality that might be obvious to trained chemists. This work aims to address these limitations by introducing the Focused Synthesizability score (FSscore), which learns to rank structures based on binary preferences using a graph attention network. First, a baseline trained on an extensive set of reactant-product pairs is established that subsequently is fine-tuned with expert human feedback on a chemical space of interest. Fine-tuning on focused datasets improves performance on these chemical scopes over the pre-trained model exhibiting moderate performance and generalizability. This enables distinguishing hard- from easy-to-synthesize molecules and improving the synthetic accessibility of generative model outputs. On very complex scopes with limited labels achieving satisfactory gains remains challenging. The FSscore showcases how human expert feedback can be utilized to optimize the assessment of synthetic feasibility for a variety of applications.
Holistic chemical evaluation reveals pitfalls in reaction prediction models
Gil, Victor Sabanza, Bran, Andres M., Franke, Malte, Schlama, Remi, Luterbacher, Jeremy S., Schwaller, Philippe
The prediction of chemical reactions has gained significant interest within the machine learning community in recent years, owing to its complexity and crucial applications in chemistry. However, model evaluation for this task has been mostly limited to simple metrics like top-k accuracy, which obfuscates fine details of a model's limitations. Inspired by progress in other fields, we propose a new assessment scheme that builds on top of current approaches, steering towards a more holistic evaluation. We introduce the following key components for this goal: CHORISO, a curated dataset along with multiple tailored splits to recreate chemically relevant scenarios, and a collection of metrics that provide a holistic view of a model's advantages and limitations. Application of this method to state-of-the-art models reveals important differences on sensitive fronts, especially stereoselectivity and chemical out-of-distribution generalization. Our work paves the way towards robust prediction models that can ultimately accelerate chemical discovery.
Extracting human interpretable structure-property relationships in chemistry using XAI and large language models
Wellawatte, Geemi P., Schwaller, Philippe
Explainable Artificial Intelligence (XAI) is an emerging field in AI that aims to address the opaque nature of machine learning models. Furthermore, it has been shown that XAI can be used to extract input-output relationships, making them a useful tool in chemistry to understand structure-property relationships. However, one of the main limitations of XAI methods is that they are developed for technically oriented users. We propose the XpertAI framework that integrates XAI methods with large language models (LLMs) accessing scientific literature to generate accessible natural language explanations of raw chemical data automatically. We conducted 5 case studies to evaluate the performance of XpertAI. Our results show that XpertAI combines the strengths of LLMs and XAI tools in generating specific, scientific, and interpretable explanations.
Transformers and Large Language Models for Chemistry and Drug Discovery
Bran, Andres M, Schwaller, Philippe
Language modeling has seen impressive progress over the last years, mainly prompted by the invention of the Transformer architecture, sparking a revolution in many fields of machine learning, with breakthroughs in chemistry and biology. In this chapter, we explore how analogies between chemical and natural language have inspired the use of Transformers to tackle important bottlenecks in the drug discovery process, such as retrosynthetic planning and chemical space exploration. The revolution started with models able to perform particular tasks with a single type of data, like linearised molecular graphs, which then evolved to include other types of data, like spectra from analytical instruments, synthesis actions, and human language. A new trend leverages recent developments in large language models, giving rise to a wave of models capable of solving generic tasks in chemistry, all facilitated by the flexibility of natural language. As we continue to explore and harness these capabilities, we can look forward to a future where machine learning plays an even more integral role in accelerating scientific discovery.
ODEFormer: Symbolic Regression of Dynamical Systems with Transformers
d'Ascoli, Stéphane, Becker, Sören, Mathis, Alexander, Schwaller, Philippe, Kilbertus, Niki
Recent triumphs of machine learning (ML) spark growing enthusiasm for accelerating scientific discovery [1-3]. In particular, inferring dynamical laws governing observational data is an extremely challenging task that is anticipated to benefit substantially from modern ML methods. Modeling dynamical systems for forecasting, control, and system identification has been studied by various communities within ML. Successful modern approaches are primarily based on advances in deep learning, such as neural ordinary differential equation (NODE) (see Chen et al. [4] and many extensions thereof). However, these models typically lack interpretability due to their black-box nature, which has inspired extensive research on explainable ML of overparameterized models [5, 6].
ChemCrow: Augmenting large-language models with chemistry tools
Bran, Andres M, Cox, Sam, Schilter, Oliver, Baldassari, Carlo, White, Andrew D, Schwaller, Philippe
Over the last decades, excellent computational chemistry tools have been developed. Integrating them into a single platform with enhanced accessibility could help reaching their full potential by overcoming steep learning curves. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 18 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned and executed the syntheses of an insect repellent, three organocatalysts, and guided the discovery of a novel chromophore. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and Chemcrow's performance. Our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry.
Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design
Guo, Jeff, Schwaller, Philippe
Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high reward molecules and faster, given a fixed oracle budget. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design.
14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon
Jablonka, Kevin Maik, Ai, Qianxiang, Al-Feghali, Alexander, Badhwar, Shruti, Bocarsly, Joshua D., Bran, Andres M, Bringuier, Stefan, Brinson, L. Catherine, Choudhary, Kamal, Circi, Defne, Cox, Sam, de Jong, Wibe A., Evans, Matthew L., Gastellu, Nicolas, Genzling, Jerome, Gil, María Victoria, Gupta, Ankur K., Hong, Zhi, Imran, Alishba, Kruschwitz, Sabine, Labarre, Anne, Lála, Jakub, Liu, Tao, Ma, Steven, Majumdar, Sauradeep, Merz, Garrett W., Moitessier, Nicolas, Moubarak, Elias, Mouriño, Beatriz, Pelkie, Brenden, Pieler, Michael, Ramos, Mayk Caldas, Ranković, Bojana, Rodriques, Samuel G., Sanders, Jacob N., Schwaller, Philippe, Schwarting, Marcus, Shi, Jiale, Smit, Berend, Smith, Ben E., Van Herck, Joren, Völker, Christoph, Ward, Logan, Warren, Sean, Weiser, Benjamin, Zhang, Sylvester, Zhang, Xiaoqi, Zia, Ghezal Ahmad, Scourtas, Aristana, Schmidt, KJ, Foster, Ian, White, Andrew D., Blaiszik, Ben
Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon. This article chronicles the projects built as part of this hackathon. Participants employed LLMs for various applications, including predicting properties of molecules and materials, designing novel interfaces for tools, extracting knowledge from unstructured data, and developing new educational applications. The diverse topics and the fact that working prototypes could be generated in less than two days highlight that LLMs will profoundly impact the future of our fields. The rich collection of ideas and projects also indicates that the applications of LLMs are not limited to materials science and chemistry but offer potential benefits to a wide range of scientific disciplines.
Augmented Memory: Capitalizing on Experience Replay to Accelerate De Novo Molecular Design
Guo, Jeff, Schwaller, Philippe
Sample efficiency is a fundamental challenge in de novo molecular design. Ideally, molecular generative models should learn to satisfy a desired objective under minimal oracle evaluations (computational prediction or wet-lab experiment). This problem becomes more apparent when using oracles that can provide increased predictive accuracy but impose a significant cost. Consequently, these oracles cannot be directly optimized under a practical budget. Molecular generative models have shown remarkable sample efficiency when coupled with reinforcement learning, as demonstrated in the Practical Molecular Optimization (PMO) benchmark. Here, we propose a novel algorithm called Augmented Memory that combines data augmentation with experience replay. We show that scores obtained from oracle calls can be reused to update the model multiple times. We compare Augmented Memory to previously proposed algorithms and show significantly enhanced sample efficiency in an exploitation task and a drug discovery case study requiring both exploration and exploitation. Our method achieves a new state-of-the-art in the PMO benchmark which enforces a computational budget, outperforming the previous best performing method on 19/23 tasks.