Supplementary Information: TARTARUS: Practical and Realistic Benchmarks for Inverse Molecular Design

Neural Information Processing Systems 

S1. INTRODUCTION

Traditionally, property-guided optimization has relied on expert intuition [1] and several rounds of trial, error, and human-inspired optimization, occasionally supported by computer simulations. Alternatively, computer-assisted approaches have employed virtual screening [2] or optimization algorithms such as genetic algorithms (GAs) [3-5]. More recently, with the surge of deep learning, deep generative models have emerged, specifically designed to operate in chemical space and tackle inverse molecular design [6-8]. This has led to the development of numerous algorithmic approaches for this purpose, with the most popular including variational autoencoders (VAEs) [9, 10], generative adversarial networks (GANs) [11, 12], and reinforcement learning (RL) [13, 14].

METHODS OVERVIEW

In this section, we provide an overview of the molecular generative models employed throughout this work and summarize the associated design choices we needed to make during their implementation. The molecular design algorithms we considered are VAEs, long short-term memory hill climbing (LSTM-HC) models [15-17], REINVENT [18], JANUS [19], and a graph-based genetic algorithm (GB-GA) [20].

At the core of the majority of these approaches are molecular string representations, the most commonly used of which is the Simplified Molecular Input Line Entry System (SMILES) [21]. Accordingly, many of the algorithms tested rely on predicting subsequent characters from partial strings to propose structures. However, algorithms based on SMILES can regularly produce invalid strings that do not represent molecules, which is problematic both in terms of robustness and interpretability of the corresponding methodologies [22, 23]. Consequently, this issue was addressed systematically by introducing Self-Referencing Embedded Strings (SELFIES) [22], a molecular string representation that guarantees validity.
Thus, unlike with SMILES, every arbitrary combination of SELFIES characters represents a molecule. Nevertheless, the impact of SELFIES on structure optimization has not yet been evaluated systematically [23]. To address this gap, we modify some of the existing SMILES-based generative models to also be compatible with SELFIES and test how their performance depends on the representation, similar to recent work [24].

Among the models tested, REINVENT, the VAEs, and the LSTM-HC models use recurrent neural networks (RNNs) [25] to learn the conditional probability distributions of the characters that represent molecules. RNNs are a class of artificial neural networks (ANNs) that utilize sequential information from their previous predictions and states.
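As a minimal illustration of what it means to learn conditional probability distributions over characters, the sketch below fits a first-order (bigram) character model to a handful of strings and then samples new strings one character at a time. This is a deliberately simplified stand-in and not the architecture used in this work: an RNN conditions on the entire preceding prefix through its hidden state, whereas the bigram model conditions only on the previous character. The function names (`train_bigram`, `log_likelihood`, `sample`), the start and end markers, and the three-string training set are hypothetical choices made for illustration, and each character is treated as one token, which ignores multi-character SMILES tokens such as Cl or Br.

```python
import math
import random

# Hypothetical start/end markers; they are not part of SMILES or SELFIES.
START, END = "^", "$"


def train_bigram(strings):
    """Estimate P(next char | previous char) from transition counts.

    A first-order stand-in for the conditional distributions an RNN learns.
    """
    counts = {}
    for s in strings:
        tokens = [START] + list(s) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            row = counts.setdefault(prev, {})
            row[nxt] = row.get(nxt, 0) + 1
    # Normalize each row of counts into a probability distribution.
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {c: n / total for c, n in row.items()}
    return model


def log_likelihood(model, s):
    """Sum of log conditional probabilities of each character given the previous one."""
    tokens = [START] + list(s) + [END]
    ll = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p = model.get(prev, {}).get(nxt, 0.0)
        if p == 0.0:
            return float("-inf")  # transition never seen in training
        ll += math.log(p)
    return ll


def sample(model, rng, max_len=50):
    """Generate a string by repeatedly sampling the next character."""
    out = []
    prev = START
    for _ in range(max_len):
        chars, probs = zip(*model[prev].items())
        nxt = rng.choices(chars, weights=probs, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


# Toy training set of SMILES-like strings (illustrative only).
training = ["CCO", "CCN", "C=O"]
model = train_bigram(training)
generated = sample(model, random.Random(0))
```

Nothing in this sampling loop enforces chemical validity. With SMILES characters, a model of this kind can emit strings that do not parse to a molecule, which is the failure mode that motivates comparing against SELFIES.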
