Retrosynthesis prediction enhanced by in-silico reaction data augmentation

Zhang, Xu, Mo, Yiming, Wang, Wenguan, Yang, Yi

arXiv.org Artificial Intelligence 

Retrosynthesis, the process of identifying precursors for a target molecule, is essential for material design and drug discovery (Blakemore et al, 2018). However, the huge search space of possible chemical transformations, together with the enormous time required even of experts, makes this task challenging. Efficient computer-assisted synthesis (Corey and Wipke, 1969; Corey et al, 1985; Coley et al, 2017) has therefore long been explored. Thanks to recent advances in artificial intelligence, machine learning (ML)-based methods (Segler et al, 2018; Mikulak-Klucznik et al, 2020; Schwaller et al, 2021; Toniato et al, 2021; Yu et al, 2023; Born and Manica, 2023) have emerged to assist chemists in designing experiments and gaining insights that might not be achievable through traditional methods alone, bringing retrosynthesis research to a pivotal moment. ML-based methods for single-step retrosynthesis can be roughly categorized into three groups.
Template-based methods predict reactants using reaction templates that encode core reactive rules. LHASA (Corey et al, 1985), the first retrosynthesis program, uses manually encoded templates to predict retrosynthetic routes. To scale to exponentially growing chemical knowledge (Segler et al, 2018), data-driven methods (Segler and Waller, 2017; Coley et al, 2017; Dai et al, 2019; Baylon et al, 2019; Chen and Jung, 2021) extract a large number of reaction templates from data and formulate retrosynthesis as a template retrieval or classification task.
Semi-template methods (Shi et al, 2020; Yan et al, 2020; Somnath et al, 2021; Wang et al, 2021) decompose retrosynthesis into two stages: they typically (1) identify the reactive sites to convert the product into synthons and (2) complete the synthons into reactant(s). These methods use the "reaction centers" in templates to supervise the training procedure (Sun et al, 2021).
Template-free methods view single-step retrosynthesis prediction as a machine translation task, where deep generative models directly translate the given product into reactant(s).
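To make the translation framing concrete, the following minimal sketch shows how one reaction could be turned into a (source, target) token pair for a sequence-to-sequence model. It assumes molecules are written as SMILES strings and split with an atom-level regular expression, a common choice for such models but not something the text above specifies. The names `tokenize` and `make_pair` and the esterification example are hypothetical and purely illustrative.

```python
import re
from typing import List, Tuple

# Atom-level SMILES tokenizer (an assumed preprocessing choice, not the
# authors' stated method). Bracket atoms and two-letter halogens are kept whole.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall skips characters it cannot match, so verify nothing was dropped.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens


def make_pair(product: str, reactants: str) -> Tuple[List[str], List[str]]:
    """Build one translation example: product tokens in, reactant tokens out."""
    return tokenize(product), tokenize(reactants)


# Illustrative example: ethyl acetate as the product, acetic acid and ethanol
# as the reactants (multiple reactants are joined with '.').
source, target = make_pair("CC(=O)OCC", "CC(=O)O.CCO")
print(" ".join(source))
print(" ".join(target))
```

In this view, a generative model is trained to emit the target token sequence given the source sequence, without consulting any reaction template.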