The Latent Road to Atoms: Backmapping Coarse-grained Protein Structures with Latent Diffusion

Han, Xu; Sun, Yuancheng; Chen, Kai; Liu, Kang; Ye, Qiwei

arXiv.org Artificial Intelligence 

Coarse-grained (CG) molecular dynamics simulations offer computational efficiency for exploring protein conformational ensembles and thermodynamic properties. Although coarse representations enable large-scale simulations across extended temporal and spatial ranges, the loss of atomic-level detail limits their utility in tasks such as ligand docking and protein-protein interaction prediction. Backmapping, the process of reconstructing all-atom structures from coarse-grained representations, is crucial for recovering these fine details. While recent machine learning methods have made strides in protein structure generation, challenges persist in reconstructing diverse atomistic conformations that maintain geometric accuracy and chemical validity. In this paper, we present Latent Diffusion Backmapping (LDB), a novel approach that applies denoising diffusion in latent space to address these challenges. By combining discrete latent encoding with diffusion, LDB bypasses the need for equivariant and internal coordinate manipulation, which significantly simplifies training and sampling and facilitates broader exploration of configuration space. We evaluate LDB on three distinct protein datasets, where it achieves state-of-the-art performance and efficiently reconstructs structures with high structural accuracy and chemical validity. Moreover, LDB shows exceptional versatility in capturing diverse protein ensembles, highlighting its capability to explore intricate conformational spaces.

Coarse-grained molecular dynamics (CG-MD) simulation has become an indispensable tool in computational biology for simulating large biomolecular systems (Das & Baker, 2008; Liwo et al., 2014; Kmiecik et al., 2016; Souza et al., 2021; Majewski et al., 2023; Arts et al., 2023). By grouping atoms into super-atoms, or beads, CG models significantly decrease computational requirements and allow the observation of long-timescale processes such as folding, aggregation, and self-assembly (Lequieu et al., 2019; Shmilovich et al., 2020; Mohr et al., 2022). However, CG representations inherently sacrifice the atomistic details of protein structures, limiting their application to many important downstream tasks in drug discovery, such as molecular recognition, deciphering signaling pathways, and predicting allosteric sites (Badaczewska-Dawid et al., 2020; Vickery & Stansfeld, 2021; Zambaldi et al., 2024).
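
To make the loss of information concrete, the following minimal sketch maps atoms to beads by taking a mass-weighted centroid. This is only one common choice of CG mapping and is an assumption here, not necessarily the mapping used by LDB; the function name `coarse_grain` and the toy coordinates are hypothetical. The example shows two different atomistic arrangements collapsing onto the same bead, which is why backmapping is a one-to-many problem.

```python
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def coarse_grain(
    coords: Sequence[Vec3],
    masses: Sequence[float],
    bead_of_atom: Sequence[int],
) -> Dict[int, Vec3]:
    """Place each bead at the mass-weighted centroid of its atoms.

    Illustrative assumption: a centre-of-mass mapping; real CG models
    may instead use, for example, the C-alpha position of each residue.
    """
    weighted_sums: Dict[int, list] = {}
    total_mass: Dict[int, float] = {}
    for xyz, mass, bead in zip(coords, masses, bead_of_atom):
        acc = weighted_sums.setdefault(bead, [0.0, 0.0, 0.0])
        for k in range(3):
            acc[k] += mass * xyz[k]
        total_mass[bead] = total_mass.get(bead, 0.0) + mass
    return {
        bead: tuple(c / total_mass[bead] for c in acc)
        for bead, acc in weighted_sums.items()
    }


# Two toy "conformations" of a two-atom group assigned to a single bead.
masses = [12.0, 12.0]
bead_of_atom = [0, 0]
conf_a = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]  # atoms aligned along x
conf_b = [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]  # atoms aligned along y

beads_a = coarse_grain(conf_a, masses, bead_of_atom)
beads_b = coarse_grain(conf_b, masses, bead_of_atom)

# Distinct atomistic structures, identical CG representation.
print(beads_a == beads_b)
```

Because the forward mapping discards the orientation of the atoms within each bead, a backmapping model has to generate a distribution of plausible all-atom structures rather than a single answer.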