Qiang, Bo
NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products
Ding, Yuheng, Wang, Yusong, Qiang, Bo, Yu, Jie, Li, Qi, Zhou, Yiran, Liu, Zhenming
Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Existing deep learning methods for natural product research rely primarily on supervised learning approaches designed for specific downstream tasks. However, such a one-model-per-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well suited to the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy tailored specifically to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutionary information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification against baselines focused on synthesized molecules to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, in a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Finally, we evaluate our method on virtual screening, showing that informative natural product representations can lead to more effective identification of potential drug candidates.
PROflow: An iterative refinement model for PROTAC-induced structure prediction
Qiang, Bo, Shi, Wenxian, Song, Yuxuan, Wu, Menghua
Proteolysis targeting chimeras (PROTACs) are small molecules that trigger the breakdown of traditionally "undruggable" proteins by binding simultaneously to their targets and degradation-associated proteins. A key challenge in their rational design is understanding their structural basis of activity. Due to the lack of crystal structures (18 in the PDB), existing PROTAC docking methods have been forced to simplify the problem into a distance-constrained protein-protein docking task. To address the data issue, we develop a novel pseudo-data generation scheme that requires only binary protein-protein complexes. PROflow's inference speed enables the large-scale screening of PROTAC designs, and computed properties of predicted structures achieve statistically significant correlations with published degradation activities. Targeted protein degradation is an emerging paradigm in rational drug design that induces the breakdown of "undruggable" proteins (Zhao et al., 2022). Proteolysis targeting chimeras (PROTACs) are small molecules that achieve this by simultaneously binding a protein of interest (POI) and a degradation-associated protein (e.g. In contrast to small molecule drugs, which attach to predefined sites on their protein targets, PROTACs operate by inducing a stable ternary complex between themselves and two proteins that do not typically interact.
Latent Chemical Space Searching for Plug-in Multi-objective Molecule Generation
Liu, Ningfeng, Yu, Jie, Xiu, Siyu, Zhao, Xinfang, Lin, Siyu, Qiang, Bo, Zheng, Ruqiu, Jin, Hongwei, Zhang, Liangren, Liu, Zhenming
Molecular generation, an essential method for identifying new drug structures, has been supported by advancements in machine learning and computational technology. However, challenges remain in multi-objective generation, model adaptability, and practical application in drug discovery. In this study, we developed a versatile 'plug-in' molecular generation model that incorporates multiple objectives related to target affinity, drug-likeness, and synthesizability, facilitating its application in various drug development contexts. We improved Particle Swarm Optimization (PSO) in the context of drug discovery and, through comparative experiments, identified PSO-ENP as the optimal variant for multi-objective molecular generation and optimization. The model also incorporates a novel target-ligand affinity predictor, enhancing the model's utility by supporting three-dimensional information and improving synthetic feasibility. Case studies on generating and optimizing large, drug-like marine natural products were performed, underscoring PSO-ENP's effectiveness and demonstrating its considerable potential for practical drug discovery applications.
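As a rough illustration of the search loop described above, the following sketch runs plain, textbook particle swarm optimization over a continuous vector standing in for a latent chemical space. It is not PSO-ENP, whose modifications the abstract does not describe. The function names, the hyperparameters, and the three toy terms inside `weighted_objective` (placeholders for affinity, drug-likeness, and synthesizability predictors) are all assumptions made for illustration.

```python
import random


def weighted_objective(z, weights=(0.5, 0.3, 0.2)):
    """Toy scalarized multi-objective score (lower is better).

    The three terms are hypothetical stand-ins for learned predictors of
    target affinity, drug-likeness, and synthesizability.
    """
    affinity = sum((x - 1.0) ** 2 for x in z)
    drug_likeness = sum(abs(x) for x in z)
    synthesizability = sum(x ** 2 for x in z)
    terms = (affinity, drug_likeness, synthesizability)
    return sum(w * t for w, t in zip(weights, terms))


def pso_minimize(objective, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5,
                 bounds=(-5.0, 5.0), seed=0):
    """Standard PSO: each particle is pulled toward its own best position
    and toward the best position found by the whole swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Clamp to the search box so particles stay in range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


if __name__ == "__main__":
    z_best, score = pso_minimize(weighted_objective, dim=4)
    print(z_best, score)
```

In an actual pipeline, the best latent vector would be passed to a decoder to obtain a molecule, and the objective terms would come from trained predictors rather than closed-form toys.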
Rethinking Specificity in SBDD: Leveraging Delta Score and Energy-Guided Diffusion
Gao, Bowen, Ren, Minsi, Ni, Yuyan, Huang, Yanwen, Qiang, Bo, Ma, Zhi-Ming, Ma, Wei-Ying, Lan, Yanyan
In the field of Structure-based Drug Design (SBDD), deep learning-based generative models have achieved outstanding performance in terms of docking score. However, further study shows that neither existing molecular generative methods nor docking scores account for specificity, so generated molecules bind to almost every protein pocket with high affinity. To address this, we introduce the Delta Score, a new metric for evaluating the specificity of molecular binding. To incorporate this insight into generation, we develop an energy-guided approach using contrastive learning, with active compounds as decoys, to direct generative models toward creating molecules with high specificity. Our empirical results show that this method not only enhances the delta score but also maintains or improves traditional docking scores, bridging the gap between SBDD and real-world needs.
Delta Score: Improving the Binding Assessment of Structure-Based Drug Design Methods
Ren, Minsi, Gao, Bowen, Qiang, Bo, Lan, Yanyan
Structure-based drug design (SBDD) stands at the forefront of drug discovery, emphasizing the creation of molecules that target specific binding pockets. Recent advances in this area have seen the adoption of deep generative models and geometric deep learning techniques, modeling SBDD as a conditional generation task where the target structure serves as context. Historically, evaluation of these models has centered on docking scores, which quantify the predicted binding affinity between a molecule and its target pocket. Although state-of-the-art models report that a majority of their generated ligands exceed the docking score of ground truth ligands in test sets, this raises the question: do these scores align with real-world biological needs? In this paper, we introduce the delta score, a novel evaluation metric grounded in tangible pharmaceutical requirements. Our experiments reveal that molecules produced by current deep generative models significantly lag behind ground truth reference ligands when assessed with the delta score. This metric not only complements existing benchmarks but also provides a direction for subsequent research in the domain.
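The abstract does not give a formula for the delta score. One plausible reading of a specificity-aware metric is sketched below as an assumption, not as the paper's definition: compare a molecule's docking score on its intended pocket with its average score on unrelated pockets. The function name, the input layout, and the convention that lower docking scores mean stronger predicted binding are all hypothetical choices for this example.

```python
from statistics import mean


def delta_score(scores_by_pocket, target):
    """Docking score on the intended pocket minus the mean score on all
    other pockets.

    Assumed convention: lower (more negative) docking scores indicate
    stronger predicted binding, so a more negative result suggests the
    molecule prefers its target over the other pockets.
    """
    off_target = [s for pocket, s in scores_by_pocket.items()
                  if pocket != target]
    if not off_target:
        raise ValueError("need at least one off-target pocket")
    return scores_by_pocket[target] - mean(off_target)


# A selective molecule: strong on pocket A, weaker elsewhere.
selective = {"A": -9.0, "B": -5.0, "C": -7.0}
# A promiscuous molecule: equally strong everywhere.
promiscuous = {"A": -9.0, "B": -9.0, "C": -9.0}

print(delta_score(selective, "A"))    # -9.0 - (-6.0) = -3.0
print(delta_score(promiscuous, "A"))  # 0.0
```

Both molecules have the same raw docking score on pocket A, which is the failure mode the abstract points to: the raw score alone cannot tell them apart, while the difference against off-target pockets can.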
DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
Gao, Bowen, Qiang, Bo, Tan, Haichuan, Ren, Minsi, Jia, Yinjun, Lu, Minsi, Liu, Jingjing, Ma, Weiying, Lan, Yanyan
Virtual screening, which identifies potential drugs from vast compound databases to bind with a particular protein pocket, is a critical step in AI-assisted drug discovery. Traditional docking methods are highly time-consuming and can only work with a restricted search library in real-life applications. Recent supervised learning approaches using scoring functions for binding-affinity prediction, although promising, have not yet surpassed docking methods because they depend strongly on limited data with reliable binding-affinity labels. In this paper, we propose a novel contrastive learning framework, DrugCLIP, which reformulates virtual screening as a dense retrieval task and employs contrastive learning to align representations of binding protein pockets and molecules from a large quantity of pairwise data without explicit binding-affinity scores. We also introduce a data augmentation strategy inspired by biological knowledge to learn better protein-molecule representations. Extensive experiments show that DrugCLIP significantly outperforms traditional docking and supervised learning methods on diverse virtual screening benchmarks with greatly reduced computation time, especially in the zero-shot setting.
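The retrieval framing can be sketched with a generic InfoNCE-style contrastive loss over cosine similarities, followed by ranking a library by similarity to a query pocket. This is a minimal sketch under stated assumptions, not DrugCLIP's implementation: the embeddings here are hand-written toy vectors rather than encoder outputs, only the pocket-to-molecule direction of the loss is shown, and the function names and the temperature value are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(pockets, molecules):
    """Cosine similarity between every pocket and every molecule."""
    p_unit = [normalize(p) for p in pockets]
    m_unit = [normalize(m) for m in molecules]
    return [[sum(a * b for a, b in zip(p, m)) for m in m_unit]
            for p in p_unit]


def info_nce(sim, temperature=0.07):
    """Pocket-to-molecule InfoNCE loss.

    Row i of `sim` holds pocket i against all molecules in the batch;
    the matched pair is assumed to sit on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def rank_molecules(pocket, library):
    """Return library indices sorted from most to least similar."""
    sims = similarity_matrix([pocket], library)[0]
    return sorted(range(len(library)), key=lambda j: -sims[j])


if __name__ == "__main__":
    pockets = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(similarity_matrix(pockets, aligned)))
    print(info_nce(similarity_matrix(pockets, swapped)))
    print(rank_molecules([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]))
```

Once molecule embeddings are precomputed, screening a new pocket reduces to one similarity computation per library entry, which is consistent with the speed advantage over docking that the abstract reports.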
Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model
Qiang, Bo, Zhou, Yiran, Ding, Yuheng, Liu, Ningfeng, Song, Song, Zhang, Liangren, Huang, Bo, Liu, Zhenming
Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both the reaction representation learning and molecule generation tasks, which allows for a more holistic approach. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcomes the limitations of current molecule generation models that rely on a small number of reaction templates. In extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a significant step toward a large-scale deep-learning framework for a variety of reaction-based applications. Deep learning models have found applications across a multitude of scientific research domains [1-3]. Pretraining frameworks [4, 5] facilitate the seamless integration of new tasks, thereby expediting the modeling process, especially for scenarios with limited labeled data. Chemical reactions are the foundation of drug design and organic chemistry studies. Data-mining efforts [6, 7] have enabled deep learning models to be applied to chemical reactions. Building on these data, many data-driven studies have explored the representation learning of chemical reactions. Representation learning refers to automatically learning useful features from the data, which can then be used for various downstream tasks [8]. In earlier works, traditional molecular fingerprints were applied directly for reaction representations [9, 10].
Inspired by natural language processing (NLP) methods, researchers have also applied attention-based networks [11, 12] or contrastive learning techniques [13, 14] in chemical reaction pretraining networks. These representations have been tested on classification tasks [15] or regression tasks [16]. However, these methods ignore the fundamental theories of organic chemistry, which limits their performance. For example, electronic effects and inductive effects will be ignored if bonds or atoms outside the reactive centers are masked [13]. Beyond reaction classification tasks, molecule generation based on chemical reactions is also an important application. This branch of models has been shown to be capable of generating synthetically accessible molecules [17-20]. Earlier works always applied a step-wise template-based molecule generation strategy.
Coarse-to-Fine: a Hierarchical Diffusion Model for Molecule Generation in 3D
Qiang, Bo, Song, Yuxuan, Xu, Minkai, Gong, Jingjing, Gao, Bowen, Zhou, Hao, Ma, Weiying, Lan, Yanyan
Generating desirable molecular structures in 3D is a fundamental problem for drug discovery. Despite considerable progress, existing methods usually generate molecules at atom resolution and ignore intrinsic local structures such as rings, which leads to poor quality in generated structures, especially when generating large molecules. Fragment-based molecule generation is a promising strategy; however, adapting it to 3D non-autoregressive generation is nontrivial because of the combinatorial optimization problems involved. In this paper, we use a coarse-to-fine strategy to tackle this problem, in which a Hierarchical Diffusion-based model (i.e., HierDiff) is proposed to preserve the validity of local segments without relying on autoregressive modeling. Specifically, HierDiff first generates coarse-grained molecule geometries via an equivariant diffusion process, where each coarse-grained node reflects a fragment in a molecule. Then the coarse-grained nodes are decoded into fine-grained fragments by a message-passing process and a newly designed iterative refined sampling module. Lastly, the fine-grained fragments are assembled to derive a complete atomic molecular structure. Extensive experiments demonstrate that HierDiff consistently improves the quality of molecule generation over existing methods.