Zhang, Odin
Reaction-conditioned De Novo Enzyme Design with GENzyme
Hua, Chenqing, Lu, Jiarui, Liu, Yong, Zhang, Odin, Tang, Jian, Ying, Rex, Jin, Wengong, Wolf, Guy, Precup, Doina, Zheng, Shuangjia
The introduction of models like RFDiffusionAA, AlphaFold3, AlphaProteo, and Chai1 has revolutionized protein structure modeling and interaction prediction, primarily from a binding perspective, focusing on creating ideal lock-and-key models. However, these methods can fall short for enzyme-substrate interactions, where perfect binding models are rare, and induced fit states are more common. To address this, we shift to a functional perspective for enzyme design, where the enzyme function is defined by the reaction it catalyzes. Here, we introduce \textsc{GENzyme}, a \textit{de novo} enzyme design model that takes a catalytic reaction as input and generates the catalytic pocket, full enzyme structure, and enzyme-substrate binding complex. \textsc{GENzyme} is an end-to-end, three-staged model that integrates (1) a catalytic pocket generation and sequence co-design module, (2) a pocket inpainting and enzyme inverse folding module, and (3) a binding and screening module to optimize and predict enzyme-substrate complexes. The entire design process is driven by the catalytic reaction being targeted. This reaction-first approach allows for more accurate and biologically relevant enzyme design, potentially surpassing structure-based and binding-focused models in creating enzymes capable of catalyzing specific reactions. We provide \textsc{GENzyme} code at https://github.com/WillHua127/GENzyme.
EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics
Hua, Chenqing, Liu, Yong, Zhang, Dinghuai, Zhang, Odin, Luan, Sitao, Yang, Kevin K., Wolf, Guy, Precup, Doina, Zheng, Shuangjia
Enzyme design is a critical area in biotechnology, with applications ranging from drug development to synthetic biology. Traditional methods for enzyme function prediction or protein binding pocket design often fall short in capturing the dynamic and complex nature of enzyme-substrate interactions, particularly in catalytic processes. To address the challenges, we introduce EnzymeFlow, a generative model that employs flow matching with hierarchical pre-training and enzyme-reaction co-evolution to generate catalytic pockets for specific substrates and catalytic reactions. Additionally, we introduce a large-scale, curated, and validated dataset of enzyme-reaction pairs, specifically designed for the catalytic pocket generation task, comprising a total of $328,192$ pairs. By incorporating evolutionary dynamics and reaction-specific adaptations, EnzymeFlow becomes a powerful model for designing enzyme pockets, which is capable of catalyzing a wide range of biochemical reactions. Experiments on the new dataset demonstrate the model's effectiveness in designing high-quality, functional enzyme catalytic pockets, paving the way for advancements in enzyme engineering and synthetic biology. We provide EnzymeFlow code at https://github.com/WillHua127/EnzymeFlow with notebook demonstration at https://github.com/WillHua127/EnzymeFlow/blob/main/enzymeflow_demo.ipynb.
Token-Mol 1.0: Tokenized drug design with large language model
Wang, Jike, Qin, Rui, Wang, Mingyang, Fang, Meijing, Zhang, Yangyang, Zhu, Yuchen, Su, Qun, Gou, Qiaolin, Shen, Chao, Zhang, Odin, Wu, Zhenxing, Jiang, Dejun, Zhang, Xujun, Zhao, Huifeng, Wan, Xiaozhe, Wu, Zhourui, Liu, Liwei, Kang, Yu, Hsieh, Chang-Yu, Hou, Tingjun
Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts.
Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow
Chen, Zhichao, Li, Haoxuan, Wang, Fangyikang, Zhang, Odin, Xu, Hu, Jiang, Xiaoyu, Song, Zhihuan, Wang, Eric H.
Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but there remain two long-neglected issues to be addressed: (1). Inaccurate Imputation, which arises from inherently sample-diversification-pursuing generative process of DMs. (2). Difficult Training, which stems from intricate design required for the mask matrix in model training stage. To address these concerns within the realm of numerical tabular datasets, we introduce a novel principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, based on Wasserstein gradient flow (WGF) framework, we first prove that issue (1) stems from the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI's objective plus diversification-promoting non-negative terms. Based on this, we then design a novel cost functional with diversification-discouraging negative entropy and derive our KnewImp approach within WGF framework and reproducing kernel Hilbert space. After that, we prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, eliminating the need for the mask matrix and hence naturally addressing issue (2). Extensive experiments demonstrate that our proposed KnewImp approach significantly outperforms existing state-of-the-art methods.
PPFlow: Target-aware Peptide Design with Torsional Flow Matching
Lin, Haitao, Zhang, Odin, Zhao, Huifeng, Jiang, Dejun, Wu, Lirong, Liu, Zicheng, Huang, Yufei, Li, Stan Z.
Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods of AI-assisted peptide drug discovery are not fully explored. To fill the gap, we propose a target-aware peptide design method called \textsc{PPFlow}, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for the peptide structure design. Besides, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design and to allow the training of deep learning methods. Extensive experiments show that PPFlow reaches state-of-the-art performance in tasks of peptide drug generation and optimization in comparison with baseline models, and can be generalized to other tasks including docking and side-chain packing.
Deep Lead Optimization: Leveraging Generative AI for Structural Modification
Zhang, Odin, Lin, Haitao, Zhang, Hui, Zhao, Huifeng, Huang, Yufei, Huang, Yuansheng, Jiang, Dejun, Hsieh, Chang-yu, Pan, Peichen, Hou, Tingjun
The idea of using deep-learning-based molecular generation to accelerate discovery of drug candidates has attracted extraordinary attention, and many deep generative models have been developed for automated drug design, termed molecular generation. In general, molecular generation encompasses two main strategies: de novo design, which generates novel molecular structures from scratch, and lead optimization, which refines existing molecules into drug candidates. Among them, lead optimization plays an important role in real-world drug design. For example, it can enable the development of me-better drugs that are chemically distinct yet more effective than the original drugs. It can also facilitate fragment-based drug design, transforming virtual-screened small ligands with low affinity into first-in-class medicines. Despite its importance, automated lead optimization remains underexplored compared to the well-established de novo generative models, due to its reliance on complex biological and chemical knowledge. To bridge this gap, we conduct a systematic review of traditional computational methods for lead optimization, organizing these strategies into four principal sub-tasks with defined inputs and outputs. This review delves into the basic concepts, goals, conventional CADD techniques, and recent advancements in AIDD. Additionally, we introduce a unified perspective based on constrained subgraph generation to harmonize the methodologies of de novo design and lead optimization. Through this lens, de novo design can incorporate strategies from lead optimization to address the challenge of generating hard-to-synthesize molecules; inversely, lead optimization can benefit from the innovations in de novo design by approaching it as a task of generating molecules conditioned on certain substructures.
Deep Geometry Handling and Fragment-wise Molecular 3D Graph Generation
Zhang, Odin, Huang, Yufei, Cheng, Shichen, Yu, Mengyao, Zhang, Xujun, Lin, Haitao, Zeng, Yundian, Wang, Mingyang, Wu, Zhenxing, Zhao, Huifeng, Zhang, Zaixi, Hua, Chenqing, Kang, Yu, Cui, Sunliang, Pan, Peichen, Hsieh, Chang-Yu, Hou, Tingjun
Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.
Re-Dock: Towards Flexible and Realistic Molecular Docking with Diffusion Bridge
Huang, Yufei, Zhang, Odin, Wu, Lirong, Tan, Cheng, Lin, Haitao, Gao, Zhangyang, Li, Siyuan, Li, Stan. Z.
Accurate prediction of protein-ligand binding structures, a task known as molecular docking is crucial for drug design but remains challenging. While deep learning has shown promise, existing methods often depend on holo-protein structures (docked, and not accessible in realistic tasks) or neglect pocket sidechain conformations, leading to limited practical utility and unrealistic conformation predictions. To fill these gaps, we introduce an under-explored task, named flexible docking to predict poses of ligand and pocket sidechains simultaneously and introduce Re-Dock, a novel diffusion bridge generative model extended to geometric manifolds. Specifically, we propose energy-to-geometry mapping inspired by the Newton-Euler equation to co-model the binding energy and conformations for reflecting the energy-constrained docking generative process. Comprehensive experiments on designed benchmark datasets including apo-dock and cross-dock demonstrate our model's superior effectiveness and efficiency over current methods.
Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation and Elaboration
Lin, Haitao, Huang, Yufei, Zhang, Odin, Wu, Lirong, Li, Siyuan, Chen, Zhiyuan, Li, Stan Z.
In recent years, AI-assisted drug design methods have been proposed to generate molecules given the pockets' structures of target proteins. Most of them are atom-level-based methods, which consider atoms as basic components and generate atom positions and types. In this way, however, it is hard to generate realistic fragments with complicated structures. To solve this, we propose D3FG, a functional-group-based diffusion model for pocket-specific molecule generation and elaboration. D3FG decomposes molecules into two categories of components: functional groups defined as rigid bodies and linkers as mass points. And the two kinds of components can together form complicated fragments that enhance ligand-protein interactions. To be specific, in the diffusion process, D3FG diffuses the data distribution of the positions, orientations, and types of the components into a prior distribution; In the generative process, the noise is gradually removed from the three variables by denoisers parameterized with designed equivariant graph neural networks. In the experiments, our method can generate molecules with more realistic 3D structures, competitive affinities toward the protein targets, and better drug properties. Besides, D3FG as a solution to a new task of molecule elaboration, could generate molecules with high affinities based on existing ligands and the hotspots of target proteins.
Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction
Huang, Yufei, Li, Siyuan, Su, Jin, Wu, Lirong, Zhang, Odin, Lin, Haitao, Qi, Jingqi, Liu, Zihan, Gao, Zhangyang, Liu, Yuyang, Zheng, Jiangbin, Li, Stan. ZQ.
Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods highly rely on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.