Xue, Dongyu
ProteinWeaver: A Divide-and-Assembly Approach for Protein Backbone Design
Ma, Yiming, Ye, Fei, Zhou, Yi, Zheng, Zaixiang, Xue, Dongyu, Gu, Quanquan
Nature creates diverse proteins through a 'divide and assembly' strategy. Inspired by this idea, we introduce ProteinWeaver, a two-stage framework for protein backbone design. Our method first generates individual protein domains and then employs an SE(3) diffusion model to flexibly assemble these domains. A key challenge lies in the assembly step, given the complex and rugged nature of the interdomain interaction landscape. To address this challenge, we employ preference alignment to discern complex relationships between structure and interaction landscapes through comparative analysis of generated samples. Comprehensive experiments demonstrate that ProteinWeaver: (1) generates high-quality, novel protein backbones through versatile domain assembly; (2) outperforms RFdiffusion, the current state of the art in backbone design, by 13% and 39% for long-chain proteins; (3) shows the potential for cooperative function design through illustrative case studies. To sum up, by introducing a 'divide-and-assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design.

Nature employs a sophisticated 'divide and assemble' strategy to create large and intricate protein structures that meet diverse biological functional needs (Figure 1A) (Pawson & Nash, 2003; Huddy et al., 2024; P Bagowski et al., 2010). This process primarily involves the recombination of existing structural blocks, particularly protein domains, which serve as the fundamental, recurring units in protein structures. Remarkably, a limited number of protein domains (approximately 500 as classified in CATH) suffice to create hundreds of thousands of structures satisfying a wide array of functions (Orengo et al., 1997). This strategy enables the creation of multi-domain protein backbones, facilitating the emergence of cooperative functions. However, our analysis reveals a significant limitation: designability decreases markedly as the backbone length increases (Figure 1E).
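The two-stage idea can be pictured with a toy sketch: stage one produces per-domain backbones, and stage two places them with rigid SE(3) transforms before refinement. The Python/NumPy code below is purely illustrative; generate_domain, random_se3, and assemble are hypothetical stubs standing in for ProteinWeaver's actual generators, not the paper's code, and the random placement is what the SE(3) diffusion model (fine-tuned with preference alignment) would replace.

    # Toy sketch of a divide-and-assembly pipeline (illustrative stubs only).
    import numpy as np

    def generate_domain(n_residues: int, rng: np.random.Generator) -> np.ndarray:
        """Stub for stage 1: stand-in for a per-domain backbone generator (Ca coords)."""
        return rng.normal(size=(n_residues, 3))

    def random_se3(rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray]:
        """Sample a random rotation (via QR decomposition) and a translation."""
        q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
        if np.linalg.det(q) < 0:          # ensure a proper rotation (det = +1)
            q[:, 0] *= -1
        t = rng.normal(scale=20.0, size=3)
        return q, t

    def assemble(domains: list[np.ndarray], rng: np.random.Generator) -> np.ndarray:
        """Stub for stage 2: place each domain with a rigid transform and concatenate.
        The real method refines these placements with an SE(3) diffusion model."""
        placed = []
        for coords in domains:
            r, t = random_se3(rng)
            placed.append(coords @ r.T + t)
        return np.concatenate(placed, axis=0)

    rng = np.random.default_rng(0)
    backbone = assemble([generate_domain(n, rng) for n in (120, 90)], rng)
    print(backbone.shape)  # (210, 3)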
DPLM-2: A Multimodal Diffusion Protein Language Model
Wang, Xinyou, Zheng, Zaixiang, Ye, Fei, Xue, Dongyu, Huang, Shujian, Gu, Quanquan
Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure-aware representations for predictive tasks.
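The structure tokenizer can be approximated, in spirit, by a lookup-free quantization step: continuous encoder latents are binarized per dimension and the resulting bit pattern is read as an integer token id. The PyTorch snippet below is a minimal sketch under that assumption; lfq_tokenize, the bit width, and the straight-through trick are illustrative choices, not DPLM-2's released implementation.

    # Minimal sketch of lookup-free quantization (LFQ) for structure tokenization.
    import torch

    def lfq_tokenize(z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """z: (num_residues, codebook_bits) continuous latents from a structure encoder.
        Returns quantized latents and integer token ids in [0, 2**bits)."""
        bits = z.shape[-1]
        q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # binarize each dim
        weights = 2 ** torch.arange(bits, device=z.device)               # bit -> integer id
        ids = ((q > 0).long() * weights).sum(dim=-1)
        q = z + (q - z).detach()   # straight-through estimator keeps encoder gradients
        return q, ids

    z = torch.randn(128, 10, requires_grad=True)   # e.g., 10 bits -> 1024 structure tokens
    q, ids = lfq_tokenize(z)
    print(ids.min().item(), ids.max().item())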
Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization
Zhou, Xiangxin, Xue, Dongyu, Chen, Ruizhe, Zheng, Zaixiang, Wang, Liang, Gu, Quanquan
Antibody design, a crucial task with significant implications across various disciplines such as therapeutics and biology, presents considerable challenges due to its intricate nature. In this paper, we tackle antigen-specific antibody sequence-structure co-design as an optimization problem towards specific preferences, considering both rationality and functionality. Leveraging a pre-trained conditional diffusion model that jointly models sequences and structures of antibodies with equivariant neural networks, we propose direct energy-based preference optimization to guide the generation of antibodies with both rational structures and considerable binding affinities to given antigens. Our method involves fine-tuning the pre-trained diffusion model using a residue-level decomposed energy preference. Additionally, we employ gradient surgery to address conflicts between various types of energy, such as attraction and repulsion. Experiments on the RAbD benchmark show that our approach effectively optimizes the energy of generated antibodies and achieves state-of-the-art performance in designing high-quality antibodies with low total energy and high binding affinity simultaneously, demonstrating the superiority of our approach.
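The gradient-surgery component can be pictured with a PCGrad-style projection between two conflicting energy gradients: when the gradients of two energy terms point in opposing directions, the conflicting component of one is removed before they are combined. The function below is a generic sketch of that idea (project_conflicting is a hypothetical name), not the paper's exact procedure.

    # Sketch of gradient surgery between two energy terms (e.g., attraction vs. repulsion).
    import torch

    def project_conflicting(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
        """If flattened gradients g_a and g_b conflict (negative dot product),
        remove from g_a its component along g_b, then sum the two."""
        dot = torch.dot(g_a, g_b)
        if dot < 0:
            g_a = g_a - dot / (g_b.norm() ** 2 + 1e-12) * g_b
        return g_a + g_b

    g_attraction = torch.tensor([1.0, 2.0, -1.0])
    g_repulsion = torch.tensor([-1.0, 0.5, 1.0])
    print(project_conflicting(g_attraction, g_repulsion))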
Diffusion Language Models Are Versatile Protein Learners
Wang, Xinyou, Zheng, Zaixiang, Ye, Fei, Xue, Dongyu, Huang, Shujian, Gu, Quanquan
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel and diverse protein sequences for unconditional generation. We further demonstrate that the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner.

Drawing inspiration from the remarkable progress in NLP achieved by language models (LMs; Devlin et al., 2019; Radford et al., 2018; OpenAI, 2023), thanks to the scalability of Transformers (Vaswani et al., 2017) and the existence of large-scale text data, recent explorations in proteins have also demonstrated the impressive capabilities of protein language models (Rives et al., 2019; Lin et al., 2022; Hu et al., 2022), learned from the universe of evolutionary-scale protein sequences. As a result, protein LMs have become one of the most important cornerstones in AI for protein research, serving a pivotal role not only in predictive tasks (e.g., probing functional properties, and predicting protein structures from single sequences without explicit evolutionary homologs) but also in generative tasks (e.g., redesigning sequences given protein backbone structures, or synthesizing completely new protein sequences).
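A minimal picture of the discrete diffusion objective: corrupt a sequence by masking a random fraction of residues set by a sampled noise level, then train the network to recover the original amino acids at the masked positions. The PyTorch sketch below assumes a toy vocabulary and a stand-in network (diffusion_lm_step is a hypothetical helper); the actual DPLM objective additionally reweights the loss by the noise schedule.

    # One training step of an absorbing-state discrete-diffusion language model (toy sketch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, MASK_ID = 33, 32    # 20 amino acids + specials; last id used as the mask token
    model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))  # stand-in network

    def diffusion_lm_step(seq_tokens: torch.Tensor) -> torch.Tensor:
        """seq_tokens: (batch, length) integer amino-acid tokens."""
        b, l = seq_tokens.shape
        t = torch.rand(b, 1)                         # noise level per sequence in (0, 1)
        mask = torch.rand(b, l) < t                  # mask roughly a t-fraction of positions
        noised = torch.where(mask, torch.full_like(seq_tokens, MASK_ID), seq_tokens)
        logits = model(noised)                       # (b, l, VOCAB)
        return F.cross_entropy(logits[mask], seq_tokens[mask])

    loss = diffusion_lm_step(torch.randint(0, 20, (4, 128)))
    loss.backward()
    print(loss.item())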
Structure-informed Language Models Are Protein Designers
Zheng, Zaixiang, Deng, Yifan, Xue, Dongyu, Zhou, Yi, Ye, Fei, Gu, Quanquan
This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-Design improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
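The "structural surgery" can be sketched as a small cross-attention adapter through which pLM hidden states attend to structure features, with predictions re-fed for a few refinement rounds at inference. The module, shapes, and stand-in pLM below (StructuralAdapter, a toy embedding) are hypothetical illustrations of the idea, not LM-Design's released architecture.

    # Schematic sketch: a lightweight structural adapter plus iterative refinement.
    import torch
    import torch.nn as nn

    class StructuralAdapter(nn.Module):
        def __init__(self, d_model: int = 64, n_heads: int = 4, vocab: int = 33):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.out = nn.Linear(d_model, vocab)

        def forward(self, plm_hidden: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
            # sequence states query the structure features, then a residual + token head
            h, _ = self.cross_attn(plm_hidden, struct_feats, struct_feats)
            return self.out(plm_hidden + h)

    plm = nn.Embedding(33, 64)               # stand-in for a frozen pLM encoder
    adapter = StructuralAdapter()
    struct_feats = torch.randn(1, 100, 64)   # stand-in structure encoder output
    tokens = torch.randint(0, 20, (1, 100))

    for _ in range(3):                        # iterative refinement at inference
        logits = adapter(plm(tokens), struct_feats)
        tokens = logits.argmax(dim=-1)        # re-feed predictions as the next input
    print(tokens.shape)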