 rfdiffusion



Controllable protein design through Feynman-Kac steering

Hartman, Erik, Wallin, Jonas, Malmström, Johan, Olsson, Jimmy

arXiv.org Machine Learning

Diffusion-based models have recently enabled the generation of realistic and diverse protein structures, yet they remain limited in their ability to steer outcomes toward specific functional or biochemical objectives, such as binding affinity or sequence composition. Here we extend the Feynman-Kac (FK) steering framework, an inference-time control approach, to diffusion-based protein design. By coupling FK steering with structure generation, the method guides sampling toward desirable structural or energetic features while maintaining the diversity of the underlying diffusion process. To enable simultaneous generation of both sequence and structure properties, rewards are computed on models refined through ProteinMPNN and all-atom relaxation. Applied to binder design, FK steering consistently improves predicted interface energetics across diverse targets with minimal computational overhead. More broadly, this work demonstrates that inference-time FK control generalizes diffusion-based protein design to arbitrary, non-differentiable, and reward-agnostic objectives, providing a unified and model-independent framework for guided molecular generation.
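The inference-time control idea described in this abstract can be illustrated with a small particle sketch. This is a toy under stated assumptions, not the authors' implementation: a 1D random walk (`toy_step`) stands in for the diffusion model's reverse step, `toy_reward` stands in for a reward such as a predicted interface energy, and the weights use a reward-difference potential scaled by `lam`, which is one common choice assumed here rather than taken from the abstract. All function and parameter names are hypothetical.

```python
import math
import random


def fk_steer(step, reward, n_particles=64, n_steps=20, lam=5.0, seed=0):
    """Toy particle-based steering: propagate, weight by reward gain, resample."""
    rng = random.Random(seed)
    particles = [0.0] * n_particles
    prev_reward = [reward(x) for x in particles]
    for _ in range(n_steps):
        # Propagate every particle with the (unmodified) base sampler.
        particles = [step(x, rng) for x in particles]
        rewards = [reward(x) for x in particles]
        # Assumed potential: exp(lam * (r_t - r_{t-1})), computed stably.
        log_w = [lam * (r - p) for r, p in zip(rewards, prev_reward)]
        top = max(log_w)
        weights = [math.exp(v - top) for v in log_w]
        # Multinomial resampling: high-reward particles are duplicated.
        picked = rng.choices(range(n_particles), weights=weights, k=n_particles)
        particles = [particles[i] for i in picked]
        prev_reward = [rewards[i] for i in picked]
    return particles


def toy_step(x, rng):
    # Hypothetical stand-in for one reverse-diffusion step.
    return x + rng.gauss(0.0, 1.0)


def toy_reward(x):
    # Hypothetical stand-in for a black-box, non-differentiable score.
    return x


steered = fk_steer(toy_step, toy_reward, lam=5.0)
unsteered = fk_steer(toy_step, toy_reward, lam=0.0)
print(sum(steered) / len(steered), sum(unsteered) / len(unsteered))
```

Because the reward is only evaluated, never differentiated, the same loop would work with any scoring function. Setting `lam=0.0` gives uniform weights, so the reward no longer influences which particles survive.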


Protein generation with embedding learning for motif diversification

Michalewicz, Kevin, Jin, Chen, Teare, Philip Alexander, Diethe, Tom, Barahona, Mauricio, Bravi, Barbara, Mullokandov, Asher

arXiv.org Machine Learning

A fundamental challenge in protein design is the trade-off between generating structural diversity and preserving the biological function of a motif. Current state-of-the-art methods, such as partial diffusion in RFdiffusion, often fail to resolve this trade-off: small perturbations yield motifs nearly identical to the native structure, whereas larger perturbations violate the geometric constraints necessary for biological function. We introduce Protein Generation with Embedding Learning (PGEL), a general framework that learns high-dimensional embeddings encoding sequence and structural features of a target motif in the representation space of a diffusion model's frozen denoiser, and then enhances motif diversity by introducing controlled perturbations in the embedding space. PGEL is thus able to loosen geometric constraints while satisfying typical design metrics, leading to more diverse yet viable structures. We demonstrate PGEL on three representative cases: a monomer, a protein-protein interface, and a cancer-related transcription factor complex. In all cases, PGEL achieves greater structural diversity, better designability, and improved self-consistency compared to partial diffusion. Our results establish PGEL as a general strategy for embedding-driven protein generation, allowing for systematic, viable diversification of functional motifs.


Constrained Diffusion for Protein Design with Hard Structural Constraints

Christopher, Jacob K., Seamann, Austin, Cui, Jingyi, Khare, Sagar, Fioretto, Ferdinando

arXiv.org Artificial Intelligence

Diffusion models offer a powerful means of capturing the manifold of realistic protein structures, enabling rapid design for protein engineering tasks. However, existing approaches exhibit critical failure modes when precise constraints are necessary for functional design. To this end, we present a constrained diffusion framework for structure-guided protein design, ensuring strict adherence to functional requirements while maintaining precise stereochemical and geometric feasibility. We evaluate on challenging protein design tasks, including motif scaffolding and vacancy-constrained pocket design, while introducing a novel curated benchmark dataset for motif scaffolding in the PDZ domain. Our approach achieves state-of-the-art results, providing perfect satisfaction of bonding and geometric constraints with no degradation in structural diversity. Diffusion models have revolutionized protein engineering with notable successes demonstrated in the design of protein monomers, assemblies, and protein binders against biomolecular targets (Watson et al., 2023). In many cases, predefined binding or catalytic motifs are introduced into designed proteins via motif scaffolding, but there are no guarantees that the generated backbones will accurately include the motif (Trippe et al., 2022; Didi et al., 2023). Furthermore, the motifs are typically pre-defined as structural fragments, rather than more physically-based (e.g. These obstacles restrict the scope of design goals accessible to current methods.


Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Neural Information Processing Systems

In proteins, the amino-acid sequences determine the interaction between protein backbones and side chains, which fold into a distribution of protein structures. Consequently, the functional properties of protein structures can be inferred from their sequences.


Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design

Yu, Lang, Gao, Zhangyang, Tan, Cheng, Chen, Qin, Zhou, Jie, He, Liang

arXiv.org Artificial Intelligence

SE(3)-based generative models have shown great promise in protein geometry modeling and effective structure design. However, the field currently lacks a modularized benchmark to enable comprehensive investigation and fair comparison of different methods. In this paper, we propose Protein-SE(3), a new benchmark based on a unified training framework, which comprises protein scaffolding tasks, integrated generative models, high-level mathematical abstraction, and diverse evaluation metrics. Recent advanced generative models designed for protein scaffolding, from multiple perspectives such as DDPM (Genie1 and Genie2), Score Matching (FrameDiff and RFdiffusion), and Flow Matching (FoldFlow and FrameFlow), are integrated into our framework. All integrated methods are fairly investigated with the same training dataset and evaluation metrics. Furthermore, we provide a high-level abstraction of the mathematical foundations behind the generative models, enabling fast prototyping of future algorithms without reliance on explicit protein structures. Accordingly, we release the first comprehensive benchmark built upon a unified training framework for SE(3)-based protein structure design, which is publicly accessible at https://github.com/BruthYU/protein-se3.


A Model-Centric Review of Deep Learning for Protein Design

Kyro, Gregory W., Qiu, Tianyin, Batista, Victor S.

arXiv.org Artificial Intelligence

Deep learning has transformed protein design, enabling accurate structure prediction, sequence optimization, and de novo protein generation. Advances in single-chain protein structure prediction via AlphaFold2, RoseTTAFold, ESMFold, and others have achieved near-experimental accuracy, inspiring successive work extended to biomolecular complexes via AlphaFold Multimer, RoseTTAFold All-Atom, AlphaFold 3, Chai-1, Boltz-1 and others. Generative models such as ProtGPT2, ProteinMPNN, and RFdiffusion have enabled sequence and backbone design beyond natural evolution-based limitations. More recently, joint sequence-structure co-design models, including ESM3, have integrated both modalities into a unified framework, resulting in improved designability. Despite these advances, challenges still exist pertaining to modeling sequence-structure-function relationships and ensuring robust generalization beyond the regions of protein space spanned by the training data. Future advances will likely focus on joint sequence-structure-function co-design frameworks that are able to model the fitness landscape more effectively than models that treat these modalities independently. Current capabilities, coupled with the dizzying rate of progress, suggest that the field will soon enable rapid, rational design of proteins with tailored structures and functions that transcend the limitations imposed by natural evolution. In this review, we discuss the current capabilities of deep learning methods for protein design, focusing on some of the most revolutionary and capable models with respect to their functionality and the applications that they enable, leading up to the current challenges of the field and the optimal path forward.


MotifBench: A standardized protein design benchmark for motif-scaffolding problems

Zheng, Zhuoqi, Zhang, Bo, Didi, Kieran, Yang, Kevin K., Yim, Jason, Watson, Joseph L., Chen, Hai-Feng, Trippe, Brian L.

arXiv.org Artificial Intelligence

The motif-scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been made due to computational evaluation with reliable protein structure prediction and fixed-backbone sequence design methods [1-17]. However, significant variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at github.com/blt2114/MotifBench. The MotifBench test cases are more difficult compared to earlier benchmarks (e.g. A motif-scaffolding method takes a motif as input and returns a set of putatively compatible scaffolds as output. This section details how motifs and scaffolds in MotifBench are specified, proposes metrics by which a scaffold set is evaluated, and describes how these metrics are computed. Appendix A describes considerations upon which these specifications and metrics were chosen. Motif specification (inputs): A motif is specified by the coordinates of the backbone atoms of several residues and (in some cases) the amino acid types of a subset of those residues.
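One way to check whether a scaffold "maintains the geometry" of a motif is to compute the RMSD between the motif's backbone atoms and the corresponding atoms in the scaffold after optimal rigid superposition. The sketch below uses the standard Kabsch algorithm with NumPy. It is a generic illustration only: the excerpt above does not give MotifBench's exact metric definitions or thresholds, and the function name `motif_rmsd` and the example coordinates are hypothetical.

```python
import numpy as np


def motif_rmsd(motif_xyz, scaffold_xyz):
    """RMSD between two (N, 3) coordinate sets after Kabsch superposition."""
    P = np.asarray(motif_xyz, dtype=float)
    Q = np.asarray(scaffold_xyz, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("inputs must both have shape (N, 3)")
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# Hypothetical motif atoms, then the same atoms rotated about z and shifted.
motif = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.4, 0.0],
                  [0.2, 2.1, 0.9], [-0.8, 1.0, 1.7]])
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
placed = motif @ rot_z.T + np.array([3.0, -2.0, 5.0])
print(motif_rmsd(motif, placed))
```

A rigidly moved copy of the motif should score essentially zero, while any distortion of the motif within the scaffold raises the value. The row order of the two arrays defines the atom correspondence, so both inputs must list the same atoms in the same order.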


ProteinWeaver: A Divide-and-Assembly Approach for Protein Backbone Design

Ma, Yiming, Ye, Fei, Zhou, Yi, Zheng, Zaixiang, Xue, Dongyu, Gu, Quanquan

arXiv.org Artificial Intelligence

Nature creates diverse proteins through a 'divide and assembly' strategy. Inspired by this idea, we introduce ProteinWeaver, a two-stage framework for protein backbone design. Our method first generates individual protein domains and then employs an SE(3) diffusion model to flexibly assemble these domains. A key challenge lies in the assembling step, given the complex and rugged nature of the interdomain interaction landscape. To address this challenge, we employ preference alignment to discern complex relationships between structure and interaction landscapes through comparative analysis of generated samples. Comprehensive experiments demonstrate that ProteinWeaver: (1) generates high-quality, novel protein backbones through versatile domain assembly; (2) outperforms RFdiffusion, the current state-of-the-art in backbone design, by 13% and 39% for long-chain proteins; (3) shows the potential for cooperative function design through illustrative case studies. To sum up, by introducing a 'divide-and-assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design. Nature employs a sophisticated 'divide and assemble' strategy to create large and intricate protein structures that meet diverse biological functional needs (Figure 1A) (Pawson & Nash, 2003; Huddy et al., 2024; P Bagowski et al., 2010). This process primarily involves the recombination of existing structural blocks, particularly protein domains, which serve as the fundamental, recurring units in protein structures. Remarkably, a limited number of protein domains (approximately 500 as classified in CATH) suffice to create more than hundreds of thousands of structures satisfying a wide array of functions (Orengo et al., 1997). This strategy enables the creation of multi-domain protein backbones, facilitating the emergence of cooperative functions.
However, our analysis reveals a significant limitation: designability decreases markedly as the backbone length increases (Figure 1E).


Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Huguet, Guillaume, Vuckovic, James, Fatras, Kilian, Thibodeau-Laufer, Eric, Lemos, Pablo, Islam, Riashat, Liu, Cheng-Hao, Rector-Brooks, Jarrid, Akhound-Sadegh, Tara, Bronstein, Michael, Tong, Alexander, Bose, Avishek Joey

arXiv.org Artificial Intelligence

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models, including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structure diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFdiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.