Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization

Ryan Park, Darren J. Hsu, C. Brian Roland, Maria Korshunova, Chen Tessler, Shie Mannor, Olivia Viessmann, Bruno Trentini

arXiv.org Artificial Intelligence 

Inverse folding models play an important role in structure-based design by predicting amino acid sequences that fold into desired reference structures. Models like ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure. However, when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure. To address this, we fine-tune ProteinMPNN to produce diverse and structurally consistent peptide sequences via Direct Preference Optimization (DPO). We derive two enhancements to DPO: online diversity regularization and domain-specific priors. Additionally, we develop a new understanding of how to improve diversity in decoder models. When conditioned on OpenFold-generated structures, our fine-tuned models achieve state-of-the-art structural similarity scores, improving on base ProteinMPNN by at least 8%. Compared to standard DPO, our regularized method achieves up to 20% higher sequence diversity with no loss in structural similarity score.

Engineering biopolymers that fold into desired 3D structures, a computational challenge known as the inverse protein folding problem, has broad applications in drug discovery and materials science (Yang et al., 2023; Dill et al., 2008; Abascal & Regan, 2018). Several approaches to inverse folding have been adopted over the past decades, ranging from molecular dynamics simulations to machine learning methods (Dauparas et al., 2022b; Shanker et al., 2023; Hsu et al., 2022a; Yi et al., 2023; Correa, 1990).
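To make the kind of objective described above concrete, here is a minimal sketch of a DPO loss augmented with a diversity penalty. The function name, the pairwise_seq_similarity input, and the specific form of the penalty are illustrative assumptions; the paper's actual online diversity regularizer and domain-specific priors may take a different form.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_diversity(
    logp_chosen,            # policy log-probs of preferred sequences, shape (B,)
    logp_rejected,          # policy log-probs of dispreferred sequences, shape (B,)
    ref_logp_chosen,        # frozen reference-model log-probs of preferred sequences
    ref_logp_rejected,      # frozen reference-model log-probs of dispreferred sequences
    pairwise_seq_similarity,  # mean pairwise similarity of online samples, in [0, 1]
    beta=0.1,               # DPO temperature
    lam=0.05,               # weight of the (hypothetical) diversity penalty
):
    # Standard DPO term: reward margin between chosen and rejected responses,
    # measured as policy-vs-reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    dpo = -F.logsigmoid(margin).mean()

    # Illustrative diversity regularizer: penalize high average pairwise
    # similarity among sequences sampled online, discouraging repetitive outputs.
    diversity_penalty = lam * pairwise_seq_similarity.mean()

    return dpo + diversity_penalty
```

In practice such a loss would be computed over minibatches of preference pairs sampled from the fine-tuned decoder, with the similarity term recomputed from freshly drawn sequences at each step (the "online" aspect); this sketch only shows the per-batch objective.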