Goto

Collaborating Authors

 residue type



Generative design and validation of therapeutic peptides for glioblastoma based on a potential target ATP5A

Qian, Hao, You, Pu, Zeng, Lin, Zhou, Jingyuan, Huang, Dengdeng, Li, Kaicheng, Tu, Shikui, Xu, Lei

arXiv.org Artificial Intelligence

Glioblastoma (GBM) remains the most aggressive tumor, urgently requiring novel therapeutic strategies. Here, we present a dry-to-wet framework combining generative modeling and experimental validation to optimize peptides targeting ATP5A, a potential peptide-binding protein for GBM. Our framework introduces the first lead-conditioned generative model, which focuses exploration on geometrically relevant regions around lead peptides and mitigates the combinatorial complexity of de novo methods. Specifically, we propose POTFlow, a \underline{P}rior and \underline{O}ptimal \underline{T}ransport-based \underline{Flow}-matching model for peptide optimization. POTFlow employs secondary structure information (e.g., helix, sheet, loop) as geometric constraints, which are further refined by optimal transport to produce shorter flow paths. With this design, our method achieves state-of-the-art performance compared with five popular approaches. When applied to GBM, our method generates peptides that selectively inhibit cell viability and significantly prolong survival in a patient-derived xenograft (PDX) model. As the first lead peptide-conditioned flow matching model, POTFlow holds strong potential as a generalizable framework for therapeutic peptide design.





Guided Sequence-Structure Generative Modeling for Iterative Antibody Optimization

Raghu, Aniruddh, Ober, Sebastian, Kazman, Maxwell, Elliott, Hunter

arXiv.org Artificial Intelligence

Therapeutic antibody candidates often require extensive engineering to improve key functional and developability properties before clinical development. This can be achieved through iterative design, where starting molecules are optimized over several rounds of in vitro experiments. While protein structure can provide a strong inductive bias, it is rarely used in iterative design due to the lack of structural data for continually evolving lead molecules over the course of optimization. In this work, we propose a strategy for iterative antibody optimization that leverages both sequence and structure as well as accumulating lab measurements of binding and developability. Building on prior work, we first train a sequence-structure diffusion generative model that operates on antibody-antigen complexes. We then outline an approach to use this model, together with carefully predicted antibody-antigen complexes, to optimize lead candidates throughout the iterative design process. Further, we describe a guided sampling approach that biases generation toward desirable properties by integrating models trained on experimental data from iterative design. We evaluate our approach in multiple in silico and in vitro experiments, demonstrating that it produces high-affinity binders at multiple stages of an active antibody optimization campaign. Therapeutic antibodies are a flexible and rapidly-growing class of drugs that have already successfully been used to treat a wide range of diseases (Carter & Lazar, 2018).


Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning

Wu, Lirong, Tian, Yijun, Lin, Haitao, Huang, Yufei, Li, Siyuan, Chawla, Nitesh V, Li, Stan Z.

arXiv.org Artificial Intelligence

Protein-protein bindings play a key role in a variety of fundamental biological processes, and thus predicting the effects of amino acid mutations on protein-protein binding is crucial. To tackle the scarcity of annotated mutation data, pre-training with massive unlabeled data has emerged as a promising solution. However, this process faces a series of challenges: (1) complex higher-order dependencies among multiple (more than paired) structural scales have not yet been fully captured; (2) it is rarely explored how mutations alter the local conformation of the surrounding microenvironment; (3) pre-training is costly, both in data size and computational burden. In this paper, we first construct a hierarchical prompt codebook to record common microenvironmental patterns at different structural scales independently. Then, we develop a novel codebook pre-training task, namely masked microenvironment modeling, to model the joint distribution of each mutation with their residue types, angular statistics, and local conformational changes in the microenvironment. With the constructed prompt codebook, we encode the microenvironment around each mutation into multiple hierarchical prompts and combine them to flexibly provide information to wild-type and mutated protein complexes about their microenvironmental differences. Such a hierarchical prompt learning framework has demonstrated superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction and a case study of optimizing human antibodies against SARS-CoV-2.


Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design

Stärk, Hannes, Jing, Bowen, Barzilay, Regina, Jaakkola, Tommi

arXiv.org Artificial Intelligence

A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Designing proteins that can bind small molecules has many applications, ranging from drug synthesis to energy storage or gene editing. Indeed, a key part of any protein's function derives from its ability to bind and interact with other molecular species. For example, we may design proteins that act as antidotes that sequester toxins or design enzymes that enable chemical reactions through catalysis, which plays a major role in most biological processes. Specifically, we aim to design protein pockets to bind a certain small molecule (called ligand). We assume that we are given a protein pocket via the 3D backbone atom locations of its residues as well as the 2D chemical graph of the ligand. We do not assume any knowledge of the 3D structure or the binding pose of the ligand. Based on this information, our goal is to predict the amino acid identities for the given backbone locations (see Figure 1). We also consider the more challenging task of designing pockets that simultaneously bind multiple molecules and ions (which we call multi-ligand). Such multi-ligand binding proteins are important, for example, in enzyme design, where the ligands correspond to reactants.


Protein Sequence and Structure Co-Design with Equivariant Translation

Shi, Chence, Wang, Chuanrui, Lu, Jiarui, Zhong, Bozitao, Tang, Jian

arXiv.org Artificial Intelligence

Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons geometrical constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity as regards both sequence and structure, with running time orders of magnitude less than sampling-based methods. Proteins are macromolecules that mediate the fundamental processes of all living organisms. For decades, people are seeking to design novel proteins with desired properties (Huang et al., 2016), a problem known as de novo protein design.