dplm
CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models
Yin, Junbo, Zha, Chao, He, Wenjia, Xu, Chencheng, Gao, Xin
Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at https://github.com/yinjunbo/cfpgen.
Diffusion Language Models Are Versatile Protein Learners
Wang, Xinyou, Zheng, Zaixiang, Ye, Fei, Xue, Dongyu, Huang, Shujian, Gu, Quanquan
Drawing inspiration from the remarkable This paper introduces diffusion protein language progress in NLP achieved by language models (LMs; Devlin model (DPLM), a versatile protein language et al., 2019; Radford et al., 2018; OpenAI, 2023) thanks to model that demonstrates strong generative and the scalability of Transformers (Vaswani et al., 2017) and predictive capabilities for protein sequences. We the existence of large-scale text data, recent explorations in first pre-train scalable DPLMs from evolutionaryscale protein has also demonstrated the impressive capabilities of protein sequences within a generative selfsupervised protein language models (Rives et al., 2019; Lin et al., 2022; discrete diffusion probabilistic framework, Hu et al., 2022), learned from the universe of evolutionaryscale which generalizes language modeling for protein sequences. As a result, protein LMs have proteins in a principled way. After pre-training, become one of the most important cornerstones in AI for DPLM exhibits the ability to generate structurally protein research, serving a pivotal role not only in predictive plausible, novel and diverse protein sequences tasks (e.g., probing functional properties, and predicting for unconditional generation. We further protein structures from single sequences without explicit demonstrate the proposed diffusion generative evolutionary homologs) but also in generative tasks (e.g., pre-training make DPLM possess a better redesigning sequences given protein backbone structures, or understanding of proteins, making it a superior synthesizing completely new protein sequences).
Unsupervised Vehicle Re-Identification via Self-supervised Metric Learning using Feature Dictionary
The key challenge of unsupervised vehicle re-identification (Re-ID) is learning discriminative features from unlabelled vehicle images. Numerous methods using domain adaptation have achieved outstanding performance, but those methods still need a labelled dataset as a source domain. This paper addresses an unsupervised vehicle Re-ID method, which no need any types of a labelled dataset, through a Self-supervised Metric Learning (SSML) based on a feature dictionary. Our method initially extracts features from vehicle images and stores them in a dictionary. Thereafter, based on the dictionary, the proposed method conducts dictionary-based positive label mining (DPLM) to search for positive labels. Pair-wise similarity, relative-rank consistency, and adjacent feature distribution similarity are jointly considered to find images that may belong to the same vehicle of a given probe image. The results of DPLM are applied to dictionary-based triplet loss (DTL) to improve the discriminativeness of learnt features and to refine the quality of the results of DPLM progressively. The iterative process with DPLM and DTL boosts the performance of unsupervised vehicle Re-ID. Experimental results demonstrate the effectiveness of the proposed method by producing promising vehicle Re-ID performance without a pre-labelled dataset. The source code for this paper is publicly available on `https://github.com/andreYoo/VeRI_SSML_FD.git'.