A Text-guided Protein Design Framework

Liu, Shengchao; Li, Yanjing; Li, Zhuoxinran; Gitter, Anthony; Zhu, Yutao; Lu, Jiarui; Xu, Zhao; Nie, Weili; Ramanathan, Arvind; Xiao, Chaowei; Tang, Jian; Guo, Hongyu; Anandkumar, Anima

Dec-3-2023–arXiv.org Machine Learning

Machine learning (ML) has recently shown profound potential for protein discovery. ML tools have quickly been adopted to assist and accelerate scientific pipelines, including but not limited to protein engineering [1], structure prediction [2], structure reconstruction [3], and inverse folding [4]. Meanwhile, a tremendous amount of human-curated knowledge describing proteins' high-level functionalities exists in text form. Yet whether incorporating such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates the protein representation from the text modality; and a decoder, which creates protein sequences from that representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text-protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) the best hit ratio on 10 zero-shot text-guided protein editing tasks; and (3) superior performance on four out of six protein property prediction benchmarks.
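To make the alignment step more concrete, the following is a minimal sketch of how text and protein embeddings could be pulled together with a contrastive objective. It assumes a symmetric InfoNCE-style loss over cosine similarities, which the abstract does not specify. The function names, the temperature value, and the toy 2-dimensional embeddings are all hypothetical and are not taken from the ProteinDT code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(text_embs, protein_embs):
    """sim[i][j] = cosine similarity between text i and protein j."""
    texts = [l2_normalize(v) for v in text_embs]
    proteins = [l2_normalize(v) for v in protein_embs]
    return [[sum(a * b for a, b in zip(t, p)) for p in proteins] for t in texts]


def _cross_entropy_on_diagonal(sim, temperature):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def contrastive_loss(sim, temperature=0.1):
    """Assumed symmetric InfoNCE: average of text-to-protein and protein-to-text terms."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_on_diagonal(sim, temperature)
                  + _cross_entropy_on_diagonal(transposed, temperature))


def retrieve(sim):
    """For each text, return the index of the most similar protein."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy embeddings (hypothetical): pair i of text and protein should match.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_proteins = [[0.9, 0.1], [0.2, 0.8]]
shuffled_proteins = [[0.2, 0.8], [0.9, 0.1]]

aligned_sim = similarity_matrix(text_embs, aligned_proteins)
shuffled_sim = similarity_matrix(text_embs, shuffled_proteins)

print("loss (aligned): ", contrastive_loss(aligned_sim))
print("loss (shuffled):", contrastive_loss(shuffled_sim))
print("retrieved protein per text:", retrieve(aligned_sim))
```

In this toy setting the loss is lower when matching pairs sit close together, and nearest-neighbor retrieval over the similarity matrix recovers the paired protein for each text. A learned facilitator and decoder would then operate on representations from such a shared space; they are not sketched here.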

bioinformatics, machine learning, natural language


Country:
- Asia (0.67)
- North America
  - Canada (0.93)
  - United States > California (0.67)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
    - Natural Language
      - Large Language Model (0.89)
      - Text Processing (0.67)
    - Representation & Reasoning (1.00)
    - Vision (0.93)
  - Biomedical Informatics > Translational Bioinformatics (0.93)
