MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins

May-24-2025, 09:03:15 GMT – Neural Information Processing Systems

Accurately identifying active sites in proteins is essential to the life sciences and to pharmaceutical development, because these sites are critical to enzyme activity and drug design. Recent advances in protein language models (PLMs), trained on extensive datasets of amino acid sequences, have significantly improved our understanding of proteins. However, functional annotations, especially precise per-residue annotations, are scarce compared with the abundant protein sequence data, which limits the performance of PLMs. Textual descriptions of proteins, which can be written by human experts or generated by a pretrained protein sequence-to-text model, provide meaningful context that could assist functional annotation tasks such as the localization of active sites. This motivates us to construct a ProTein-Attribute text Dataset (ProTAD), comprising over 570,000 pairs of protein sequences and multi-attribute textual descriptions.

bioinformatics, large language model, machine learning

Country:
- Asia > China
  - Hubei Province (0.14)
- North America
  - Canada > Quebec (0.14)
  - United States > Louisiana (0.14)

Genre:
- Research Report
  - Experimental Study (0.93)
  - New Finding (0.93)

Industry:
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning
      - Neural Networks > Deep Learning (1.00)
      - Performance Analysis > Accuracy (0.93)
      - Statistical Learning (0.67)
    - Natural Language
      - Chatbot (0.67)
      - Large Language Model (1.00)
      - Text Processing (0.67)
  - Biomedical Informatics (0.88)
  - Data Science (0.93)