MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins
Neural Information Processing Systems
The accurate identification of active sites in proteins is essential for the advancement of life sciences and pharmaceutical development, as these sites are critical to enzyme activity and drug design. Recent advancements in protein language models (PLMs), trained on extensive datasets of amino acid sequences, have significantly improved our understanding of proteins. However, compared to the abundant protein sequence data, functional annotations, especially precise per-residue annotations, are scarce, which limits the performance of PLMs. On the other hand, textual descriptions of proteins, which can be written by human experts or generated by a pretrained protein sequence-to-text model, provide meaningful context that could assist functional annotation tasks such as the localization of active sites. This motivates us to construct a ProTein-Attribute text Dataset (ProTAD), comprising over 570,000 pairs of protein sequences and multi-attribute textual descriptions.
May-24-2025, 09:03:15 GMT
- Country:
  - Asia > China
    - Hubei Province (0.14)
  - North America
    - Canada > Quebec (0.14)
    - United States > Louisiana (0.14)
- Genre:
  - Research Report
    - Experimental Study (0.93)
    - New Finding (0.93)
- Industry:
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning
        - Neural Networks > Deep Learning (1.00)
        - Performance Analysis > Accuracy (0.93)
        - Statistical Learning (0.67)
      - Natural Language
        - Chatbot (0.67)
        - Large Language Model (1.00)
        - Text Processing (0.67)
    - Biomedical Informatics (0.88)
    - Data Science (0.93)