AITopics | protein transfer learning

Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems, Dec-25-2025, 05:52:01 GMT

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems.

name change, proceedings, protein transfer learning, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Reviews: Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems, Jan-22-2025, 22:23:01 GMT

The manuscript presents a set of diverse protein prediction tasks, with the purpose of establishing a benchmark for testing representation/transfer learning on protein sequence data. In addition, it establishes a strong baseline for the field by implementing a range of standard sequence models and demonstrating their performance on the benchmark set. I expect both the benchmark set and the results reported in this paper to have a substantial impact on the community. Below are some comments and suggestions for changes. Page 3: Since the goal is to "ensure that no test proteins are closely related to train proteins", it would be informative if the authors could state the expected (or maximum) sequence identity between PFAM families. Wouldn't it have made sense to do the split at the clan level, to reduce the chance of information leakage between families within the same superfamily? About task 2: much of the recent progress in protein structure prediction comes from predicting distance distributions rather than from a simple binary classification of contact presence.
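The reviewer's question about splitting at the clan level can be illustrated with a small group-aware holdout. This is a minimal sketch, not the procedure used in TAPE: the function `split_by_group`, the record fields, and the family and clan identifiers are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Hold out whole groups so that no group appears in both train and test.

    records: list of dicts; group_key: the field naming each record's group
    (for example a family or a clan identifier).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    # Sort before shuffling so the result depends only on the seed.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Always hold out at least one group.
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = group_ids[:n_test]
    train_ids = group_ids[n_test:]

    train = [r for g in train_ids for r in groups[g]]
    test = [r for g in test_ids for r in groups[g]]
    return train, test


# Hypothetical toy records: families F1 and F2 share clan C1, so a
# family-level split could place related proteins on both sides.
records = [
    {"id": "p1", "family": "F1", "clan": "C1"},
    {"id": "p2", "family": "F1", "clan": "C1"},
    {"id": "p3", "family": "F2", "clan": "C1"},
    {"id": "p4", "family": "F3", "clan": "C2"},
    {"id": "p5", "family": "F4", "clan": "C3"},
    {"id": "p6", "family": "F5", "clan": "C3"},
]

train, test = split_by_group(records, "clan", test_fraction=0.34)
```

Passing `"family"` instead of `"clan"` as `group_key` gives the finer-grained split; the coarser key trades a smaller pool of held-out groups for less chance of related sequences crossing the split.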

benchmark, protein transfer learning, representation, (13 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.61)

Reviews: Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems, Jan-22-2025, 22:22:50 GMT

The contributions of this paper are multi-dimensional and highly significant: (i) developing a set of benchmarks for a diverse set of prediction tasks, (ii) demonstrating the utility of incorporating the vast amount of unlabeled protein data to pre-train models via semi-supervised learning, and (iii) making the unlabeled data and pre-trained models publicly available. This work will make a significant impact on the field by establishing solid benchmarks and facilitating the introduction of challenging protein prediction tasks to the machine learning community. The paper is extremely clearly written, well-structured and very concise. All reviewers are satisfied by the author response.

prediction task, protein transfer learning

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.73)

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

Tan, Yang, Li, Mingchen, Tan, Pan, Zhou, Ziyi, Yu, Huiqun, Fan, Guisheng, Hong, Liang

arXiv.org Artificial Intelligence, Oct-26-2023

Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.
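To make the role of vocabulary size concrete, here is a minimal sketch of byte-pair encoding (BPE) applied to amino-acid sequences. BPE is used only as one common sub-word scheme; the abstract does not name the three tokenization methods PETA compares, and this is not the code from the PETA repository. The functions `learn_bpe`, `merge_pair`, and `tokenize` and the toy sequences are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) pair with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, vocab_size):
    """Learn BPE merges until the vocabulary reaches vocab_size (if possible)."""
    corpus = [list(seq) for seq in sequences]
    vocab = sorted({tok for seq in corpus for tok in seq})
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every sequence is a single token; nothing left to merge
        # Most frequent adjacent pair; ties broken alphabetically.
        (left, right), _ = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        merges.append((left, right))
        if left + right not in vocab:
            vocab.append(left + right)
        corpus = [merge_pair(seq, left, right) for seq in corpus]
    return vocab, merges


def tokenize(sequence, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


# Hypothetical toy sequences, not real proteins.
sequences = ["MKVLAAGMKVL", "MKVLGG", "AAGMKV"]
vocab, merges = learn_bpe(sequences, vocab_size=9)
tokens = tokenize("MKVLAAGMKVL", merges)
```

A larger `vocab_size` yields longer tokens and shorter tokenized sequences, at the cost of each token being seen less often during pre-training, which is the trade-off the abstract's vocabulary-size comparison explores.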

downstream application, protein transfer learning, sub-word tokenization, (1 more...)

arXiv.org Artificial Intelligence

2310.17415

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.53)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.60)

Evaluating Protein Transfer Learning with TAPE

Rao, Roshan, Bhattacharya, Nicholas, Thomas, Neil, Duan, Yan, Chen, Peter, Canny, John, Abbeel, Pieter, Song, Yun

Neural Information Processing Systems, Mar-19-2020, 00:31:45 GMT

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques.

paradigm, protein transfer learning, sequence

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.40)
