Goto

Collaborating Authors

 protein transfer learning


Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems.


Reviews: Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems

The manuscript presents a set of diverse protein prediction tasks, with the purpose of establishing a benchmark for testing representation/transfer learning on protein sequence data. In addition, it establishes a strong baseline for the field by implementing a range of different standard sequence models, and demonstrating their performance on a benchmark set. I expect both the benchmark set, and the results reported in this paper to have a substantial impact on the community. Below are some comments and suggestions for changes Page 3. Since the goal is to "ensure that no test proteins are closely related to train proteins", it would be informative if the authors could state the expected (or maximum) sequence identity between PFAM families. Wouldn't it have made sense to do the split at the clan level, to reduce the chance of information leakage between families within the same superfamily? About task 2: much of recent progress in protein structure prediction comes from prediction of distance distributions rather than a simple binary classification of contact presence.


Reviews: Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems

The contributions of this paper are multi-dimensional and highly significant: (i) developing a set of benchmarks for a diverse prediction tasks, (ii) demonstrating the utility of incorporating the vast amount of unlabeled protein data to pre-train models via semi-supervised learning, and (iii) the unlabeled data and pre-trained models made publicly available. This work will make a significant impact on the field by establishing solid benchmarks and facilitate the introduction of challenging protein prediction tasks to the machine learning community. The paper is extremely clearly written, well-structured and very concise. All reviewers are satisfied by the author response.


Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques.


PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

arXiv.org Artificial Intelligence

Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.


Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques.