Reviews: Evaluating Protein Transfer Learning with TAPE

Neural Information Processing Systems 

The manuscript presents a set of diverse protein prediction tasks, with the purpose of establishing a benchmark for testing representation/transfer learning on protein sequence data. In addition, it establishes a strong baseline for the field by implementing a range of standard sequence models and demonstrating their performance on the benchmark set. I expect both the benchmark set and the results reported in this paper to have a substantial impact on the community.

Below are some comments and suggestions for changes.

Page 3. Since the goal is to "ensure that no test proteins are closely related to train proteins", it would be informative if the authors could state the expected (or maximum) sequence identity between PFAM families. Wouldn't it have made sense to do the split at the clan level, to reduce the chance of information leakage between families within the same superfamily?

About task 2: much of the recent progress in protein structure prediction comes from predicting distance distributions rather than from a simple binary classification of contact presence.
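
To make the point about task 2 concrete, here is a minimal sketch of how the two kinds of labels differ when derived from the same coordinates. The 8 Å contact threshold, the bin edges, and all function names are assumptions chosen for illustration, not the authors' setup, and the coordinates are toy values rather than real protein data.

```python
import math

# Assumed values for illustration only (in angstroms).
CONTACT_THRESHOLD = 8.0
BIN_EDGES = [4.0, 8.0, 12.0, 16.0]  # 5 bins: <4, 4-8, 8-12, 12-16, >=16


def pairwise_distances(coords):
    """Return the symmetric matrix of Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def binary_contacts(dist, threshold=CONTACT_THRESHOLD):
    """Label each residue pair 1 if closer than the threshold, else 0."""
    return [[1 if d < threshold else 0 for d in row] for row in dist]


def distance_bins(dist, edges=BIN_EDGES):
    """Label each residue pair with the index of the distance bin it falls in."""
    return [[sum(d >= e for e in edges) for d in row] for row in dist]


# Toy coordinates: four points on a line, standing in for residue positions.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
dist = pairwise_distances(coords)
contacts = binary_contacts(dist)
bins = distance_bins(dist)

print(contacts[0])  # binary labels for pairs involving the first point
print(bins[0])      # binned distance labels for the same pairs
```

In this toy case the pairs at 10 Å and 20 Å receive the same binary label (no contact) but fall into different distance bins, which is the extra signal a distance-distribution target would retain.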