Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
Min, Seonwoo, Park, Seunghyun, Kim, Siwon, Choi, Hyun-Soo, Yoon, Sungroh
Seonwoo Min,1 Seunghyun Park,2 Siwon Kim,1 Hyun-Soo Choi,1 Sungroh Yoon1,3,†
1 Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea
2 Clova AI Research, NAVER Corp., Seongnam 13561, Korea
3 Interdisciplinary Program in Bioinformatics, ASRI, INMC, and ISRC, Seoul National University, Seoul 08826, Korea
† Correspondence to: sryoon@snu.ac.kr

Abstract
The structure of a protein has a direct impact on its properties and functions. However, identifying structural similarity directly from amino acid sequences remains a challenging problem in computational biology. In this paper, we introduce a novel BERT-wise pre-training scheme for a protein sequence representation model called PLUS, which stands for Protein sequence representations Learned Using Structural information. Just as natural language representation models capture syntactic and semantic information of words from a large unlabeled text corpus, PLUS captures structural information of amino acids from a large weakly labeled protein database. Since the Transformer encoder, BERT's original model architecture, has a severe computational requirement to handle long sequences, we first propose to combine a bidirectional recurrent neural network with the BERT-wise pre-training scheme. PLUS is designed to learn protein representations with two pre-training objectives, i.e., masked language modeling and same family prediction. The pre-trained model can then be fine-tuned for a wide range of tasks without training randomly initialized task-specific models from scratch.

Introduction
Proteins, which consist of linear chains of amino acids, are the most versatile molecules in living organisms.
They serve vital functions in almost every biological mechanism, e.g., transmitting nerve pulses, storing and transporting other molecules, and providing immune protection (Berg, Tymoczko, and Stryer 2006).
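
The two pre-training objectives named in the abstract, masked language modeling and same family prediction, can be illustrated at the data-preparation level with a minimal sketch. This is not the authors' implementation: the mask symbol, the 15% masking rate (borrowed from BERT), the always-replace masking policy, the toy sequences, and the family identifiers are all assumptions made for illustration, and the function names `mask_sequence` and `make_pair_example` are hypothetical.

```python
import random

MASK = "#"  # hypothetical mask symbol, not taken from the paper


def mask_sequence(seq, rate=0.15, rng=random):
    """Return (masked_seq, targets); targets maps position -> original residue.

    Assumption: every selected position is replaced by MASK. BERT itself
    sometimes keeps the token or substitutes a random one instead.
    """
    if not seq:
        return seq, {}
    n_mask = max(1, round(len(seq) * rate))  # mask at least one residue
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = chars[pos]  # residue the model would have to recover
        chars[pos] = MASK
    return "".join(chars), targets


def make_pair_example(proteins, rng=random):
    """Sample two proteins and label whether they share a family.

    `proteins` is a list of (sequence, family_id) tuples; the family ids
    stand in for the weak labels of a protein database.
    """
    (seq_a, fam_a), (seq_b, fam_b) = rng.sample(proteins, 2)
    masked_a, targets_a = mask_sequence(seq_a, rng=rng)
    masked_b, targets_b = mask_sequence(seq_b, rng=rng)
    return {
        "inputs": (masked_a, masked_b),           # what the encoder would see
        "mlm_targets": (targets_a, targets_b),    # masked language modeling labels
        "same_family": fam_a == fam_b,            # same family prediction label
    }


# Toy usage with made-up sequences and family ids.
toy_proteins = [
    ("MKTAYIAKQR", "famA"),
    ("MKTAYLAKQR", "famA"),
    ("GSHMLEDPVA", "famB"),
]
example = make_pair_example(toy_proteins, rng=random.Random(0))
```

Each example carries both kinds of supervision at once: per-residue targets for the masked positions and a single binary label for the pair. A sequence encoder, such as the bidirectional recurrent network the abstract mentions, would be trained on both signals jointly.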
Nov-25-2019