Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol

arXiv.org Artificial Intelligence 

The explosion of available biological sequence data has led to multiple computational approaches that aim to infer the three-dimensional structure, biological function, fitness, and evolutionary history of proteins from sequence data [1, 2]. Recently, self-supervised deep learning models based on natural language processing methods, especially attention [3] and transformers [4], have been trained on large ensembles of protein sequences using the masked language modeling objective, which consists of filling in masked amino acids in a sequence given the surrounding ones [5-10]. These models capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. In particular, they can predict structural contacts from single sequences in an unsupervised way [7], presumably by transferring knowledge from their large training set [11]. Neural network architectures based on attention are also employed in the Evoformer blocks in AlphaFold [12], as well as in RoseTTAFold [13] and RGN2 [14], and they contributed to the recent breakthrough in the supervised prediction of protein structure.

Protein sequences can be classified into families of homologous proteins, which descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins thus provides substantial information about functional and structural constraints [1]. The statistics of MSA columns, which represent amino-acid sites, make it possible to identify functional residues that are conserved during evolution, and correlations of amino-acid usage between columns contain key information about functional sectors and structural contacts [15-18]. Indeed, over the course of evolution, contacting amino acids need to maintain their physico-chemical complementarity, which leads to correlated amino-acid usage at these sites: this is known as coevolution.
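To make the two column-level signals described above concrete, the following minimal sketch computes a per-column Shannon entropy (low entropy indicates a conserved site) and the mutual information between two columns (high values indicate correlated amino-acid usage). It is an illustration only, not the method used in the paper: the toy alignment and the function names `column`, `entropy`, and `mutual_information` are hypothetical, and real coevolution analyses typically add sequence reweighting, pseudocounts, and corrections for phylogenetic and finite-size effects.

```python
import math
from collections import Counter


def column(msa, i):
    """Return the symbols found at site i across all aligned sequences."""
    return [seq[i] for seq in msa]


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(msa, i, j):
    """Mutual information (in bits) between sites i and j: H(i) + H(j) - H(i, j)."""
    col_i, col_j = column(msa, i), column(msa, j)
    joint = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(joint)


# Hypothetical toy alignment (made up for illustration, not real data).
# Site 0 is fully conserved; sites 2 and 3 always change together.
toy_msa = [
    "AKLE",
    "AKVD",
    "ARLE",
    "ARVD",
    "AKLE",
    "ARVD",
]

for site in range(len(toy_msa[0])):
    print(f"site {site}: entropy = {entropy(column(toy_msa, site)):.3f} bits")

print(f"MI(2, 3) = {mutual_information(toy_msa, 2, 3):.3f} bits")
print(f"MI(1, 2) = {mutual_information(toy_msa, 1, 2):.3f} bits")
```

In this toy alignment, the fully conserved site has zero entropy, and the pair of sites that always vary together has much higher mutual information than a pair that varies nearly independently. Note that raw mutual information also picks up correlations arising purely from shared ancestry, so it should not be read as direct evidence of a structural contact.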
