(Re)Discovering Protein Structure and Function Through Language Modeling

In our study, we show how a Transformer language model, trained simply to predict a masked (hidden) amino acid in a protein sequence, recovers high-level structural and functional properties of proteins through its attention mechanism. We demonstrate that attention (1) captures the folding structure of proteins, connecting regions that are far apart in the underlying sequence but spatially close in the protein structure, and (2) targets binding sites, a key functional component of proteins. We also introduce a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with biological processes and provide a tool to aid scientific discovery.

Proteins are complex molecules that play a critical functional and structural role in all forms of life on this planet. The study of proteins has led to many advances in disease therapies, and applying machine learning to proteins could have far-reaching consequences in medicine and beyond.
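The claim that attention connects residues that are distant in sequence but close in space can be made concrete with a small numerical check. The sketch below is an illustration only, not the study's own analysis code: it assumes one attention head is available as an L x L NumPy array (row i holds the attention from residue i to every residue) and that a boolean L x L contact map has already been derived from a known structure. The function name `attention_contact_agreement` and its parameters `min_weight` and `min_separation` are hypothetical.

```python
import numpy as np


def attention_contact_agreement(attention, contacts, min_weight=0.0, min_separation=0):
    """Return the share of attention weight that lands on residue pairs in contact.

    attention      : (L, L) array; row i is the attention from residue i.
    contacts       : (L, L) boolean array; True where two residues are spatially close.
    min_weight     : ignore attention weights at or below this value (assumed knob).
    min_separation : only count pairs at least this far apart in the sequence.
    """
    attention = np.asarray(attention, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    if attention.ndim != 2 or attention.shape[0] != attention.shape[1]:
        raise ValueError("attention must be a square 2-D array")
    if attention.shape != contacts.shape:
        raise ValueError("attention and contacts must have the same shape")

    # Sequence distance |i - j| for every residue pair.
    idx = np.arange(attention.shape[0])
    separation = np.abs(idx[:, None] - idx[None, :])

    # Pairs that pass both the weight and the sequence-separation filter.
    keep = (attention > min_weight) & (separation >= min_separation)
    total = attention[keep].sum()
    if total == 0:
        return 0.0
    return float(attention[keep & contacts].sum() / total)


# Toy data: four residues, with residues 0 and 3 in contact despite
# sitting at opposite ends of the sequence.
toy_attention = np.array([
    [0.0, 0.2, 0.0, 0.8],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.8, 0.0, 0.2, 0.0],
])
toy_contacts = np.zeros((4, 4), dtype=bool)
toy_contacts[0, 3] = toy_contacts[3, 0] = True

overall = attention_contact_agreement(toy_attention, toy_contacts)
long_range = attention_contact_agreement(toy_attention, toy_contacts, min_separation=2)
print(overall, long_range)
```

Raising `min_separation` restricts the score to long-range pairs, which is the case of interest here: a head that mostly attends to sequence neighbours would score low, while a head that tracks the fold would score high.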