Proteins may be small and unassuming, but these molecules are essential for a variety of biological functions in all living organisms, including digestion, immune response and even intracellular communication. Proteins consist of long chains of smaller organic compounds called amino acids, and the different functions of various proteins are determined by the way these chains fold up in three-dimensional space. Not surprisingly, the folded structures of these protein chains can get immensely complex, and scientists have yet to fully figure out how and why certain proteins fold the way they do, or how diseases like Alzheimer's might arise when they misfold. Modern technologies such as cryo-electron microscopy, nuclear magnetic resonance and X-ray crystallography can help us understand protein folding a little better, but they are time-consuming and costly. Accurately predicting the folded structures of proteins could be the key to unlocking many medical mysteries, and thanks to recent developments in integrating artificial intelligence into computational biology, that slow process may very well be accelerated -- allowing us to discover or even design new and useful proteins.
Composed of long chains of amino acids, proteins perform these myriad tasks by folding themselves into precise 3D structures that govern how they interact with other molecules. Because a protein's shape determines its function and the extent of its dysfunction in disease, efforts to illuminate protein structures are central to all of molecular biology -- and in particular, to therapeutic science and the development of lifesaving and life-altering medicines. In recent years, computational methods have made significant strides in predicting how proteins fold from knowledge of their amino acid sequence. If fully realized, these methods have the potential to transform virtually all facets of biomedical research. Current approaches, however, are limited in the scale and scope of the proteins whose structures can be determined.
Predicting protein structures from their sequences would aid drug design. Credit: Edward Kinsman/Science Photo Library

The race to crack one of biology's grandest challenges -- predicting the 3D structures of proteins from their amino-acid sequences -- is intensifying, thanks to new artificial-intelligence (AI) approaches. At the end of last year, Google's AI firm DeepMind debuted an algorithm called AlphaFold, which combined two techniques that were emerging in the field and beat established contenders in a competition on protein-structure prediction by a surprising margin. And in April this year, a US researcher revealed an algorithm that uses a totally different approach. He claims his AI is up to one million times faster at predicting structures than DeepMind's, although probably not as accurate in all situations. More broadly, biologists are wondering how else deep learning -- the AI technique used by both approaches -- might be applied to the prediction of protein arrangements, which ultimately dictate a protein's function.
Within computational biology, algorithms are constructed with the aim of extracting knowledge from biological data, in particular, data generated by the large genome projects, where gene and protein sequences are produced in high volume. In this article, we explore new ways of representing protein-sequence information using machine learning strategies, where the primary goal is the discovery of novel, powerful representations for use in AI techniques. In the case of proteins and the 20 different amino acids they typically contain, a secondary goal is to discover how the current selection of amino acids -- those now common in proteins -- might have emerged from simpler selections, or alphabets, in use earlier during the evolution of living organisms.

Protein sequences are constructed from this alphabet of 20 amino acids, and most proteins with a sequence length of 200 amino acids or more contain all 20, albeit with large differences in frequency. Some amino acids are very common, but others are rare. A key problem when constructing computational methods for analysis of protein data is how to represent the sequence information (Baldi and Brunak 2001). The literature contains many different examples of how to deal with the fact that the 20 amino acids are related to one another in terms of biochemical properties -- very much in analogy to natural language alphabets, where two vowels might be more "similar" than any vowel-consonant pair, for example, when constructing speech-synthesis algorithms. In this article, we do not attempt to cover all approaches to representing protein sequences computationally, but restrict the review to recent developments in the area of amino acid subalphabets, where the idea is to discover groups of amino acids that can be lumped together, thus giving rise to alphabets with fewer than 20 symbols.
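As an illustration, a subalphabet reencoding can be sketched in a few lines. The grouping below is a hypothetical four-class reduction (hydrophobic, polar/small, negatively charged, positively charged); actual subalphabets proposed in the literature differ in both the number of groups and their membership.

```python
# Hypothetical reduced alphabet: the 20 standard amino acids (one-letter
# codes) lumped into four biochemical classes. Groupings vary across the
# literature; this one is only illustrative.
GROUPS = {
    "AVLIMFWYC": "h",  # hydrophobic
    "STNQGP":    "p",  # polar / small
    "DE":        "n",  # negatively charged
    "KRH":       "c",  # positively charged
}
# Invert to a per-residue lookup table covering all 20 amino acids.
REDUCE = {aa: sym for group, sym in GROUPS.items() for aa in group}

def reencode(seq: str) -> str:
    """Rewrite a protein sequence in the reduced (4-symbol) alphabet."""
    return "".join(REDUCE[aa] for aa in seq)

print(reencode("MKTAYIAKQR"))  # -> hcphhhhcpc
```

The reencoded string can then be fed to a downstream classifier in place of the original sequence, shrinking the input alphabet from 20 symbols to 4.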
These subalphabets can then be used to rewrite, or reencode, the original protein sequence, hopefully giving rise to better performance of an AI algorithm designed to detect a particular functional feature when receiving the simplified input. The idea is completely general, and similar approaches might be relevant in other symbol-sequence data domains, for example, in natural language processing. It should be mentioned that alphabet expansion can in some cases also be advantageous, that is, rewriting sequences in expanded, longer alphabets whose symbols cover more than one original symbol, thus encoding significant correlations between individual symbols directly into the rewritten sequence. For example, deoxyribonucleic acid (DNA) sequences contain four different nucleotides (ACGT), but a rewrite as dinucleotides (AA, AC, AG, ...) or trinucleotides (AAA, AAC, AAG, ...) might lead to a DNA representation where functional patterns are easier to detect by machine learning algorithms.
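The expansion in the other direction can be sketched just as briefly: a minimal k-mer rewrite that turns a DNA string into overlapping dinucleotides (k=2) or trinucleotides (k=3). This is a generic sketch, not any specific published encoding.

```python
def expand(seq: str, k: int = 2) -> list[str]:
    """Rewrite a DNA sequence as overlapping k-mers,
    e.g. dinucleotides for k=2 or trinucleotides for k=3."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(expand("ACGTA", 2))  # -> ['AC', 'CG', 'GT', 'TA']
print(expand("ACGTA", 3))  # -> ['ACG', 'CGT', 'GTA']
```

With k=2 the alphabet grows from 4 symbols to 16, and with k=3 to 64, so each position of the rewritten sequence now carries local correlation information directly.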