New deep-learning approach predicts protein structure from amino acid sequence

#artificialintelligence

Composed of long chains of amino acids, proteins perform their myriad cellular tasks by folding themselves into precise 3D structures that govern how they interact with other molecules. Because a protein's shape determines its function and the extent of its dysfunction in disease, efforts to illuminate protein structures are central to all of molecular biology -- and in particular, therapeutic science and the development of lifesaving and life-altering medicines. In recent years, computational methods have made significant strides in predicting how proteins fold based on knowledge of their amino acid sequence. If fully realized, these methods have the potential to transform virtually all facets of biomedical research. Current approaches, however, are limited in the scale and scope of the protein structures they can determine.


AI protein-folding algorithms solve structures faster than ever

#artificialintelligence

Predicting protein structures from their sequences would aid drug design. (Credit: Edward Kinsman/Science Photo Library)

The race to crack one of biology's grandest challenges -- predicting the 3D structures of proteins from their amino-acid sequences -- is intensifying, thanks to new artificial-intelligence (AI) approaches. At the end of last year, Google's AI firm DeepMind debuted an algorithm called AlphaFold, which combined two techniques that were emerging in the field and beat established contenders in a competition on protein-structure prediction by a surprising margin. And in April this year, a US researcher revealed an algorithm that uses a totally different approach. He claims his AI is up to one million times faster at predicting structures than DeepMind's, although probably not as accurate in all situations. More broadly, biologists are wondering how else deep learning -- the AI technique used by both approaches -- might be applied to the prediction of protein structures, which ultimately dictate a protein's function.


Representation of Protein-Sequence Information by Amino Acid Subalphabets

AI Magazine

Within computational biology, algorithms are constructed with the aim of extracting knowledge from biological data, in particular data generated by the large genome projects, where gene and protein sequences are produced in high volume. In this article, we explore new ways of representing protein-sequence information using machine learning strategies, where the primary goal is the discovery of novel, powerful representations for use in AI techniques. In the case of proteins and the 20 different amino acids they typically contain, it is also a secondary goal to discover how the current selection of amino acids -- which now are common in proteins -- might have emerged from simpler selections, or alphabets, in use earlier during the evolution of living organisms.

Protein sequences are constructed from this alphabet of 20 amino acids, and most proteins with a sequence length of 200 amino acids or more contain all 20, albeit with large differences in frequency. Some amino acids are very common, but others are rare. A key problem when constructing computational methods for analysis of protein data is how to represent the sequence information (Baldi and Brunak 2001). The literature contains many different examples of how to deal with the fact that the 20 amino acids are related to one another in terms of biochemical properties -- very much in analogy to natural language alphabets, where two vowels might be more "similar" than any vowel-consonant pair, for example, when constructing speech-synthesis algorithms.

In this article, we do not cover all attempts to represent protein sequences computationally but restrict the review to recent developments in the area of amino acid subalphabets, where the idea is to discover groups of amino acids that can be lumped together, thus giving rise to alphabets with fewer than 20 symbols. These subalphabets can then be used to rewrite, or reencode, the original protein sequence, hopefully giving rise to better performance of an AI algorithm designed to detect a particular functional feature when receiving the simplified input. The idea is completely general, and similar approaches might be relevant in other symbol-sequence data domains, for example, in natural language processing. It should be mentioned that alphabet expansion can in some cases also be advantageous, that is, rewriting sequences in expanded, longer alphabets covering more than one symbol, thus encoding significant correlations between individual symbols directly into the rewritten sequence. For example, deoxyribonucleic acid (DNA) sequences contain four different nucleotides (ACGT), but a rewrite as dinucleotides (AA, AC, AG, ...) or trinucleotides (AAA, AAC, AAG, ...) might lead to a DNA representation where functional patterns are easier to detect by machine learning algorithms.
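
To make the two representational moves concrete, here is a minimal Python sketch that reencodes a protein sequence with a reduced alphabet and rewrites a DNA sequence in an expanded k-mer alphabet. The four-group subalphabet, the function names, and the example sequences are illustrative assumptions for this sketch, not the specific subalphabets derived in the article.

# Sketch: re-encoding sequences with a reduced (subalphabet) or expanded alphabet.
# The grouping below is a hypothetical illustration (hydrophobic / polar / charged /
# special), not one of the subalphabets discussed in the article.

HYPOTHETICAL_GROUPS = {
    "h": set("AVLIMFWYC"),   # broadly hydrophobic
    "p": set("STNQG"),       # polar or small
    "c": set("DEKRH"),       # charged
    "s": set("P"),           # structurally special (proline)
}

def reencode_with_subalphabet(protein_seq, groups):
    """Rewrite a protein sequence in a reduced alphabet: each amino acid is
    replaced by the symbol of the group it belongs to."""
    lookup = {aa: symbol for symbol, members in groups.items() for aa in members}
    return "".join(lookup.get(aa, "?") for aa in protein_seq.upper())

def expand_to_kmers(dna_seq, k=2):
    """Rewrite a DNA sequence in an expanded alphabet of overlapping k-mers
    (dinucleotides for k=2, trinucleotides for k=3)."""
    return [dna_seq[i:i + k] for i in range(len(dna_seq) - k + 1)]

if __name__ == "__main__":
    # Reduced-alphabet encoding of a short (made-up) protein fragment.
    print(reencode_with_subalphabet("MKTAYIAKQR", HYPOTHETICAL_GROUPS))
    # Dinucleotide rewrite of a short DNA fragment: ['AC', 'CG', 'GT', 'TA', 'AC']
    print(expand_to_kmers("ACGTAC", k=2))

Either rewritten form can then be fed to the downstream AI method in place of the raw sequence, which is the sense in which the idea is "completely general".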


Protein folding class predictor for SCOP: approach based on global descriptors

AAAI Conferences

This work demonstrates new techniques developed for the prediction of protein folding class in the context of the most comprehensive Structural Classification of Proteins (SCOP). The prediction method uses global descriptors of a protein in terms of the physical, chemical and structural properties of its constituent amino acids. Neural networks are used to combine these descriptors so as to discriminate members of a given folding class from members of all other classes. It is shown that a given amino acid property contributes very differently to the discrimination of different folding classes. This creates the possibility of finding an individual set of descriptors that works best on a particular folding class.
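
As a rough illustration of the general scheme, the Python sketch below computes simple, length-independent global descriptors from amino acid properties and trains one network per folding class to separate its members from all other proteins. The descriptor set (composition plus a hydrophobicity summary), the Kyte-Doolittle property scale, and the scikit-learn MLPClassifier are assumptions made for this sketch, not the descriptors or network architecture described in the paper.

# Sketch: global descriptors + one-vs-rest neural network per folding class.
# Descriptor choice, property scale and network shape are illustrative only.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Kyte-Doolittle hydrophobicity values, used here as one example property scale.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def global_descriptors(seq):
    """Length-independent descriptors: amino acid composition (20 values)
    plus the mean and standard deviation of a hydrophobicity profile."""
    seq = seq.upper()
    comp = np.array([seq.count(aa) / len(seq) for aa in AMINO_ACIDS])
    profile = np.array([KD.get(aa, 0.0) for aa in seq])
    return np.concatenate([comp, [profile.mean(), profile.std()]])

def train_class_discriminator(sequences, is_member):
    """One network per folding class: members of that class vs. all others.
    `sequences` is a list of protein strings, `is_member` a 0/1 label list."""
    X = np.vstack([global_descriptors(s) for s in sequences])
    y = np.asarray(is_member, dtype=int)
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    return clf.fit(X, y)

Training one such discriminator per SCOP class, each possibly with its own descriptor subset, is the sense in which an individual set of descriptors can be tuned to a particular folding class.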