Second, the messenger RNA is translated into amino acid residues. In eukaryotes, not all of the messenger RNA transcribed from the DNA is necessarily expressed as protein. After an mRNA sequence is transcribed from a DNA sequence, and before it is translated into amino acid residues, contiguous subsequences of the mRNA sequence are spliced out. The subsequences that are removed are called introns; the remaining subsequences, which are expressed as protein, are called exons. Singer and Berg (1991) discuss eukaryotic translation in detail. In this paper, an evolutionary computation technique, genetic programming, is shown to produce programs that can distinguish between exons and introns.
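The splice-then-translate sequence described above can be illustrated with a minimal sketch. It assumes that the intron positions are already known and are supplied as half-open (start, end) index pairs; the function names `splice` and `translate` and the short example sequence are hypothetical and chosen only for illustration. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def splice(pre_mrna, introns):
    """Remove introns, given as half-open (start, end) index pairs (assumed known)."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)


def translate(mrna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":  # stop codon
            break
        residues.append(residue)
    return "".join(residues)


# Hypothetical toy transcript: exon, intron, exon.
pre_mrna = "AUGGCC" + "GUAAGUUUCAG" + "UUUUAA"
mature = splice(pre_mrna, [(6, 17)])
protein = translate(mature)
```

The sketch takes the intron boundaries as input. Distinguishing exons from introns, which is the problem addressed in this paper, amounts to finding those boundaries when they are not given.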
Automated methods of machine learning may prove to be useful in discovering biologically meaningful information hidden in the rapidly growing databases of DNA sequences and protein sequences. Genetic programming is an extension of the genetic algorithm in which a population of computer programs is bred, over a series of generations, in order to solve a problem. Genetic programming is capable of evolving complicated problem-solving expressions of unspecified size and shape. Moreover, when automatically defined functions are added to genetic programming, it becomes capable of efficiently capturing and exploiting recurring sub-patterns. This chapter describes how genetic programming with automatically defined functions successfully evolved motifs for detecting the D-E-A-D box family of proteins and for detecting the manganese superoxide dismutase family. Both motifs were evolved without prespecifying their length. Both evolved motifs employed automatically defined functions to capture the repeated use of common subexpressions. When tested against the SWISS-PROT database of proteins, the two genetically evolved consensus motifs detect the two families as well as, or slightly better than, the comparable human-written motifs found in the PROSITE database.
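To make concrete what it means for a motif to detect a protein family, the following sketch converts a PROSITE-style pattern into a regular expression and scans a sequence with it. It handles only the basic pattern elements (single residues, `x`, `[...]`, `{...}`, repetition counts, and the `<` and `>` anchors). The pattern and sequence shown are invented for illustration and are not the PROSITE entry for either family; the helper names `prosite_to_regex` and `find_motif` are likewise hypothetical.

```python
import re

# One pattern element: optional '<', a core, an optional repeat count, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        n_term, core, low, high, c_term = match.groups()
        if core == "x":                # any residue
            core = "."
        elif core.startswith("{"):     # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_term else "") + core + repeat + ("$" if c_term else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start index, matched text) for each non-overlapping occurrence."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Hypothetical toy pattern and sequence, for illustration only.
toy_pattern = "[LIVM]-x(2)-D-E-A-D-{P}"
toy_sequence = "MKLTSDEADGQ"
hits = find_motif(toy_pattern, toy_sequence)
```

A motif of this kind has a fixed length and fixed structure chosen by its author. The evolved motifs described above differ in that their length is not prespecified and in that common subexpressions can be factored into automatically defined functions.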
One of the most fundamental problems in molecular biology is the prediction of tertiary structure from primary structure: the protein folding problem. The goal in the protein folding problem is to predict one feature of a folded protein (the 3D coordinates of its backbone atoms) from another feature (the sequence of amino acid residues that make up the protein). The protein folding problem is of enormous practical importance because the latter feature (the primary structure) is much easier to establish than the former (the tertiary structure). A related problem is the buriedness problem: the prediction of the degree of exposure to the solvent (the buriedness) of each amino acid residue in a folded protein. Some amino acid residues will have a buriedness of 0%: these are in the core of the protein and are likely hydrophobic. Other residues will have a buriedness of 100%: these are on the surface of the protein and are probably hydrophilic. The buriedness problem is interesting because it is a simplified version of the protein folding problem. In this paper I will show that genetic programming (Koza 1992; Koza 1994) finds programs that predict the buriedness of residues. These programs work better than would be expected of randomly generated programs, and there is very little externally imposed bias toward any particular program size, shape (architecture), or composition.
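As a rough illustration of what a program that predicts buriedness might look like, the sketch below represents a candidate program as an expression tree and scores it by its mean absolute error against observed values. Everything specific here is an assumption made for illustration and is not taken from this paper: the function set (`+`, `-`, `*`), the terminals (`prev`, `here`, `next`, giving Kyte-Doolittle hydropathy values in a three-residue window), the example program, and the toy "observed" values. Buriedness follows the convention used above, where 0% means the core and 100% means the surface.

```python
import operator

# Kyte-Doolittle hydropathy values for a handful of residues (subset only).
HYDROPATHY = {"I": 4.5, "L": 3.8, "A": 1.8, "G": -0.4, "S": -0.8, "D": -3.5, "K": -3.9}

# Hypothetical function set for the program trees.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(program, window):
    """Evaluate a tree: (op, left, right) tuple, terminal name, or numeric constant."""
    if isinstance(program, tuple):
        op, left, right = program
        return FUNCTIONS[op](evaluate(left, window), evaluate(right, window))
    if isinstance(program, str):
        return window[program]  # terminal: hydropathy at a window position
    return program              # numeric constant


def windows(sequence):
    """Yield one three-residue hydropathy window per interior residue."""
    for i in range(1, len(sequence) - 1):
        yield {
            "prev": HYDROPATHY[sequence[i - 1]],
            "here": HYDROPATHY[sequence[i]],
            "next": HYDROPATHY[sequence[i + 1]],
        }


def fitness(program, sequence, observed):
    """Mean absolute error between predicted and observed buriedness (lower is better)."""
    errors = [
        abs(evaluate(program, window) - target)
        for window, target in zip(windows(sequence), observed)
    ]
    return sum(errors) / len(errors)


# Hypothetical hand-written program: buriedness = 50 + (-10 * hydropathy of this residue).
program = ("+", 50.0, ("*", -10.0, "here"))

# Invented toy values for the three interior residues of "GILKD"; not real measurements.
toy_observed = [0.0, 20.0, 100.0]
error = fitness(program, "GILKD", toy_observed)
```

In a genetic programming run, trees like `program` would be generated at random and then bred over generations, with a fitness measure of this general kind deciding which programs reproduce. The sketch shows only the representation and the scoring, not the evolutionary loop.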