Second, the messenger RNA is translated into amino acid residues. In eukaryotes, not all of the messenger RNA transcribed from the DNA is necessarily expressed as protein. After an mRNA sequence is transcribed from a DNA sequence, and before it is translated into amino acid residues, contiguous subsequences of the mRNA sequence are spliced out. The subsequences that are removed are called introns; the subsequences that remain between them and are expressed as protein are called exons. Singer and Berg (Singer & Berg 1991) discuss eukaryotic translation in detail. In this paper, an evolutionary computation technique, genetic programming, is shown to produce programs that can distinguish between exons and introns.
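The splicing step described above can be illustrated with a minimal sketch. It assumes the intron positions are already known and are given as 0-based, half-open intervals; the function name `splice` and the toy transcript are hypothetical and are not taken from the paper, which addresses the harder problem of telling exons and introns apart.

```python
def splice(pre_mrna, introns):
    """Remove intron intervals (0-based, half-open) and join the exons."""
    exons = []
    pos = 0
    for start, end in sorted(introns):
        # Reject intervals that overlap, run backwards, or leave the sequence.
        if start < pos or end < start or end > len(pre_mrna):
            raise ValueError("overlapping or out-of-range intron")
        exons.append(pre_mrna[pos:start])  # exon preceding this intron
        pos = end
    exons.append(pre_mrna[pos:])  # exon after the last intron
    return "".join(exons)


# Made-up transcript: exon, intron, exon.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "UUUUAA"
mature = splice(pre_mrna, [(6, 18)])
print(mature)  # AUGGCCUUUUAA
```

Only the exon segments survive in `mature`; a classifier such as the evolved programs mentioned above would be needed to find the intervals in the first place.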
Automated methods of machine learning may prove to be useful in discovering biologically meaningful information hidden in the rapidly growing databases of DNA sequences and protein sequences. Genetic programming is an extension of the genetic algorithm in which a population of computer programs is bred, over a series of generations, in order to solve a problem. Genetic programming is capable of evolving complicated problem-solving expressions of unspecified size and shape. Moreover, when automatically defined functions are added, genetic programming becomes capable of efficiently capturing and exploiting recurring sub-patterns. This chapter describes how genetic programming with automatically defined functions successfully evolved motifs for detecting the D-E-A-D box family of proteins and for detecting the manganese superoxide dismutase family. Both motifs were evolved without prespecifying their length. Both evolved motifs employed automatically defined functions to capture the repeated use of common subexpressions. When tested against the SWISS-PROT database of proteins, the two genetically evolved consensus motifs detect the two families as well as, or slightly better than, the comparable human-written motifs found in the PROSITE database.
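To make concrete what a consensus motif of the PROSITE kind does, the sketch below converts a simplified subset of PROSITE pattern syntax into a regular expression and scans a sequence with it. The helper name `prosite_to_regex`, the toy pattern, and the test sequences are all hypothetical: the pattern is not the actual PROSITE entry and not the evolved motif, and the sketch shows only motif matching, not genetic programming. Anchors (`<`, `>`) and other less common syntax are deliberately left out.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a compiled regex.

    Handled: '-' separators, 'x' (any residue), [..] allowed residues,
    {..} forbidden residues, and (n) or (n,m) repetition counts.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden set
        element = element.replace("(", "{").replace(")", "}")   # repetition
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return re.compile("".join(parts))


# Toy pattern, made up for illustration only.
motif = prosite_to_regex("[LIVM](2)-D-E-A-D-x-{P}")

hit = motif.search("MKLIDEADKAG")
print(hit.group(0) if hit else None)   # LIDEADKA
print(motif.search("MKLIDEADKPG"))     # None: P is forbidden at the last position
```

A hand-written motif of this kind has a fixed length and a fixed form; the evolved motifs described above were found without prespecifying the length.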
We apply HMMs to the problem of modeling exons and introns and detecting splice sites in the human genome. Our most interesting result so far is the detection of particular oscillatory patterns, with a minimal period of roughly 10 nucleotides, that seem to be characteristic of exon regions and may have significant biological implications.
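As a rough illustration of how an HMM can label a sequence as exon or intron, here is a minimal two-state Viterbi decoder. Everything in it is an assumption made for the example: the abstract does not describe the model architecture, the state set and all probabilities below are invented, and this toy model does not capture the period-10 oscillation reported above.

```python
import math

# Hypothetical two-state model; all probabilities are made up.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


labels = viterbi("GGGGGGGGGGAAAAAAAAAA")
print(labels)
```

With these invented parameters the G-rich half is labeled `exon` and the A-rich half `intron`; a model trained on real genomic data would use many more states and learned probabilities.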