Protein motifs can be classified into four categories. Sequence motifs are linear strings of residue identifiers with an implicit topological ordering. Sequence-structure motifs are sequence motifs with predefined secondary structural elements attached to one or more residues in the motif; the sequence is assumed to be predictive of the associated structure. Structure motifs are 3D structural objects, described by the positions of residue objects in 3D Euclidean space.
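The motif categories described above can be sketched as simple data structures. The following Python sketch is illustrative only: the class names, the field names, and the one-letter secondary-structure labels ('H' helix, 'E' strand, 'C' coil) are assumptions introduced here, not part of the original classification.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SequenceMotif:
    # Linear string of one-letter residue identifiers; the string order
    # is the implicit topological ordering.
    residues: str


@dataclass
class SequenceStructureMotif:
    # A sequence motif with secondary-structure labels attached to some
    # residues (index into `residues` -> assumed label 'H', 'E' or 'C').
    residues: str
    secondary_structure: Dict[int, str]


@dataclass
class StructureMotif:
    # Residues placed in 3D Euclidean space, one coordinate per residue.
    residues: str
    coords: List[Tuple[float, float, float]]

    def distance(self, i: int, j: int) -> float:
        """Euclidean distance between residues i and j."""
        return math.dist(self.coords[i], self.coords[j])


# Hypothetical example values, chosen only to exercise the classes.
seq = SequenceMotif("GKS")
seq_struct = SequenceStructureMotif("GKS", {0: "C", 1: "H", 2: "H"})
struct = StructureMotif("GK", [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)])
```

In this sketch a sequence motif carries no geometry at all, whereas a structure motif is queried through spatial relations such as `struct.distance(0, 1)`.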
In the analysis, the molecular scene is reconstructed and interpreted in an iterative procedure that proceeds from an initially low-resolution, uninterpreted image to a fully interpreted high-resolution map. Accomplishing the goals of molecular scene analysis, however, requires representing protein structures in a knowledge base that can be easily accessed to retrieve general and specific properties of protein structure at different levels of abstraction (amino acid, secondary structure, molecule, etc.).
Finding patterns, or motifs, in protein sequences involves two essential steps: assembling a training set of sequences with common structure or function and then analyzing the training set for regions of conserved amino acid residues. Hence, the resulting motif depends critically on the training set used. In fact, there is a fundamental relationship between the coherency of a training set and the specificity of a motif. When a training set contains incoherent sequences, a motif must become less specific in order to describe the entire set. A training set should ideally contain a representative sample from a coherent class of proteins, but obtaining a coherent set is complicated by several characteristics of protein sequence data:
1. Protein classes may contain subclasses. Each subclass may have a specific motif, whereas the entire class may have no motif or only a very general one.
2. The training set may be contaminated. Structural or functional evidence for including a sequence in a training set is often imprecise, so some sequences in the training set may not belong with the others.
This work was supported in part by grants LM05716 and LM07033 from the National Library of Medicine.
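The relationship between training-set coherency and motif specificity can be illustrated with a toy example. The sketch below is not the method of the paper: it assumes the sequences are already aligned and of equal length, derives a motif as the set of residues observed in each column, and uses the fraction of fully conserved columns as a crude, assumed measure of specificity. The sequences and function names are invented for illustration.

```python
from typing import FrozenSet, List


def column_motif(aligned: List[str]) -> List[FrozenSet[str]]:
    """Residues observed at each column of equal-length, pre-aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [frozenset(column) for column in zip(*aligned)]


def to_pattern(motif: List[FrozenSet[str]]) -> str:
    """Render the motif as a pattern: one letter, or [..] for alternatives."""
    parts = []
    for column in motif:
        if len(column) == 1:
            parts.append(next(iter(column)))
        else:
            parts.append("[" + "".join(sorted(column)) + "]")
    return "".join(parts)


def specificity(motif: List[FrozenSet[str]]) -> float:
    """Fraction of columns that are fully conserved (assumed toy measure)."""
    return sum(len(column) == 1 for column in motif) / len(motif)


# Hypothetical training sets: a coherent one, and the same set
# contaminated with one sequence that does not belong.
coherent = ["GKSTL", "GKSTL", "GKSSL"]
contaminated = coherent + ["AKDQV"]

coherent_motif = column_motif(coherent)
contaminated_motif = column_motif(contaminated)

print(to_pattern(coherent_motif), specificity(coherent_motif))
print(to_pattern(contaminated_motif), specificity(contaminated_motif))
```

With these inputs the coherent set yields the pattern `GKS[ST]L` (specificity 0.8), while the contaminated set yields `[AG]K[DS][QST][LV]` (specificity 0.2): a single stray sequence forces the motif to generalize at almost every position.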
Nucleic acid and protein sequence motifs are popular research objects among computational biologists for several reasons. Machine-readable motif descriptions can be used for automatic structure and function prediction. The exercise of defining a motif may provide insights into the molecular mechanisms of gene expression, from transcriptional activation via RNA processing and protein folding to physiological activity. Finally, there is exciting potential for synergy with other fields such as speech recognition, exemplified by dynamic programming algorithms and hidden Markov models. The concept of a sequence motif itself evades exact definition. It necessarily implies some kind of structured similarity but may have functional aspects too. In the biological literature, the term motif often refers to short regions of sequence similarity. Here, it is used in a broader sense that also encompasses larger objects such as protein families.
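As one illustration of a machine-readable motif description, the sketch below translates a small subset of a PROSITE-style pattern syntax into a Python regular expression and scans a sequence with it. The choice of PROSITE-style notation is an assumption made here for illustration, only a subset of the syntax is handled (`x`, single residues, `[..]` sets, `{..}` exclusions, and `(n)` or `(n,m)` repeats), and the pattern and sequences are invented rather than taken from any database.

```python
import re
from typing import List, Tuple

# One pattern element: a core (x, a residue, [set] or {exclusion})
# followed by an optional repeat count (n) or (n,m).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def pattern_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (subset only) into a regex string."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(core)
    return "".join(pieces)


def scan(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (start, matched text) for non-overlapping matches."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Hypothetical pattern and sequences, for illustration only.
MOTIF = "G-x(2)-[ST]-{P}-K"
hits = scan(MOTIF, "MAGAATLKQ")
no_hits = scan(MOTIF, "MAGAATPKQ")
```

Here `pattern_to_regex(MOTIF)` gives `G.{2}[ST][^P]K`; the first sequence matches at position 2 (`GAATLK`), and the second does not match because a proline occupies the excluded position. Pattern matching of this kind covers only short, rigid motifs; broader objects such as protein families typically call for probabilistic descriptions such as the hidden Markov models mentioned above.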