Having fragments in the profile ordered by size is a convention as well as a major convenience, as we shall see. Let $\alpha$ be the abstraction function that transforms a restriction map s into a restriction profile by sorting its entries; $\alpha^{-1}(p)$ assigns to profile p the set of maps obtained by arbitrary permutations of p.
Definition 2 (fragment length score) The score function fl is defined by $fl(x, y) = |x - y|$. It is called the fragment length score. For comparing fragment lengths, this seems a natural, albeit simplistic, definition. Note that $fl(x, 0) = x$, i.e. a missing fragment is scored according to its length.
Definition 3 (fragment length distance) Let s and t be both either restriction maps or restriction profiles. The fragment length distance of s and t
Footnote 1: (Kim et al. 1996) reserves the term restriction pattern for (unordered) multisets of fragment lengths.
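The abstraction function and the fragment length score defined above can be illustrated with a minimal Python sketch. The function names `alpha`, `alpha_inverse`, and `fl`, the ascending sort order, and the example fragment lengths are assumptions made for illustration; the text only states that fragments are ordered by size. The sketch covers Definition 2 only and does not attempt the fragment length distance.

```python
from itertools import permutations


def alpha(restriction_map):
    """Abstraction function: turn a map into a profile by sorting its entries.

    Ascending order is an assumption; the text only says "ordered by size".
    """
    return tuple(sorted(restriction_map))


def alpha_inverse(profile):
    """Set of all maps obtained by arbitrary permutations of the profile."""
    return set(permutations(profile))


def fl(x, y):
    """Fragment length score: absolute difference of two fragment lengths."""
    return abs(x - y)


# Hypothetical fragment lengths, chosen only for illustration.
example_map = (300, 120, 450)
example_profile = alpha(example_map)
candidate_maps = alpha_inverse(example_profile)

print(example_profile)
print(len(candidate_maps))
print(fl(300, 120), fl(450, 0))
```

Using a set in `alpha_inverse` means that profiles with repeated fragment lengths yield fewer distinct maps than the factorial of the profile length, which matches reading the inverse image as a set of maps.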
The tool consists of the following main parts: 1) An integrated database for genomic regulatory sequences. The integrated database was designed on the basis of the databases TRANSFAC (Wingender 1994) and TRRD (Kel et al. 1995), which are currently under development. The following functions are performed: i) linkage to the EMBL database; ii) preparing samples of definite types of functional sites with their flanking sequences; iii) preparing samples of promoter sequences; iv) preparing samples of transcription factors classified with regard to structural and functional features of DNA-binding and activating domains, functional families of the factors, their tissue specificity and other functional features; v) access to data on the mutual disposition of cis-elements within the regulatory regions.
Screening for potential ligands and docking them into the binding sites of proteins is one of the main tasks in computer-aided drug design. Despite the progress in computational power, it remains infeasible to model all the factors involved in molecular recognition, especially when screening databases of more than 100,000 compounds. While ligand flexibility is considered in most approaches, the model of the binding site is rather simplistic, with neither solvation nor induced complementarity usually taken into consideration. We present results for screening different databases for HIV-1 protease ligands with our tool Slide, and investigate the extent to which binding-site conformation, solvation, and template representation generate bias. The results suggest a strategy for selecting the optimal binding-site conformation, for cases in which more than one independent structure is available, and for selecting a representation of that binding site that yields reproducible results and the identification of known ligands.
Gene duplication events have played a major role in the evolution of the human genome (Miklos and Rubin 1996). Genes related in this way are called paralogous; groups of these paralogs form superfamilies of related genes. Each duplication event allows a freeing of functional constraints on one copy, so that over time and large evolutionary distances, a plethora of functions and structures can evolve from a single ancestor gene. To the protein sequence analyst, these superfamilies contain a wealth of hidden information, and pose a multitude of questions. How did this family evolve? What was the ancestor protein like? What was its original function? Within this large superfamily are there subgroups defined by common functions or other attributes? If the proteins interact with other molecules, can we identify
Copyright (c) 1998, American Association for Artificial Intelligence (www.aaai.org).
Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have focused only on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion, is a new prediction problem that has never been attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph-theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs.
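To make the role of the matroid property concrete, the following is a generic sketch of the textbook greedy algorithm for selecting a maximum-weight independent set in a matroid. It is not the method of the paper: the helper names `greedy_max_weight` and `each_residue_once`, the residue and ion labels, the scores, and the toy independence rule (each residue is assigned to at most one ion) are all hypothetical and chosen only for illustration.

```python
def greedy_max_weight(elements, weight, is_independent):
    """Generic matroid greedy: scan elements by decreasing weight and keep
    each one that leaves the chosen set independent."""
    chosen = []
    for element in sorted(elements, key=weight, reverse=True):
        if weight(element) <= 0:
            break  # non-positive weights cannot improve the total
        if is_independent(chosen + [element]):
            chosen.append(element)
    return chosen


def each_residue_once(pairs):
    """Toy independence oracle (an assumption, not the paper's definition):
    a set of (residue, ion) pairs is independent if no residue repeats."""
    residues = [residue for residue, _ in pairs]
    return len(residues) == len(set(residues))


# Hypothetical scores for candidate residue-ion assignments.
scores = {
    ("C12", "ion1"): 0.9,
    ("C12", "ion2"): 0.4,
    ("H45", "ion1"): 0.7,
    ("H45", "ion2"): 0.8,
    ("D60", "ion2"): -0.1,
}

solution = greedy_max_weight(scores, scores.get, each_residue_once)
print(solution)
```

The point of the matroid structure is that this simple greedy scan is guaranteed to return a maximum-weight independent set, so no search or backtracking over assignments is needed.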