Representation of Protein-Sequence Information by Amino Acid Subalphabets
Andersen, Claus A. F., Brunak, Soren
Within computational biology, algorithms are constructed with the aim of extracting knowledge from biological data, in particular, data generated by the large genome projects, where gene and protein sequences are produced in high volume. In this article, we explore new ways of representing protein-sequence information, using machine learning strategies, where the primary goal is the discovery of novel powerful representations for use in AI techniques. In the case of proteins and the 20 different amino acids they typically contain, it is also a secondary goal to discover how the current selection of amino acids -- which now are common in proteins -- might have emerged from simpler selections, or alphabets, in use earlier during the evolution of living organisms.
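The idea of a subalphabet can be sketched as a mapping from the 20 amino acids onto a smaller set of symbols. The three-group partition below (`GROUPS`), the group labels, and the function name `reduce_sequence` are assumptions made purely for illustration; they are not the subalphabets found in the article.

```python
# Illustrative grouping only: an assumption of this sketch, not the article's result.
GROUPS = {
    "h": "AVLIMFWYC",  # broadly hydrophobic residues
    "p": "STNQGPH",    # broadly polar residues
    "c": "DEKR",       # charged residues
}

# Invert the grouping into a per-residue lookup table.
RESIDUE_TO_GROUP = {
    residue: label for label, members in GROUPS.items() for residue in members
}


def reduce_sequence(sequence, unknown="x"):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard one-letter codes map to `unknown`.
    """
    return "".join(RESIDUE_TO_GROUP.get(res, unknown) for res in sequence.upper())


if __name__ == "__main__":
    print(reduce_sequence("MKTAY"))
```

A learning method would then be trained on the reduced strings instead of the full 20-letter sequences, and different partitions could be compared by how well the resulting models predict.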
Applications of Case-Based Reasoning in Molecular Biology
Jurisica, Igor, Glasgow, Janice
Thus, one of the primary goals of a CBR system is to find the most similar, or most relevant, cases for new input problems. The effectiveness of CBR depends on the quality and quantity of cases in a case base. In some domains, even a small number of cases provide good solutions, but in other domains, an increased number of unique cases improves the problem-solving capabilities of CBR systems because there are more experiences to draw on. The reader can find detailed descriptions of the CBR process and systems in Kolodner (1993). ... are presented in Leake (1996), and practically oriented descriptions of CBR can be ... complete theories, and rapid evolution; reasoning is often based on experience rather ... Experts remember positive experiences for possible reuse of solutions; negative experiences are used to avoid potentially unsuccessful outcomes.
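Retrieving the most similar case can be sketched as a ranking of the case base by a similarity score. The attribute-overlap measure, the case layout (`problem` and `solution` keys), and the example attributes below are hypothetical choices for this sketch; real CBR systems typically use domain-specific similarity measures.

```python
def similarity(a, b):
    """Fraction of attributes (over the union of keys) on which two problems agree."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)


def retrieve(case_base, problem, k=1):
    """Return the k stored cases whose problem description is most similar."""
    ranked = sorted(
        case_base,
        key=lambda case: similarity(case["problem"], problem),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical cases, invented for illustration only.
    case_base = [
        {"problem": {"family": "kinase", "ph": "neutral"}, "solution": "plan A"},
        {"problem": {"family": "protease", "ph": "acidic"}, "solution": "plan B"},
    ]
    best = retrieve(case_base, {"family": "kinase", "ph": "acidic"})[0]
    print(best["solution"])
```

Adding more unique cases gives `retrieve` more candidates to rank, which is one way to read the claim that a larger case base improves problem solving in some domains.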
Using Machine Learning to Design and Interpret Gene-Expression Microarrays
Molla, Michael, Waddell, Michael, Page, David, Shavlik, Jude
Gene-expression microarrays, commonly called gene chips, make it possible to simultaneously measure the rate at which a cell or tissue is expressing -- translating into a protein -- each of its thousands of genes. One can use these comprehensive snapshots of biological activity to infer regulatory pathways in cells; identify novel targets for drug design; and improve the diagnosis, prognosis, and treatment planning for those suffering from disease. However, the amount of data this new technology produces is more than one can manually analyze. Hence, the need for automated analysis of microarray data offers an opportunity for machine learning to have a significant impact on biology and medicine. This article describes microarray technology, the data it produces, and the types of machine learning tasks that naturally arise with these data. It also reviews some of the recent prominent applications of machine learning to gene-chip data, points to related tasks where machine learning might have a further impact on biology and medicine, and describes additional types of interesting data that recent advances in biotechnology allow biomedical researchers to collect.
Annotating Protein Function through Lexical Analysis
We now know the full genomes of more than 60 organisms. The experimental characterization of the newly sequenced proteins is deemed to lag behind this explosion of naked sequences (the sequence-function gap). The rate at which expert annotators add the experimental information into more or less controlled vocabularies of databases snails along at an even slower pace. Most methods that annotate protein function exploit sequence similarity by transferring experimental information from homologues. A crucial development aiding such transfer is large-scale, work- and management-intensive projects aimed at developing a comprehensive ontology for gene-protein function, such as the Gene Ontology project. In parallel, fully automatic or semiautomatic methods have successfully begun to mine the existing data through lexical analysis. Some of these tools target parsing controlled vocabulary from databases; others venture into mining free text from MEDLINE abstracts or full scientific papers. Automated text analysis has become a rapidly expanding discipline in bioinformatics. A few of these tools have already been embedded in research projects.
Toward Automated Discovery in the Biological Sciences
Buchanan, Bruce G., Livingston, Gary R.
Knowledge discovery programs in the biological sciences require flexibility in the use of symbolic data and semantic information. Because of the volume of nonnumeric, as well as numeric, data, the programs must be able to explore a large space of possibly interesting relationships to discover those that are novel and interesting. Thus, the framework for the discovery program must facilitate proposing and selecting the next task to perform and performing the selected tasks. The framework we describe, called the agenda- and justification-based framework, has several properties that are desirable in semiautonomous discovery systems: It provides a mechanism for estimating the plausibility of tasks, it uses heuristics to propose and perform tasks, and it facilitates the encoding of general discovery strategies and the use of background knowledge. We have implemented the framework and our heuristics in a prototype program, HAMB, and have evaluated them in the domain of protein crystallization. Our results demonstrate that both reasons given for performing tasks and estimates of the interestingness of the concepts and hypotheses examined by HAMB contribute to its performance and that the program can discover novel, interesting relationships in biological data.
Applying Inductive Logic Programming to Predicting Gene Function
One of the fastest advancing areas of modern science is functional genomics. This science seeks to understand how the complete complement of molecular components of living organisms (nucleic acid, protein, small molecules, and so on) interact to form living organisms. Functional genomics is of interest to AI because the relationship between machines and living organisms is central to AI and because the field is an instructive and fun domain in which to apply and sharpen AI tools and ideas, requiring complex knowledge representation, reasoning, learning, and so on. This article describes two machine learning (inductive logic programming [ILP])-based approaches to the bioinformatic problem of predicting protein function from amino acid sequence. The first approach is based on using ILP as a way of bootstrapping from conventional sequence-based homology methods. The second approach used protein-functional ontologies to provide function classes and a hybrid ILP method to predict function directly from sequence. Both ILP approaches were successful in producing accurate prediction rules that could be interpreted biologically. The work was also of interest to machine learning research because it highlighted the flexibility of ILP systems in dealing with heterogeneous data, the importance of problems where classes are related hierarchically, and problems where examples have more than one functional class.
AI and Bioinformatics
Glasgow, Janice, Jurisica, Igor, Rost, Burkhard
Undoubtedly, bioinformatics is a truly interdisciplinary field: Although some researchers continuously affect wet labs in life science through collaborations or provision of tools, others are rooted in the theory departments of exact sciences (physics, chemistry, or engineering) or computer sciences. This wide variety creates many different perspectives and ... Michael Waddell, David Page, and Jude Shavlik ("Using Machine Learning to Design and Interpret Gene-Expression Microarrays") introduces some background information and provides a comprehensive description of how techniques from machine learning can be used to help understand this high-dimensional and prolific gene-expression data.
Distribution of Mutual Information from Complete and Incomplete Data
Hutter, Marcus, Zaffalon, Marco
Mutual information is widely used, in a descriptive way, to measure the stochastic dependence of categorical random variables. In order to address questions such as the reliability of the descriptive value, one must consider sample-to-population inferential approaches. This paper deals with the posterior distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean, and analytical approximations for the variance, skewness and kurtosis are derived. These approximations have a guaranteed accuracy level of the order O(1/n^3), where n is the sample size. Leading order approximations for the mean and the variance are derived in the case of incomplete samples. The derived analytical expressions allow the distribution of mutual information to be approximated reliably and quickly. In fact, the derived expressions can be computed with the same order of complexity needed for descriptive mutual information. This makes the distribution of mutual information become a concrete alternative to descriptive mutual information in many applications which would benefit from moving to the inductive side. Some of these prospective applications are discussed, and one of them, namely feature selection, is shown to perform significantly better when inductive mutual information is used.
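The contrast between descriptive mutual information and its posterior distribution can be sketched numerically. The code below computes the plug-in (descriptive) value from a contingency table and then approximates the posterior under a Dirichlet prior by Monte Carlo sampling. Sampling is a stand-in chosen for this sketch, not the paper's analytical expressions; the function names, the symmetric prior count `prior=1.0`, and the example table are assumptions.

```python
import math
import random


def mutual_information(table):
    """Plug-in mutual information (in nats) of a table of non-negative weights."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # cells with zero weight contribute nothing
                mi += (n_ij / total) * math.log(
                    n_ij * total / (row_sums[i] * col_sums[j])
                )
    return mi


def sample_posterior_mi(counts, prior=1.0, n_samples=1000, seed=0):
    """Draw mutual-information values from the Dirichlet posterior over cell chances.

    A Dirichlet draw is obtained by normalising independent gamma variates;
    mutual_information() normalises internally, so raw gamma draws suffice.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        draw = [[rng.gammavariate(n + prior, 1.0) for n in row] for row in counts]
        samples.append(mutual_information(draw))
    return samples


if __name__ == "__main__":
    counts = [[12, 3], [4, 11]]  # hypothetical 2x2 sample
    samples = sample_posterior_mi(counts)
    print("descriptive MI:", mutual_information(counts))
    print("posterior mean (Monte Carlo):", sum(samples) / len(samples))
```

The spread of `samples` indicates how reliable the single descriptive number is for a given sample size; the analytical approximations described in the abstract aim to give the same kind of information without sampling.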
A Personalized System for Conversational Recommendations
Thompson, C. A., Goker, M. H., Langley, P.
Searching for and making decisions about information is becoming increasingly difficult as the amount of information and number of choices increase. Recommendation systems help users find items of interest of a particular type, such as movies or restaurants, but are still somewhat awkward to use. Our solution is to take advantage of the complementary strengths of personalized recommendation systems and dialogue systems, creating personalized aides. We present a system -- the Adaptive Place Advisor -- that treats item selection as an interactive, conversational process, with the program inquiring about item attributes and the user responding. Individual, long-term user preferences are unobtrusively obtained in the course of normal recommendation dialogues and used to direct future conversations with the same user. We present a novel user model that influences both item search and the questions asked during a conversation. We demonstrate the effectiveness of our system in significantly reducing the time and number of interactions required to find a satisfactory item, as compared to a control group of users interacting with a non-adaptive version of the system.
Representation Dependence in Probabilistic Inference
Non-deductive reasoning systems are often representation dependent: representing the same situation in two different ways may cause such a system to return two different answers. Some have viewed this as a significant problem. For example, the principle of maximum entropy has been subjected to much criticism due to its representation dependence. There has, however, been almost no work investigating representation dependence. In this paper, we formalize this notion and show that it is not a problem specific to maximum entropy. In fact, we show that any representation-independent probabilistic inference procedure that ignores irrelevant information is essentially entailment, in a precise sense. Moreover, we show that representation independence is incompatible with even a weak default assumption of independence. We then show that invariance under a restricted class of representation changes can form a reasonable compromise between representation independence and other desiderata, and provide a construction of a family of inference procedures that provides such restricted representation independence, using relative entropy.
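Representation dependence of maximum entropy can be sketched with a small worked example. With no constraints, the maximum-entropy distribution over a finite set of possible worlds is uniform, so the answer to the same question changes when the worlds are described at a different granularity. The colour vocabulary and the function name below are assumptions made for illustration and are not taken from the abstract.

```python
from fractions import Fraction


def max_entropy_probability(worlds, event):
    """Probability of `event` under the unconstrained maximum-entropy distribution.

    With no constraints, maximum entropy over a finite set of worlds is uniform,
    so the probability is simply the fraction of worlds inside the event.
    """
    favourable = sum(1 for world in worlds if world in event)
    return Fraction(favourable, len(worlds))


# Two descriptions of the same situation: "is the object colourful?"
coarse_worlds = ["colourful", "colourless"]
fine_worlds = ["red", "blue", "green", "colourless"]

p_coarse = max_entropy_probability(coarse_worlds, {"colourful"})
p_fine = max_entropy_probability(fine_worlds, {"red", "blue", "green"})

if __name__ == "__main__":
    print(p_coarse, p_fine)
```

Under the coarse vocabulary the object is colourful with probability 1/2, while under the finer vocabulary the same event receives 3/4, which is the kind of disagreement the paper formalizes.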