Information Technology
Annotating Protein Function through Lexical Analysis
We now know the full genomes of more than 60 organisms. The experimental characterization of the newly sequenced proteins is deemed to lag behind this explosion of naked sequences (the sequence-function gap). The rate at which expert annotators add the experimental information into more or less controlled vocabularies of databases snails along at an even slower pace. Most methods that annotate protein function exploit sequence similarity by transferring experimental information from homologues. A crucial development aiding such transfer is large-scale, work- and management-intensive projects aimed at developing a comprehensive ontology for gene-protein function, such as the Gene Ontology project. In parallel, fully automatic or semiautomatic methods have successfully begun to mine the existing data through lexical analysis. Some of these tools target parsing controlled vocabulary from databases; others venture into mining free texts from MEDLINE abstracts or full scientific papers. Automated text analysis has become a rapidly expanding discipline in bioinformatics. A few of these tools have already been embedded in research projects.
Toward Automated Discovery in the Biological Sciences
Buchanan, Bruce G., Livingston, Gary R.
Knowledge discovery programs in the biological sciences require flexibility in the use of symbolic data and semantic information. Because of the volume of nonnumeric, as well as numeric, data, the programs must be able to explore a large space of possibly interesting relationships to discover those that are novel and interesting. Thus, the framework for the discovery program must facilitate proposing and selecting the next task to perform and performing the selected tasks. The framework we describe, called the agenda- and justification-based framework, has several properties that are desirable in semiautonomous discovery systems: It provides a mechanism for estimating the plausibility of tasks, it uses heuristics to propose and perform tasks, and it facilitates the encoding of general discovery strategies and the use of background knowledge. We have implemented the framework and our heuristics in a prototype program, HAMB, and have evaluated them in the domain of protein crystallization. Our results demonstrate that both reasons given for performing tasks and estimates of the interestingness of the concepts and hypotheses examined by HAMB contribute to its performance and that the program can discover novel, interesting relationships in biological data.
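As a rough illustration of the agenda idea described in this abstract, the sketch below keeps a pool of proposed tasks, attaches reasons (justifications) to each, and selects the most plausible task next. The class names, the example task names, and the rule that plausibility is the sum of reason strengths are all assumptions made for this example; the abstract does not give HAMB's actual data structures or scoring formula.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A proposed task together with the reasons given for performing it."""
    name: str
    reasons: dict = field(default_factory=dict)  # reason text -> strength

    def plausibility(self) -> float:
        # Assumption for this sketch: plausibility is the sum of reason strengths.
        return sum(self.reasons.values())


class Agenda:
    """Holds proposed tasks and hands back the most plausible one."""

    def __init__(self):
        self._tasks = {}

    def propose(self, name: str, reason: str, strength: float) -> None:
        # Proposing an existing task again adds a reason instead of duplicating it.
        task = self._tasks.setdefault(name, Task(name))
        task.reasons[reason] = max(strength, task.reasons.get(reason, 0.0))

    def pop_best(self):
        # Select and remove the task with the highest plausibility, if any.
        if not self._tasks:
            return None
        best = max(self._tasks.values(), key=lambda t: t.plausibility())
        del self._tasks[best.name]
        return best


agenda = Agenda()
agenda.propose("examine attribute pH", "appears in a strong rule", 0.4)
agenda.propose("examine attribute temperature", "user marked as interesting", 0.5)
agenda.propose("examine attribute pH", "rarely examined so far", 0.3)

first = agenda.pop_best()
second = agenda.pop_best()
```

In this toy run the pH task is selected first because two independent reasons support it, which is the kind of behavior a justification-based agenda is meant to produce.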
Representation of Protein-Sequence Information by Amino Acid Subalphabets
Andersen, Claus A. F., Brunak, Soren
Within computational biology, algorithms are constructed with the aim of extracting knowledge from biological data, in particular, data generated by the large genome projects, where gene and protein sequences are produced in high volume. In this article, we explore new ways of representing protein-sequence information, using machine learning strategies, where the primary goal is the discovery of novel powerful representations for use in AI techniques. In the case of proteins and the 20 different amino acids they typically contain, it is also a secondary goal to discover how the current selection of amino acids -- which now are common in proteins -- might have emerged from simpler selections, or alphabets, in use earlier during the evolution of living organisms.
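To make the notion of an amino acid subalphabet concrete, the following sketch re-encodes a protein sequence from the 20-letter alphabet into a smaller one. The four-group partition and the group symbols are hypothetical and chosen only for illustration; they are not the subalphabets studied in the article.

```python
# Hypothetical grouping for illustration only; not taken from the article.
GROUPS = {
    "h": "AVLIMFWC",  # treated here as hydrophobic
    "p": "STNQYGPH",  # treated here as polar or small
    "+": "KR",        # positively charged
    "-": "DE",        # negatively charged
}

# Invert the grouping into a residue -> group-symbol lookup table.
RESIDUE_TO_GROUP = {
    residue: symbol for symbol, members in GROUPS.items() for residue in members
}


def reduce_sequence(sequence: str, unknown: str = "x") -> str:
    """Re-encode a one-letter amino acid sequence in the reduced alphabet.

    Characters outside the 20 standard residues map to `unknown`.
    """
    return "".join(RESIDUE_TO_GROUP.get(res, unknown) for res in sequence.upper())


reduced = reduce_sequence("MKVDE")
```

A learning method can then be trained on the reduced strings instead of the raw sequences, so different partitions can be compared by how well the resulting representation supports prediction.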
Calendar of Events
NASA Ames Research Center Polish Academy of Sciences URL: www.taai.org.tw/announce/ (PRICAI 2004). (ICKEDS 2004). This book looks at some of the results of the synergy among AI, cognitive science, and education. Examples include virtual students whose misconceptions force students to reflect on their own knowledge, intelligent tutoring systems, and speech recognition technology that helps students learn to read.
Report on the Second International Joint Conference on Autonomous Agents and Multiagent Systems
Rosenschein, Jeffrey S., Wooldridge, Michael
The Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-03) was held in Melbourne, Australia, in July 2003. Attracting nearly 500 delegates, the event confirmed AAMAS as the academic main event for researchers with an interest in multiagent systems. We summarize the conference highlights and report on the associated workshops, tutorials, and emerging trends.
Applying Inductive Logic Programming to Predicting Gene Function
One of the fastest advancing areas of modern science is functional genomics. This science seeks to understand how the complete complement of molecular components of living organisms (nucleic acid, protein, small molecules, and so on) interact to form living organisms. Functional genomics is of interest to AI because the relationship between machines and living organisms is central to AI and because the field is an instructive and fun domain to apply and sharpen AI tools and ideas, requiring complex knowledge representation, reasoning, learning, and so on. This article describes two machine learning (inductive logic programming [ILP])-based approaches to the bioinformatic problem of predicting protein function from amino acid sequence. The first approach is based on using ILP as a way of bootstrapping from conventional sequence-based homology methods. The second approach used protein-functional ontologies to provide function classes and a hybrid ILP method to predict function directly from sequence. Both ILP approaches were successful in producing accurate prediction rules that could be interpreted biologically. The work was also of interest to machine learning research because it highlighted the flexibility of ILP systems in dealing with heterogeneous data, the importance of problems where classes are related hierarchically, and problems where examples have more than one functional class.
The Semantic Web and Language Technology, Its Potential and Practicalities: EUROLAN-2003
Cristea, Dan, Ide, Nancy, Tufis, Dan
This year's school was organized by the Faculty of Computer Science at the A. I. Cuza University of Iasi, the Research Institute for Artificial Intelligence at the Romanian Academy in Bucharest, and the Department of Computer Science at Vassar College. EUROLAN lecturers treated its potential in terms of what it might--and might not--bring to us in the future and how great its impact will really be. Although it is not yet clear whether the current vision of the semantic web will indeed reach its expectations, there are more and more opinions that it represents a major technological step that will permanently [...] Later in the school, the focus turned to ontologies, which is where the true power of the semantic web lies. [...] the topic of ontology development [...] start somewhere, somehow, even if what emerges is a variety of ontological stores from which to choose. The EUROLAN summer school also included a workshop on ontologies and information extraction, a student workshop on applied natural [...]
Semantic Integration Workshop at the Second International Semantic Web Conference (ISWC-2003)
Doan, AnHai, Halevy, Alon Y., Noy, Natalya F.
In numerous distributed environments, including today's World Wide Web, enterprise data management systems, large science projects, and the emerging semantic web, applications will inevitably use the information described by multiple ontologies and schemas. We organized the Workshop on Semantic Integration at the Second International Semantic Web Conference to bring together different communities working on the issues of enabling integration among different resources. The workshop generated a lot of interest and attracted more than 70 participants.
Using Machine Learning to Design and Interpret Gene-Expression Microarrays
Molla, Michael, Waddell, Michael, Page, David, Shavlik, Jude
Gene-expression microarrays, commonly called gene chips, make it possible to simultaneously measure the rate at which a cell or tissue is expressing -- translating into a protein -- each of its thousands of genes. One can use these comprehensive snapshots of biological activity to infer regulatory pathways in cells; identify novel targets for drug design; and improve the diagnosis, prognosis, and treatment planning for those suffering from disease. However, the amount of data this new technology produces is more than one can manually analyze. Hence, the need for automated analysis of microarray data offers an opportunity for machine learning to have a significant impact on biology and medicine. This article describes microarray technology, the data it produces, and the types of machine learning tasks that naturally arise with these data. It also reviews some of the recent prominent applications of machine learning to gene-chip data, points to related tasks where machine learning might have a further impact on biology and medicine, and describes additional types of interesting data that recent advances in biotechnology allow biomedical researchers to collect.
Representation Dependence in Probabilistic Inference
Non-deductive reasoning systems are often representation dependent: representing the same situation in two different ways may cause such a system to return two different answers. Some have viewed this as a significant problem. For example, the principle of maximum entropy has been subjected to much criticism due to its representation dependence. There has, however, been almost no work investigating representation dependence. In this paper, we formalize this notion and show that it is not a problem specific to maximum entropy. In fact, we show that any representation-independent probabilistic inference procedure that ignores irrelevant information is essentially entailment, in a precise sense. Moreover, we show that representation independence is incompatible with even a weak default assumption of independence. We then show that invariance under a restricted class of representation changes can form a reasonable compromise between representation independence and other desiderata, and provide a construction of a family of inference procedures that provides such restricted representation independence, using relative entropy.
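A small worked example may help show what representation dependence means for maximum entropy. With no constraints, the maximum entropy distribution over a finite set of outcomes is uniform, so the probability assigned to the same event changes when the outcome space is refined. The outcome names and helper functions below are assumptions made for this illustration and are not taken from the paper.

```python
from fractions import Fraction


def maxent_unconstrained(outcomes):
    """Maximum entropy distribution with no constraints: uniform over outcomes."""
    outcomes = list(outcomes)
    return {o: Fraction(1, len(outcomes)) for o in outcomes}


def probability(dist, event):
    """Total probability of the outcomes that belong to the event."""
    return sum(p for o, p in dist.items() if o in event)


# Representation 1: an object is either colorful or colorless.
coarse = maxent_unconstrained(["colorful", "colorless"])
p_coarse = probability(coarse, {"colorful"})

# Representation 2: the same situation, with "colorful" split into three colors.
fine = maxent_unconstrained(["red", "blue", "green", "colorless"])
p_fine = probability(fine, {"red", "blue", "green"})
```

Here `p_coarse` is 1/2 and `p_fine` is 3/4, although both describe the event that the object is colorful; only the vocabulary used to describe the situation has changed.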