The Computational Linguistics of Biological Sequences

Classics (Collection 2)

Shortly after Watson and Crick's discovery of the structure of DNA, and at about the same time that the genetic code and the essential facts of gene expression were being elucidated, the field of linguistics was being similarly revolutionized by the work of Noam Chomsky [Chomsky, 1955, 1957, 1959, 1963, 1965]. Observing that a seemingly infinite variety of language was available to individual human beings based on clearly finite resources and experience, he proposed a formal representation of the rules or syntax of language, called generative grammar, that could provide finite--indeed, concise--characterizations of such infinite languages. Just as the breakthroughs in molecular biology in that era served to anchor genetic concepts in physical structures and opened up entirely novel experimental paradigms, so did Chomsky's insight serve to energize the field of linguistics, with putative correlates of cognitive processes that could for the first time be reasoned about ... While Chomsky and his followers built extensively upon this foundation in the field of linguistics, generative grammars were also soon integrated into the framework of the theory of computation, and in addition now form the basis for efforts of computational linguists to automate the processing and understanding of human language. Since it is quite commonly asserted that DNA is a richly-expressive language for specifying the structures and processes of life, also with the potential for a seemingly infinite variety, it is surprising that relatively little has been done to apply to biological sequences the extensive results and methods developed over the intervening decades in the field of formal language theory.
While such an approach has been proposed [Brendel and Busse, 1984], most investigations along these lines have used grammar formalisms as tools for what are essentially information-theoretic studies [Ebeling and Jimenez-Montano, 1980; Jimenez-Montano, 1984], or have involved statistical analyses at the level of vocabularies (reflecting a more traditional notion of comparative linguistics) [Brendel et al., 1986; Pevzner et al., 1989a,b; Pietrokovski et al., 1990].
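
The idea that a finite grammar can characterize an infinite language can be made concrete with a small sketch. The grammar below is a hypothetical illustration, not one given in the text: a single nonterminal S with the rules S -> a S t | t S a | c S g | g S c | (empty), which derives idealized reverse-complement palindromes of any even length. The names COMPLEMENT, generate, and derives are assumptions introduced for this example only.

```python
# Illustrative sketch only: the grammar and all names here are assumptions,
# not taken from the source text.
# Rules: S -> a S t | t S a | c S g | g S c | ""
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def generate(depth):
    """Yield every string derivable from S using at most `depth` nested pair rules."""
    # Rule S -> "" (the empty string ends a derivation).
    yield ""
    if depth == 0:
        return
    for base, partner in COMPLEMENT.items():
        # Rule S -> base S partner, applied around every shorter derivation.
        for inner in generate(depth - 1):
            yield base + inner + partner


def derives(seq):
    """Return True if `seq` can be derived from S."""
    if seq == "":
        return True
    if len(seq) < 2:
        return False
    # The outermost symbols must be a complementary pair; recurse on the inside.
    return COMPLEMENT.get(seq[0]) == seq[-1] and derives(seq[1:-1])


if __name__ == "__main__":
    print(sorted(generate(1)))
    print(derives("gatc"), derives("gatt"))
```

Five rules suffice here, yet raising the depth bound yields ever longer strings without limit, which is the sense in which the characterization is finite while the language is not.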

Semantic categories of nominals for conceptual dependency analysis of natural language


Abstract: A system for the semantic categorization of conceptual objects (nominals) is provided. The system is intended to aid computer understanding of natural language. Specific implementations for noun-pairs and prepositional phrases are offered.

Understanding natural language


This paper describes a computer system for understanding English. It is based on the belief that in modeling language understanding, we must deal in an integrated way with all of the aspects of language--syntax, semantics, and inference. It enters into a dialog with a person, responding to English sentences with actions and English replies, asking for clarification when its heuristic programs cannot understand a sentence through the use of syntactic, semantic, contextual, and physical knowledge. By developing special procedural representations for syntax, semantics, and inference, we gain flexibility and power.

Automatic translation of languages since 1960: A linguist's view


Language was considered just a "bunch of words" and the primary task for early machine translation (MT) was to build machines large enough to hold all the words necessary in the translation process. These means included the printing out of the several possible solutions of ambiguous text segments to let the reader decide for himself the correct meaning, printing out the ambiguous source language text, and other temporary expedients. In particular, one must understand the rules under which such a complex system as human language operates and how the mechanism of this operation can be simulated by automatic means, i.e., without any human intervention at all. The second problem, the simulation of human language behavior by automatic means, is almost impossible to achieve, since language is an open and dynamic system in constant change and because the operation of the system is not yet completely understood.

Question-answering in English


To illustrate how this may be done in very simple cases we give rules which translate certain declarative sentences and questions involving the quantifiers 'some', 'every', 'any', and 'no' into a modified first-order predicate calculus, and answer the questions by comparing their translated forms with those of the declaratives. John kissed Mary. (1) Did John kiss Mary? (5) We begin by describing a method for translating a modest subset of English into a slightly modified first-order predicate calculus -- modified just enough to provide a representation for questions. We would like to have rules which transcribe such declarative sentences into predicate calculus formulae, such as ∀xMxj (7') ∃x ... The matrix will be preceded by a string of quantifiers and negations -- and possibly a question mark; we have found that the transcription rules which appear below produce unique and acceptable orderings of these symbols from unambiguous sentences of the specified type.
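
The simplest case mentioned above, answering "Did John kiss Mary?" from "John kissed Mary", can be sketched as follows. This is a minimal illustration under stated assumptions and not the paper's transcription rules: the tuple form (predicate, subject, object), the toy lexicon PAST_TO_BASE, and the functions translate and answer are all hypothetical, and quantifiers are not handled.

```python
# Illustrative sketch only: the representation and lexicon are assumptions,
# not the transcription rules of the paper.
PAST_TO_BASE = {"kissed": "kiss", "saw": "see"}  # hypothetical toy lexicon


def translate(sentence):
    """Translate 'X verbed Y' or 'Did X verb Y?' into (is_question, formula)."""
    text = sentence.strip()
    is_question = text.endswith("?")
    words = text.rstrip("?.").lower().split()
    if is_question:
        # Question form: Did SUBJECT VERB OBJECT?
        if len(words) != 4 or words[0] != "did":
            raise ValueError(f"unsupported question: {sentence!r}")
        _, subject, verb, obj = words
        return True, (verb, subject, obj)
    # Declarative form: SUBJECT PAST-VERB OBJECT
    if len(words) != 3 or words[1] not in PAST_TO_BASE:
        raise ValueError(f"unsupported declarative: {sentence!r}")
    subject, verb, obj = words
    return False, (PAST_TO_BASE[verb], subject, obj)


def answer(question, declaratives):
    """Answer a question by comparing its translated form with the stored facts."""
    is_question, goal = translate(question)
    if not is_question:
        raise ValueError("expected a question")
    facts = {translate(d)[1] for d in declaratives}
    # A missing fact does not entail its negation, so report "Unknown", not "No".
    return "Yes" if goal in facts else "Unknown"


if __name__ == "__main__":
    print(answer("Did John kiss Mary?", ["John kissed Mary."]))
    print(answer("Did Mary kiss John?", ["John kissed Mary."]))
```

Because the declarative and the question reduce to the same formula, answering becomes a comparison of translated forms rather than of surface word strings.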

Transition Network Grammars for Natural Language Analysis


"The use of augmented transition network grammars for the analysis of natural language sentences is described. Structure-building actions associated with the arcs of the grammar network allow for the reordering, restructuring, and copying of constituents necessary to produce deep-structure representations of the type normally obtained from a transformational analysis, and conditions on the arcs allow for a powerful selectivity which can rule out meaningless analyses and take advantage of semantic information to guide the parsing. The advantages of this model for natural language analysis are discussed in detail and illustrated by examples. An implementation of an experimental parsing system for transition network grammars is briefly described." Communications of the ACM, Vol. 13, No. 10, October 1970, pp. 591-606 (reprinted in RNLP: 71-88)
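
A much-reduced sketch of the idea: a plain transition network whose arcs carry register-setting actions, so that a successful traversal builds a small structure instead of merely accepting the sentence. The lexicon, the network, the register names, and the parse function are assumptions made for illustration; arc conditions and recursive sub-networks, both central to the full formalism, are left out.

```python
# Illustrative sketch only: the lexicon, network, and register names are
# assumptions, not the experimental system described in the abstract.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "N", "cat": "N", "dogs": "N", "cats": "N",
    "chased": "V", "saw": "V",
}

# state -> list of arcs (word category, register to set or None, next state)
NETWORK = {
    "S0": [("DET", None, "S1"), ("N", "subject", "S2")],
    "S1": [("N", "subject", "S2")],
    "S2": [("V", "verb", "S3")],
    "S3": [("DET", None, "S4"), ("N", "object", "S5")],
    "S4": [("N", "object", "S5")],
}
FINAL = "S5"


def parse(words, state="S0", registers=None):
    """Follow arcs depth-first; return the filled registers, or None on failure."""
    registers = registers or {}
    if not words:
        return registers if state == FINAL else None
    category = LEXICON.get(words[0])
    for arc_category, register, next_state in NETWORK.get(state, []):
        if arc_category != category:
            continue
        # Structure-building action: copy the registers and record this word.
        updated = dict(registers)
        if register:
            updated[register] = words[0]
        result = parse(words[1:], next_state, updated)
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    print(parse("the dog chased a cat".split()))
    print(parse("dog the chased".split()))
```

Copying the registers on each arc keeps backtracking simple: a failed path leaves no trace in the registers of the path tried next.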

Natural language question-answering systems: 1969


In the meantime, Chomsky (1965) devised a paradigm for linguistic analysis that includes syntactic, semantic, and phonological components to account for the generation of natural language statements. This theory can be interpreted to imply that the meaning of a sentence can be represented as a semantically interpreted deep structure--i.e., ... From computer science's preoccupation with formal programming languages and compilers, there emerged another paradigm. The adoption and combination of these two new paradigms have resulted in a vigorous new generation of language processing systems characterized by sophisticated linguistic and logical processing of well-defined formal data structures. These included a social-conversation machine, systems that translated from English into limited logical calculi, and programs that attempted to answer questions from English text.

User's guide to QA3


A question-answering system. Tech. Note 15, AI Group, Stanford Research Institute, Menlo Park, Calif.