In previous work, we experimented with various approaches to organizing collocational properties into features in a probabilistic classifier. We found that one type of organization in particular, which is rarely used in NLP, allows one to take advantage of infrequent but high-quality properties for an abstract discourse interpretation task. Based on an analysis of the experimental results, this paper suggests criteria for recognizing beneficial ways to include collocational information in probabilistic classifiers.

Introduction

Properties can be mapped to features in a machine learning algorithm in different ways, potentially yielding different results (see, e.g., Hu and Kibler 1996 and Pagallo and Haussler 1990). In previous work (Wiebe, Bruce, and Duan 1997), we experimented with various approaches to organizing collocational properties into features in a probabilistic classifier. We found that one type of organization in particular, which is rarely used in NLP, allows us to take advantage of infrequent but high-quality properties in order to classify utterances at an abstract level of interpretation.

The interpretation problem we address is highly dependent on the discourse context. We automated it to provide key information for a future discourse segmentation task in newspaper articles. Many other discourse tasks are at a similar level of abstraction, and the types of properties analyzed in this paper are important for them as well.

In this paper, we suggest criteria for recognizing which organization might yield the best results in a new application, based on an analysis of the properties, organizations, and experiments presented in the earlier paper.
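To make concrete the idea that the same properties can be mapped to features in different ways, the following is a minimal sketch in Python. It is not the organization studied in the earlier paper; the cue words, the class labels, and both function names are hypothetical and chosen only for illustration. The first function creates one binary feature per collocation word, while the second groups the words by the class they signal and creates one binary feature per class.

```python
from typing import Dict, List

# Hypothetical collocational properties: cue word -> class it is assumed to signal.
# Neither the words nor the class labels come from the paper.
COLLOCATIONS: Dict[str, str] = {
    "allegedly": "class_a",
    "outrageous": "class_a",
    "percent": "class_b",
}


def one_feature_per_property(tokens: List[str]) -> Dict[str, int]:
    """Organization 1 (assumed): one binary feature for every collocation word."""
    present = set(tokens)
    return {f"has_{word}": int(word in present) for word in COLLOCATIONS}


def one_feature_per_class(tokens: List[str]) -> Dict[str, int]:
    """Organization 2 (assumed): one binary feature per class,
    set to 1 if any word associated with that class occurs."""
    present = set(tokens)
    features: Dict[str, int] = {}
    for label in sorted(set(COLLOCATIONS.values())):
        hit = any(
            word in present
            for word, word_label in COLLOCATIONS.items()
            if word_label == label
        )
        features[f"has_{label}_cue"] = int(hit)
    return features


if __name__ == "__main__":
    utterance = ["he", "allegedly", "left"]
    print(one_feature_per_property(utterance))
    print(one_feature_per_class(utterance))
```

In the grouped version, several rare words contribute to the same feature, so their counts are pooled when the classifier's parameters are estimated. This is one plausible way an organization could benefit from infrequent properties; the sketch makes no claim about which organization performed best in the experiments.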
Lexical collocations are used by native speakers of a language almost without thought, yet nonnative speakers must learn them. A native speaker of English might say that he or she drinks "strong coffee," whereas a nonnative speaker might say "powerful coffee" or "sturdy coffee." Collocations tend to vary among languages and topic domains. Unfortunately, correctly identifying lexical collocations has been shown to be difficult, even for native speakers of the language. Machine-translation systems, which translate between natural languages, need lexical collocation information to produce natural-sounding or colloquially proper text.