Industry
Extending an Information Extraction tool set to Central and Eastern European languages
Ignat, Camelia, Pouliquen, Bruno, Ribeiro, Antonio, Steinberger, Ralf
In a highly multilingual and multicultural environment such as the European Commission, which will soon have over twenty official languages, there is an urgent need for text analysis tools that use minimal linguistic knowledge so that they can be adapted to many languages without much human effort. We present two such Information Extraction tools that have already been adapted to various Western and Eastern European languages: one for the recognition of date expressions in text, and one for the detection of geographical place names and the visualisation of the results on geographical maps. An evaluation of the performance has produced very satisfying results.
Navigating multilingual news collections using automatically extracted information
Steinberger, Ralf, Pouliquen, Bruno, Ignat, Camelia
We present a text analysis tool set that allows analysts in various fields to sift through large collections of multilingual news items quickly and to find information that is of relevance to them. For a given document collection, the tool set automatically clusters the texts into groups of similar articles, extracts names of places, people and organisations, lists the user-defined specialist terms found, links clusters and entities, and generates hyperlinks. Through its daily news analysis operating on thousands of articles per day, the tool also learns relationships between people and other entities. The fully functional prototype system allows users to explore and navigate multilingual document collections across languages and time.
Multilingual person name recognition and transliteration
Pouliquen, Bruno, Steinberger, Ralf, Ignat, Camelia, Temnikova, Irina, Widiger, Anna, Zaghouani, Wajdi, Zizka, Jan
We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing systems. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of the news analysis system NewsExplorer, which clusters an average of 25,000 news articles per day to detect related news within the same and across different languages.
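The idea of matching name variants through one shared internal representation, rather than through pairwise bilingual transliteration, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from NewsExplorer: the function names `normalize_name` and `same_person`, the toy Cyrillic table `TOY_CYRILLIC`, the use of `difflib` similarity, and the threshold of 0.85.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical toy table: a handful of Cyrillic letters mapped to Latin.
# A real system would need full tables for each supported script.
TOY_CYRILLIC = str.maketrans({
    "п": "p", "у": "u", "т": "t", "и": "i", "н": "n",
    "в": "v", "л": "l", "а": "a", "д": "d", "м": "m", "р": "r",
})

def normalize_name(name: str) -> str:
    """Map a name to a script-neutral, lower-case, accent-free form."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (accents, diaeresis, ...) left by decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    latin = stripped.lower().translate(TOY_CYRILLIC)
    # Collapse runs of whitespace.
    return " ".join(latin.split())

def same_person(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as variants if their normalized forms are similar."""
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold

if __name__ == "__main__":
    print(normalize_name("José  Barroso"))
    print(same_person("Владимир Путин", "Vladimir Putin"))
```

Because every name is first mapped into the same representation, adding a new script only requires one new mapping into that representation instead of one transliteration scheme per language pair.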
Improving Term Extraction with Terminological Resources
Studies of different term extractors on a corpus of the biomedical domain revealed decreasing performance when applied to highly technical texts. The difficulty or impossibility of customising them to new domains is an additional limitation. In this paper, we propose to use external terminologies to influence generic linguistic data in order to improve the quality of the extraction. The tool we implemented exploits attested terms at different steps of the process: chunking, parsing and extraction of term candidates. Experiments reported here show that, using this method, more term candidates can be acquired with a higher level of reliability. We further describe the extraction process involving endogenous disambiguation implemented in the term extractor YaTeA.
A Massive Local Rules Search Approach to the Classification Problem
Malyshkin, Vladislav, Bakhramov, Ray, Gorodetsky, Andrey
An approach to the classification problem of machine learning, based on building local classification rules, is developed. The local rules are considered as projections of the global classification rules onto the event we want to classify. A massive global optimization algorithm is used to optimize a quality criterion. The algorithm, which has polynomial complexity in the typical case, is used to find all high-quality local rules. Another distinctive feature of the algorithm is the integration of attribute level selection (for ordered attributes) with rule searching and an original conflicting-rule resolution strategy. The algorithm is practical; it was tested on a number of data sets from the UCI repository, and a comparison with other prediction techniques is presented.
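As a rough illustration of what a "local rule" might look like, the sketch below builds rules only from the attribute values of the event being classified, keeps those that are well supported and pure on the training data, and resolves conflicts by weighted voting. This is a simplified stand-in under stated assumptions, not the authors' algorithm: it uses brute-force enumeration instead of their global optimization, handles only categorical attributes, and the names `local_rules` and `classify`, the quality criterion (purity with a minimum support), and the voting scheme are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def local_rules(query, rows, labels, max_len=2, min_support=2, min_purity=0.8):
    """Enumerate rules whose conditions are attribute values of `query`.

    Each rule is (attribute indices, predicted label, purity, support).
    """
    rules = []
    for k in range(1, max_len + 1):
        for attrs in combinations(range(len(query)), k):
            # Labels of training rows that agree with the query on `attrs`.
            covered = [
                lab for row, lab in zip(rows, labels)
                if all(row[a] == query[a] for a in attrs)
            ]
            if len(covered) < min_support:
                continue
            label, count = Counter(covered).most_common(1)[0]
            purity = count / len(covered)
            if purity >= min_purity:
                rules.append((attrs, label, purity, len(covered)))
    return rules

def classify(query, rows, labels, **kwargs):
    """Resolve conflicting rules by a purity-times-support weighted vote."""
    votes = Counter()
    for _, label, purity, support in local_rules(query, rows, labels, **kwargs):
        votes[label] += purity * support
    return votes.most_common(1)[0][0] if votes else None

if __name__ == "__main__":
    rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
    labels = ["yes", "yes", "no", "no"]
    print(classify(("sunny", "hot"), rows, labels))
```

The enumeration here is exponential in `max_len`, which is why it is capped at short rules; the abstract's claim of polynomial complexity in the typical case refers to the authors' own search procedure, not to this sketch.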
Searching for Globally Optimal Functional Forms for Inter-Atomic Potentials Using Parallel Tempering and Genetic Programming
Slepoy, A., Thompson, A. P., Peters, M. D.
We develop a Genetic Programming-based methodology that enables discovery of novel functional forms for classical inter-atomic force-fields used in molecular dynamics simulations. Unlike previous efforts in the field, which fit only the parameters of fixed functional forms, we use a novel algorithm to search the space of many possible functional forms. While a follow-on practical procedure will use experimental and ab initio data to find an optimal functional form for a force-field, we first validate the approach using a manufactured solution. This validation has the advantage of a well-defined metric of success. We manufactured a training set of atomic coordinate data with an associated set of global energies using the well-known Lennard-Jones inter-atomic potential. We performed an automatic functional form fitting procedure starting with a population of random functions, using a genetic programming functional formulation and a parallel tempering Metropolis-based optimization algorithm. Our massively parallel method independently discovered the Lennard-Jones function after searching for several hours on 100 processors and covering a minuscule portion of the configuration space. We find that the method is suitable for unsupervised discovery of functional forms for inter-atomic potentials/force-fields. We also find that our parallel tempering Metropolis-based approach significantly improves the optimization convergence time and takes good advantage of the parallel cluster architecture.
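Two ingredients named in this abstract have standard textbook forms that a short sketch can make concrete: the Lennard-Jones pair potential, V(r) = 4ε[(σ/r)^12 − (σ/r)^6], which could be used to manufacture global energies from atomic coordinates, and the usual parallel tempering rule for accepting a swap between two replicas. The function names, the reduced units (ε = σ = 1), and the use of a generic "cost" for each replica are assumptions for illustration; the authors' actual fitness function and genetic programming representation are not shown.

```python
import itertools
import math

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy: 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def total_energy(coords, epsilon=1.0, sigma=1.0):
    """Global energy of a configuration: sum over all distinct atom pairs."""
    return sum(
        lennard_jones(math.dist(a, b), epsilon, sigma)
        for a, b in itertools.combinations(coords, 2)
    )

def swap_probability(cost_i, cost_j, temp_i, temp_j):
    """Standard parallel tempering acceptance for exchanging two replicas:
    min(1, exp((1/T_i - 1/T_j) * (cost_i - cost_j)))."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (cost_i - cost_j)
    return 1.0 if delta >= 0 else math.exp(delta)

if __name__ == "__main__":
    # Three atoms on a line, one unit apart, in reduced units.
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    print(total_energy(coords))
    print(swap_probability(2.0, 1.0, temp_i=1.0, temp_j=2.0))
```

With this rule, a cold replica that holds a worse candidate than a hotter replica always swaps with it, so good candidates migrate toward low temperatures while hot replicas keep exploring.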
Expressing Implicit Semantic Relations without Supervision
We present an unsupervised learning algorithm that mines large text corpora for patterns that express implicit semantic relations. For a given input word pair X:Y with some unspecified semantic relations, the corresponding output list of patterns
Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches
Pyysalo, Sampo, Salakoski, Tapio, Aubin, Sophie, Nazarenko, Adeline
We study the adaptation of Link Grammar Parser to the biomedical sublanguage with a focus on domain terms not found in a general parser lexicon. Using two biomedical corpora, we implement and evaluate three approaches to addressing unknown words: automatic lexicon expansion, the use of morphological clues, and disambiguation using a part-of-speech tagger. We evaluate each approach separately for its effect on parsing performance and consider combinations of these approaches. In addition to a 45% increase in parsing efficiency, we find that the best approach, incorporating information from a domain part-of-speech tagger, offers a statistically significant 10% relative decrease in error. The adapted parser is available under an open-source license at http://www.it.utu.fi/biolg .
Predictions as statements and decisions
This paper is based on my invited talk at the 19th Annual Conference on Learning Theory (Pittsburgh, PA, June 24, 2006). In recent years COLT invited talks have tended to aim at establishing connections between the traditional concerns of the learning community and the work done by other communities (such as game theory, statistics, information theory, and optimization). Following this tradition, I will argue that some ideas from the foundations of probability can be fruitfully applied in competitive on-line learning. In this paper I will use the following informal taxonomy of predictions (reminiscent of Shafer's [36], Figure 2, taxonomy of probabilities): D-predictions are mere Decisions. They can never be true or false but can be good or bad.
Evolutionary Design: Philosophy, Theory, and Application Tactics
Kryssanov, V. V., Tamaki, H., Kitamura, S.
Although they have contributed to remarkable improvements in some specific areas, attempts to develop a universal design theory have generally been characterized by failure. This paper sketches arguments for a new approach to engineering design based on semiotics, the science of signs. The approach aims to combine different design theories over all the product life cycle stages into one coherent and traceable framework. It also aims to bring together the designer's and user's understandings of the notion of 'good product'. Building on the insight from natural sciences that complex systems always exhibit a self-organizing meaning-influential hierarchical dynamics, objective laws controlling product development are found through an examination of design as a semiosis process. These laws are then applied to support evolutionary design of products. An experiment validating some of the theoretical findings is outlined, and concluding remarks are given.