Machine Translation
Learning Inter-Related Statistical Query Translation Models for English-Chinese Bi-Directional CLIR
Zhang, Yuejie (Fudan University) | Cen, Lei (Fudan University) | Jin, Cheng (Fudan University) | Xue, Xiangyang (Fudan University) | Fan, Jianping (The University of North Carolina at Charlotte)
To support more precise query translation for English-Chinese Bi-Directional Cross-Language Information Retrieval (CLIR), we have developed a novel framework by integrating a semantic network to characterize the correlations between multiple inter-related text terms of interest and learn their inter-related statistical query translation models. First, a semantic network is automatically generated from large-scale English-Chinese bilingual parallel corpora to characterize the correlations between a large number of text terms of interest. Second, the semantic network is exploited to learn the statistical query translation models for such text terms of interest. Finally, these inter-related query translation models are used to translate the queries more precisely and achieve more effective CLIR. Our experiments on a large number of official public data have obtained very positive results.
Constraint Optimization Approach to Context Based Word Selection
Matsuno, Jun (Kyoto University) | Ishida, Toru (Kyoto University)
Consistent word selection in machine translation is currently realized by resolving word sense ambiguity through the context of a single sentence or neighboring sentences. However, consistent word selection over the whole article has yet to be achieved. Consistency over the whole article is extremely important when applying machine translation to collectively developed documents like Wikipedia. In this paper, we propose to consider constraints between words in the whole article based on their semantic relatedness and contextual distance. The proposed method is successfully implemented in both statistical and rule-based translators. We evaluate those systems by translating 100 articles in the English Wikipedia into Japanese. The results show that the ratio of appropriate word selection for common nouns increased to around 75% with our method, while it was around 55% without our method.
SMT Versus AI Redux: How Semantic Frames Evaluate MT More Accurately
Lo, Chi-kiu (Hong Kong University of Science and Technology) | Wu, Dekai (Hong Kong University of Science and Technology)
We argue for an alternative paradigm in evaluating machine translation quality that is strongly empirical but more accurately reflects the utility of translations, by returning to a representational foundation based on AI oriented lexical semantics, rather than the superficial flat n-gram and string representations recently dominating the field. Driven by such metrics as BLEU and WER, current SMT frequently produces unusable translations where the semantic event structure is mistranslated: who did what to whom, when, where, why, and how? We argue that it is time for a new generation of more “intelligent” automatic and semi-automatic metrics, based clearly on getting the structure right at the lexical semantics level. We show empirically that it is possible to use simple PropBank style semantic frame representations to surpass all currently widespread metrics' correlation to human adequacy judgments, including even HTER. We also show that replacing human annotators with automatic semantic role labeling still yields much of the advantage of the approach. We combine the best of both worlds: from an SMT perspective, we provide superior yet low-cost quantitative objective functions for translation quality; and yet from an AI perspective, we regain the representational transparency and clear reflection of semantic utility of structural frame-based knowledge representations.
Translation of Pronominal Anaphora between English and Spanish: Discrepancies and Evaluation
This paper evaluates the different tasks carried out in the translation of pronominal anaphora in a machine translation (MT) system. The MT interlingua approach named AGIR (Anaphora Generation with an Interlingua Representation) improves upon other proposals presented to date because it is able to translate intersentential anaphors, detect co-reference chains, and translate Spanish zero pronouns into English---issues hardly considered by other systems. The paper presents the resolution and evaluation of these anaphora problems in AGIR with the use of different kinds of knowledge (lexical, morphological, syntactic, and semantic). The translation of English and Spanish anaphoric third-person personal pronouns (including Spanish zero pronouns) into the target language has been evaluated on unrestricted corpora. We have obtained a precision of 80.4% and 84.8% in the translation of Spanish and English pronouns, respectively. Although we have only studied the Spanish and English languages, our approach can be easily extended to other languages such as Portuguese, Italian, or Japanese.
Correction of Noisy Sentences using a Monolingual Corpus
Correction of Noisy Natural Language Text is an important and well studied problem in Natural Language Processing. It has a number of applications in domains like Statistical Machine Translation, Second Language Learning and Natural Language Generation. In this work, we consider some statistical techniques for Text Correction. We define the classes of errors commonly found in text and describe algorithms to correct them. The data has been taken from a poorly trained Machine Translation system. The algorithms use only a language model in the target language in order to correct the sentences. We use phrase based correction methods in both the algorithms. The phrases are replaced and combined to give us the final corrected sentence. We also present the methods to model different kinds of errors, in addition to results of the working of the algorithms on the test set. We show that one of the approaches fail to achieve the desired goal, whereas the other succeeds well. In the end, we analyze the possible reasons for such a trend in performance.
Statistical Machine Translation with Factored Translation Model: MWEs, Separation of Affixes, and Others
Okita, Tsuyoshi (Dublin City University) | Ceausu, Alexandru (Dublin City University) | Way, Andy (Dublin City University)
Expressions (MWEs) (Okita et al. 2010), this may improve the overall translation. For example in EN-JP, the empirical evidences 2007; Koehn 2010) intends to handle morphologically rich suggest that we separate affix(es) and word stem(s) since it languages in the target side by integrating additional linguistic obtains better BLEU score than the case when we do not separate markup at the word level, where each type of additional them although the adequacy decreases.
Given Bilingual Terminology in Statistical Machine Translation: MWE-Sensitve Word Alignment and Hierarchical Pitman-Yor Process-Based Translation Model Smoothing
Okita, Tsuyoshi (Dublin City University) | Way, Andy (Dublin City University)
This paper considers a scenario when we are given almost perfect knowledge about bilingual terminology in terms of a test corpus in Statistical Machine Translation (SMT). When the given terminology is part of a training corpus, one natural strategy in SMT is to use the trained translation model ignoring the given terminology. Then, two questions arises here. 1) Can a word aligner capture the given terminology? This is since even if the terminology is in a training corpus, it is often the case that a resulted translation model may not include these terminology. 2) Are probabilities in a translation model correctly calculated? In order to answer these questions, we did experiment introducing a Multi-Word Expression-sensitive (MWE-sensitive) word aligner and a hierarchical Pitman-Yor process-based translation model smoothing. Using 200k JP--EN NTCIR corpus, our experimental results show that if we introduce an MWE-sensitive word aligner and a new translation model smoothing, the overall improvement was 1.35 BLEU point absolute and 6.0% relative compared to the case we do not introduce these two.
Semantic Vector Machines
We first present our work in machine translation, during which we used aligned sentences to train a neural network to embed n-grams of different languages into an $d$-dimensional space, such that n-grams that are the translation of each other are close with respect to some metric. Good n-grams to n-grams translation results were achieved, but full sentences translation is still problematic. We realized that learning semantics of sentences and documents was the key for solving a lot of natural language processing problems, and thus moved to the second part of our work: sentence compression. We introduce a flexible neural network architecture for learning embeddings of words and sentences that extract their semantics, propose an efficient implementation in the Torch framework and present embedding results comparable to the ones obtained with classical neural language models, while being more powerful.
Notes on a New Philosophy of Empirical Science
This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation. The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 Gb, it is easy to justify a 10 Mb model). The complex models and abstractions learned on the basis of the raw data (images, language, etc) can then be reused to solve any specific learning problem, such as face recognition or machine translation.
True Knowledge: Open-Domain Question Answering Using Structured Knowledge and Inference
Tunstall-Pedoe, William (True Knowledge Ltd)
This article gives a detailed description of True Knowledge: a commercial, open-domain question answering platform. The system combines a large and growing structured knowledge base of common sense, factual and lexical knowledge; a natural language translation system that turns user questions into internal language-independent queries and an inference system that can answer those queries using both directly represented and inferred knowledge. The system is live and answers millions of questions per month asked by internet users.