Machine Translation
Identifying Useful Human Correction Feedback from an On-Line Machine Translation Service
Barrón-Cedeño, Alberto (Universitat Politècnica de Catalunya) | Màrquez, Lluís (Universitat Politècnica de Catalunya) | Henríquez Q., Carlos A. (Universitat Politècnica de Catalunya) | Formiga, Lluís (Universitat Politècnica de Catalunya) | Romero, Enrique (Universitat Politècnica de Catalunya) | May, Jonathan (SDL Language Weaver)
Post-editing feedback provided by users of on-line translation services offers an excellent opportunity for automatic improvement of statistical machine translation (SMT) systems. However, feedback provided by casual users is very noisy, and must be automatically filtered in order to identify the potentially useful cases. We present a study on automatic feedback filtering in a real weblog collected from Reverso.net. We extend and re-annotate a training corpus, define an extended set of simple features and approach the problem as a binary classification task, experimenting with linear and kernel-based classifiers and feature selection. Results on the feedback filtering task show a significant improvement over the majority class, but also a precision ceiling around 70-80%. This reflects the inherent difficulty of the problem and indicates that shallow features cannot fully capture the semantic nature of the problem. Despite the modest results on the filtering task, the classifiers prove effective in an application-based evaluation. The incorporation of a filtered set of feedback instances selected from a larger corpus significantly improves the performance of a phrase-based SMT system, according to a set of standard evaluation metrics.
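To make the filtering setup concrete, the following is a minimal sketch of how a feedback instance might be turned into shallow features and passed to a binary decision. The feature names, the thresholds and the hand-written rule in `looks_useful` are all assumptions made for illustration; the abstract does not list the actual features, and the study trains linear and kernel-based classifiers rather than using a fixed rule.

```python
def shallow_features(source, mt_output, user_edit):
    """Compute a few illustrative surface features for one feedback instance.

    These three features are hypothetical examples, not the paper's feature set.
    """
    src, mt, edit = source.split(), mt_output.split(), user_edit.split()
    mt_set, edit_set = set(mt), set(edit)
    union = mt_set | edit_set
    return {
        # Length of the user's suggestion relative to the source sentence.
        "length_ratio": len(edit) / max(len(src), 1),
        # Jaccard overlap between the MT output and the user's suggestion.
        "overlap_with_mt": len(mt_set & edit_set) / max(len(union), 1),
        # 1.0 if the user submitted the MT output unchanged.
        "unchanged": float(mt == edit),
    }


def looks_useful(feats):
    """Hand-set rule standing in for a trained binary classifier (assumption)."""
    return (
        feats["unchanged"] == 0.0
        and 0.5 <= feats["length_ratio"] <= 2.0
        and feats["overlap_with_mt"] > 0.2
    )


plausible = shallow_features("the cat sleeps", "el gato duerme mal", "el gato duerme")
noise = shallow_features("the cat sleeps", "el gato duerme mal", "asdf")
print(looks_useful(plausible), looks_useful(noise))
```

In a real system the rule would be replaced by a classifier trained on annotated feedback, which is where the reported precision ceiling comes from.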
A Topic-Based Coherence Model for Statistical Machine Translation
Xiong, Deyi (Soochow University) | Zhang, Min (Soochow University)
Coherence that ties sentences of a text into a meaningfully connected structure is of great importance to text generation and translation. In this paper, we propose a topic-based coherence model to produce coherence for document translation, in terms of the continuity of sentence topics in a text. We automatically extract a coherence chain for each source text to be translated. Based on the extracted source coherence chain, we adopt a maximum entropy classifier to predict the target coherence chain that defines a linear topic structure for the target document. The proposed topic-based coherence model then uses the predicted target coherence chain to help the decoder select coherent word/phrase translations. Our experiments show that incorporating the topic-based coherence model into machine translation achieves substantial improvement over both the baseline and previous methods that integrate document topics rather than coherence chains into machine translation.
Artificial Intelligence on Mobile Devices: An Introduction to the Special Issue
Yang, Qiang (Huawei Noah’s Ark Lab) | Zhao, Feng (Microsoft Research Asia)
We will see more and more applications of AI on mobile devices. This special issue of AI Magazine is devoted to some exemplary works of AI on mobile devices. We include four works that range from mobile activity recognition and air-quality detection to machine translation and image compression. These works were chosen from a variety of sources, including the International Joint Conference on Artificial Intelligence 2011 Special Track on Integrated and Embedded AI Systems, held in Barcelona, Spain, in July 2011.
Speaking Louder than Words with Pictures Across Languages
Finch, Andrew (NICT) | Song, Wei (Canon Inc.) | Tanaka-Ishii, Kumiko (Kyushu University) | Sumita, Eiichiro (NICT)
In this article, we investigate the possibility of cross-language communication using a synergy of words and pictures on mobile devices. Communicating with only pictures is in itself a very powerful strategy, but is limited in expressiveness. On the other hand, words can express everything you could wish to say, but they are cumbersome to work with on mobile devices, and need to be translated in order for their meaning to be understood. Automatic translations can contain errors that pervert the communication process, and this may undermine the users’ confidence when expressing themselves across language barriers. Our idea is to create a user interface for cross-language communication that uses pictures as the primary mode of input, and words to express the detailed meaning. This interface creates a visual process of communication that occurs on two heterogeneous channels that can support each other. We implemented this user interface as an application on the Apple iPad tablet, and performed a set of experiments to determine its usefulness as a translation aid for travellers.
Automated Non-Content Word List Generation Using hLDA
Krug, Wayne (Language Computer Corporation) | Tomlinson, Marc T. (Language Computer Corporation)
In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative.
Applying Automated Language Translation at a Global Enterprise Level
Rychtyckyj, Nestor (Ford Motor Company) | Plesco, Craig (Ford Motor Company)
In 2007 we presented a paper that described the application of Natural Language Processing (NLP) and Machine Translation (MT) for the automated translation of process build instructions from English to other languages to support Ford's assembly plants in non-English speaking countries. This project has continued to evolve with the addition of new languages and improvements to the translation process. However, we discovered that there was a large demand for automated language translation across all of Ford Motor Company and we decided to expand the scope of our project to address these requirements. This paper will describe our efforts to meet all of Ford's internal translation requirements with AI and MT technology and focus on the challenges and lessons that we learned from applying advanced technology across an entire corporation.
Multi-Engine Machine Translation as a Lifelong Machine Learning Problem
Federmann, Christian (German Research Center for Artificial Intelligence)
We describe an approach for multi-engine machine translation that uses machine learning methods to train one or several classifiers for a given set of candidate translations. Contrary to existing approaches in quality estimation which only consider a single translation at a time, we explicitly model pairwise comparison with our feature vectors. We discuss several challenges our method is facing and discuss how lifelong machine learning could be applied to resolve these. We also show how the proposed architecture can be extended to allow human feedback to be included into the training process, improving the system's selection process over time.
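The pairwise idea can be sketched as follows: instead of scoring each candidate in isolation, every ordered pair of candidates is encoded as one feature vector and a classifier decides which of the two is better; the candidate with the most pairwise wins is selected. The difference-vector encoding, the two feature dimensions, the engine names and the fixed `weights` (standing in for a trained classifier) are assumptions for illustration, not details taken from the abstract.

```python
from itertools import permutations


def pairwise_features(feats_a, feats_b):
    """Encode an ordered pair of candidates as a single vector (assumed: a - b)."""
    return [a - b for a, b in zip(feats_a, feats_b)]


def prefers_first(diff, weights):
    """Linear decision standing in for a trained pairwise classifier."""
    return sum(w * d for w, d in zip(weights, diff)) > 0


def select_best(candidates, weights):
    """Round-robin over all ordered pairs; return the candidate with most wins."""
    wins = {name: 0 for name in candidates}
    for a, b in permutations(candidates, 2):
        if prefers_first(pairwise_features(candidates[a], candidates[b]), weights):
            wins[a] += 1
    return max(wins, key=wins.get), wins


# Hypothetical per-candidate features, e.g. [fluency score, adequacy score].
candidates = {
    "engine_a": [0.5, 0.25],
    "engine_b": [0.75, 0.5],
    "engine_c": [0.25, 0.75],
}
best, wins = select_best(candidates, weights=[1.0, 1.0])
print(best, wins)
```

Human feedback of the kind the abstract mentions could enter such a design as additional labelled pairs used to retrain the pairwise classifier over time.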
Evaluating Indirect Strategies for Chinese-Spanish Statistical Machine Translation
Costa-jussà, M. R., Henríquez, C. A., Banchs, R. E.
Although Chinese and Spanish are two of the most widely spoken languages in the world, little research has been done on machine translation for this language pair. This paper investigates the state of the art of Chinese-to-Spanish statistical machine translation (SMT), which is currently one of the most popular approaches to machine translation. To this end, we describe the available parallel corpora: the Basic Traveller Expressions Corpus (BTEC), the Holy Bible, and the United Nations (UN) corpus. Additionally, we conduct experiments with the largest of these three corpora to explore alternative SMT strategies that use a pivot language. Three pivoting alternatives are considered: cascading, pseudo-corpus and triangulation. As the pivot language, we use English, Arabic or French. Results show that, for a phrase-based SMT system, English is the best pivot language between Chinese and Spanish. We propose a system output combination using the pivot strategies that is capable of outperforming the direct translation strategy. The main objective of this work is to motivate and involve the research community in working on this important language pair, given its demographic impact.
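Of the three pivoting strategies, triangulation is the easiest to show in a few lines. In its usual formulation, a source-to-pivot phrase table and a pivot-to-target phrase table are combined by marginalising over shared pivot phrases, p(tgt | src) = Σ_pivot p(pivot | src) · p(tgt | pivot). The sketch below assumes that formulation; the function name, the toy phrase tables and their probabilities are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables by summing over shared pivot phrases.

    Each table maps a phrase to a dict {translation: probability}.
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot * p_tgt
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy Chinese-English and English-Spanish tables with invented probabilities.
zh_en = {"你好": {"hello": 0.75, "hi": 0.25}}
en_es = {"hello": {"hola": 1.0}, "hi": {"hola": 0.5, "buenas": 0.5}}

zh_es = triangulate(zh_en, en_es)
print(zh_es)
```

By contrast, cascading chains two complete translation systems at decoding time (source to pivot, then pivot to target), and the pseudo-corpus strategy translates the pivot side of a pivot-target corpus to build synthetic source-target training data.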
A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora
Ramírez, Jessica C., Matsumoto, Yuji
The performance of a Statistical Machine Translation (SMT) system is directly proportional to the quality and size of the parallel corpus it uses. However, for some language pairs such corpora are scarce. Our long-term goal is to construct a Japanese-Spanish parallel corpus for SMT, since useful Japanese-Spanish parallel corpora are lacking. To address this problem, we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The approach focuses mainly on the syntactic features of both languages. Human evaluation on a sample shows promising results in comparison with the baseline.