Machine Translation


Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling

arXiv.org Artificial Intelligence

For example, translation between languages exhibits regular and persistent patterns at different scales (across sentences, paragraphs, documents). In general, language patterns are stable over time. We know what type of data we need to expand to new languages. And while it may be challenging to acquire the data for rare or only spoken languages, it is easy to judge whether newly acquired data is what we need. In contrast, use cases where data lacks strong, persistent topological features, or where the structure is highly fragmented or unstable over time, may not be as well suited to data scaling approaches.


Review for NeurIPS paper: Cross-lingual Retrieval for Iterative Self-Supervised Training

Neural Information Processing Systems

The paper proposes a novel approach for unsupervised parallel corpus mining and unsupervised machine translation, improving on the SoTA on both tasks by significant margins. Experiments are conducted on the Tatoeba retrieval task and a 25-language translation task based on a combination of a few academic benchmark datasets. Careful experiments demonstrate how using parallel data from just one language pair significantly improves the cross-lingual embedding alignment in a multilingual de-noising auto-encoder. All reviewers support acceptance, as does the AC. Please make sure to incorporate the clarifications from the author response in the final version of the paper.


HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja

arXiv.org Artificial Intelligence

While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.


Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

arXiv.org Artificial Intelligence

Generating adversarial examples contributes to mainstream neural machine translation (NMT) robustness. However, popular adversarial policies are apt for fixed tokenization, hindering their efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning (RL), we propose the 'DexChar policy', which introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during training adversaries. Experiments show that our method is compatible with the scenario where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.


Reference-free Evaluation Metrics for Text Generation: A Survey

arXiv.org Artificial Intelligence

A number of automatic evaluation metrics have been proposed for natural language generation systems. The most common approach to automatic evaluation is the use of a reference-based metric that compares the model's output with gold-standard references written by humans. However, it is expensive to create such references, and for some tasks, such as response generation in dialogue, creating references is not a simple matter. Therefore, various reference-free metrics have been developed in recent years. In this survey, which intends to cover the full breadth of all NLG tasks, we investigate the most commonly used approaches, their application, and their other uses beyond evaluating models. The survey concludes by highlighting some promising directions for future research.


Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model

arXiv.org Artificial Intelligence

Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.


Reviews: Dual Learning for Machine Translation

Neural Information Processing Systems

The same goal has been pursued by e.g. The paper does not sufficiently review the work that has been done in this direction and only focuses on the recent work by Sennrich et al. Since the goal of exploiting monolingual data for MT has been in the focus of many works, more empirical comparisons are needed to demonstrate the superiority of their system. It would have been easy to e.g. Also, there has been work on the unsupervised training of noisy-channel models [3] which needs to be mentioned.


Auslan-Daily: Australian Sign Language Translation for Daily Communication and News

Neural Information Processing Systems

Sign language translation (SLT) aims to convert a continuous sign language video clip into a spoken language. Considering different geographic regions generally have their own native sign languages, it is valuable to establish corresponding SLT datasets to support related communication and research. Auslan, as a sign language specific to Australia, still lacks a dedicated large-scale dataset for SLT. To fill this gap, we curate an Australian Sign Language translation dataset, dubbed Auslan-Daily, which is collected from the Auslan educational TV series and Auslan TV programs. The former involves daily communications among multiple signers in the wild, while the latter comprises sign language videos for up-to-date news, weather forecasts, and documentaries. In particular, Auslan-Daily has two main features: (1) the topics are diverse and signed by multiple signers, and (2) the scenes in our dataset are more complex, e.g., captured in various environments, with gesture interference during multi-signer interactions and various camera positions.


DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Neural Information Processing Systems

However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG).


Cross-Entropy Attacks to Language Models via Rare Event Simulation

arXiv.org Artificial Intelligence

Black-box textual adversarial attacks are challenging due to the lack of model information and the discrete, non-differentiable nature of text. Existing methods often lack versatility for attacking different models, suffer from limited attacking performance due to the inefficient optimization with word saliency ranking, and frequently sacrifice semantic integrity to achieve better attack outcomes. This paper introduces a novel approach to textual adversarial attacks, which we call Cross-Entropy Attacks (CEA), that uses Cross-Entropy optimization to address the above issues. Our CEA approach defines adversarial objectives for both soft-label and hard-label settings and employs CE optimization to identify optimal replacements. Through extensive experiments on document classification and language translation problems, we demonstrate that our attack method excels in terms of attacking performance, imperceptibility, and sentence quality.