AITopics

2501.13779

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)

Neural Information Processing Systems, Jan-22-2025, 02:01:45 GMT

Review for NeurIPS paper: Cross-lingual Retrieval for Iterative Self-Supervised Training

The paper proposes a novel approach for unsupervised parallel corpus mining and unsupervised machine translation, improving on the SoTA on both tasks by significant margins. Experiments are conducted on the Tatoeba retrieval task and a 25-language translation task based on a combination of a few academic benchmark datasets. Careful experiments demonstrate how using parallel data from just one language pair significantly improves the cross-lingual embedding alignment in a multilingual de-noising auto-encoder. All reviewers support acceptance, as does the AC. Please make sure to incorporate the clarifications from the author response in the final version of the paper.
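To make the retrieval-based mining mentioned in the review concrete, here is a minimal sketch of mining candidate sentence pairs by nearest-neighbour search over sentence embeddings. It is an illustration only, not the paper's method: plain cosine similarity with a fixed cutoff stands in for whatever scoring the authors use, and the names `cosine`, `mine_pairs`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source embedding, keep its most similar target if the score clears the cutoff.

    Assumption: embeddings come from some shared multilingual encoder; the
    cutoff value is an arbitrary illustrative choice.
    """
    if not tgt_vecs:
        return []
    pairs = []
    for i, s in enumerate(src_vecs):
        sims = [cosine(s, t) for t in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs


# Toy 2-d "embeddings": the third source sentence has no close match and is dropped.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.1, 0.9], [0.9, 0.1], [-1.0, 0.2]]
mined = mine_pairs(src, tgt)
```

In an iterative setup, pairs mined this way would be fed back as training data and the embeddings recomputed; that loop is omitted here.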

cross-lingual retrieval, iterative self-supervised training, neurips paper

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.73)

HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja

Song, Seyoung, Yoo, Haneul, Jin, Jiho, Cho, Kyunghyun, Oh, Alice

While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.

artificial intelligence, computational linguistic, natural language, (12 more...)

2501.11951

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
North America > United States > Washington > King County > Seattle (0.04)

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Zou, Wei, Huang, Shujian, Chen, Jiajun

Generating adversarial examples contributes to the robustness of mainstream neural machine translation (NMT). However, popular adversarial policies are designed for fixed tokenization, which hinders their efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning (RL), we propose the 'DexChar policy', which introduces character perturbations into the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required when training adversaries. Experiments show that our method is compatible with scenarios where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.

artificial intelligence, machine learning, natural language, (16 more...)

2501.12183

Country: Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Reference-free Evaluation Metrics for Text Generation: A Survey

Ito, Takumi, van Deemter, Kees, Suzuki, Jun

A number of automatic evaluation metrics have been proposed for natural language generation systems. The most common approach to automatic evaluation is the use of a reference-based metric that compares the model's output with gold-standard references written by humans. However, it is expensive to create such references, and for some tasks, such as response generation in dialogue, creating references is not a simple matter. Therefore, various reference-free metrics have been developed in recent years. In this survey, which intends to cover the full breadth of all NLG tasks, we investigate the most commonly used approaches, their application, and their other uses beyond evaluating models. The survey concludes by highlighting some promising directions for future research.

computational linguistic, machine learning, natural language, (15 more...)

2501.12011

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)

Genre: Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model

Wang, Minghan, Pham, Viet-Thanh, Moghimifar, Farhad, Vu, Thuy-Trang

Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.

large language model, machine learning, translation, (19 more...)

2501.11953

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.67)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Neural Information Processing Systems, Jan-20-2025, 11:46:40 GMT

Reviews: Dual Learning for Machine Translation

The same goal has been pursued by e.g. The paper does not sufficiently review the work that has been done in this direction and only focuses on the recent work by Sennrich et al. Since the goal of exploiting monolingual data for MT has been in the focus of many works, more empirical comparisons are needed to demonstrate the superiority of their system. It would have been easy to e.g. Also, there has been work on the unsupervised training of noisy-channel models [3] which needs to be mentioned.

machine translation, monolingual data, translation model, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.99)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)

Neural Information Processing Systems, Jan-20-2025, 03:35:57 GMT

Auslan-Daily: Australian Sign Language Translation for Daily Communication and News

Sign language translation (SLT) aims to convert a continuous sign language video clip into a spoken language. Considering that different geographic regions generally have their own native sign languages, it is valuable to establish corresponding SLT datasets to support related communication and research. Auslan, as a sign language specific to Australia, still lacks a dedicated large-scale dataset for SLT. To fill this gap, we curate an Australian Sign Language translation dataset, dubbed Auslan-Daily, which is collected from the Auslan educational TV series and Auslan TV programs. The former involves daily communications among multiple signers in the wild, while the latter comprises sign language videos for up-to-date news, weather forecasts, and documentaries. In particular, Auslan-Daily has two main features: (1) the topics are diverse and signed by multiple signers, and (2) the scenes in our dataset are more complex, e.g., captured in various environments, with gesture interference during multi-signer interactions and various camera positions.

auslan-daily, australian sign language translation, dataset, (5 more...)

Country: Oceania > Australia (0.27)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Neural Information Processing Systems, Jan-20-2025, 01:05:30 GMT

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG).

daspeech, decoder, fast and high-quality speech-to-speech translation, (5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.56)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.43)

arXiv.org Artificial Intelligence, Jan-20-2025

Cross-Entropy Attacks to Language Models via Rare Event Simulation

Ni, Mingze, Gong, Yongshun, Liu, Wei

Black-box textual adversarial attacks are challenging due to the lack of model information and the discrete, non-differentiable nature of text. Existing methods often lack versatility for attacking different models, suffer from limited attacking performance due to the inefficient optimization with word saliency ranking, and frequently sacrifice semantic integrity to achieve better attack outcomes. This paper introduces a novel approach to textual adversarial attacks, which we call Cross-Entropy Attacks (CEA), that uses Cross-Entropy optimization to address the above issues. Our CEA approach defines adversarial objectives for both soft-label and hard-label settings and employs CE optimization to identify optimal replacements. Through extensive experiments on document classification and language translation problems, we demonstrate that our attack method excels in terms of attacking performance, imperceptibility, and sentence quality.
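As background for the Cross-Entropy optimization the abstract refers to, the following is a minimal sketch of the generic cross-entropy method applied to a discrete search problem (choosing one option per position). It is not the authors' CEA attack: the scoring function is a toy stand-in for a black-box attack objective, and `cross_entropy_search` and all of its parameters are hypothetical names chosen for illustration.

```python
import random


def cross_entropy_search(score, num_positions, num_choices, iters=30,
                         samples=200, elite_frac=0.1, smoothing=0.7, seed=0):
    """Generic cross-entropy method over sequences of discrete choices.

    Keeps an independent categorical distribution per position, samples
    candidates, and shifts each distribution toward the top-scoring samples.
    """
    rng = random.Random(seed)
    # probs[i][c] is the probability of picking choice c at position i.
    probs = [[1.0 / num_choices] * num_choices for _ in range(num_positions)]
    n_elite = max(1, int(samples * elite_frac))
    best, best_score = None, float("-inf")
    for _ in range(iters):
        population = [
            [rng.choices(range(num_choices), weights=probs[i])[0]
             for i in range(num_positions)]
            for _ in range(samples)
        ]
        population.sort(key=score, reverse=True)
        elite = population[:n_elite]
        top_score = score(elite[0])
        if top_score > best_score:
            best, best_score = elite[0], top_score
        # Re-estimate each position's distribution from the elite set,
        # smoothed with the previous estimate to avoid collapsing too early.
        for i in range(num_positions):
            counts = [0] * num_choices
            for cand in elite:
                counts[cand[i]] += 1
            for c in range(num_choices):
                probs[i][c] = (smoothing * counts[c] / n_elite
                               + (1.0 - smoothing) * probs[i][c])
    return best, best_score


# Toy objective: number of positions that agree with a hidden target sequence.
target = [2, 0, 1, 3, 1]
found, found_score = cross_entropy_search(
    lambda cand: sum(a == b for a, b in zip(cand, target)),
    num_positions=len(target), num_choices=4)
```

In an attack setting, each position would correspond to a word slot, each choice to a candidate replacement, and the score to the attacker's objective computed from model outputs; those pieces are assumptions and are not specified in the abstract.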

artificial intelligence, machine translation, rare event simulation, (3 more...)

2501.11852

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.53)