AITopics

2505.15069

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceMay-22-2025

Granary: Speech Recognition and Translation Dataset in 25 European Languages

Koluguri, Nithin Rao, Sekoyan, Monica, Zelenfroynd, George, Meister, Sasha, Ding, Shuoyang, Kostandian, Sofia, Huang, He, Karpov, Nikolay, Balam, Jagadeesh, Lavrukhin, Vitaly, Peng, Yifan, Papi, Sara, Gaido, Marco, Brutti, Alessio, Ginsburg, Boris

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amount of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. Dataset will be made available at https://hf.co/datasets/nvidia/Granary

artificial intelligence, machine learning, natural language, (17 more...)

2505.13404

Country:

Europe (0.28)
North America > United States (0.14)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.91)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.86)

FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation

Zhu, Shaolin, Dong, Tianyu, Li, Bo, Xiong, Deyi

In this paper, we present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). We adopt a two-stage strategy to train FuxiMT. We first pre-train the model on a massive Chinese corpus and then conduct multilingual fine-tuning on a large parallel dataset encompassing 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels. Experimental results demonstrate that FuxiMT significantly outperforms strong baselines, including state-of-the-art LLMs and machine translation models, particularly under low-resource scenarios. Furthermore, FuxiMT exhibits remarkable zero-shot translation capabilities for unseen language pairs, indicating its potential to bridge communication gaps where parallel data are scarce or unavailable.

artificial intelligence, large language model, natural language, (16 more...)

2505.14256

Country:

Europe (1.00)
Asia > China (0.67)
North America > United States > California (0.28)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

TransBench: Benchmarking Machine Translation for Industrial-Scale Applications

Li, Haijun, Shi, Tianqi, Shang, Zifu, Han, Yuxuan, Zhao, Xueyu, Wang, Hao, Qian, Yu, Qian, Zhiqiang, Xu, Linlong, Wu, Minghao, Lyu, Chenyang, Wang, Longyue, Tang, Gongbo, Luo, Weihua, Xu, Zhao, Zhang, Kaifu

Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.

machine learning, natural language, translation, (17 more...)

2505.14244

Country:

North America > United States (1.00)
Asia (1.00)
Europe > United Kingdom > England (0.28)

Genre: Research Report (0.50)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.93)
Information Technology > Services (0.76)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Liang, Yunlong, Meng, Fandong, Zhou, Jie

THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation

The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, there exist two limitations in current MoE solutions which may lead to sub-optimal performance: 1) they directly use the task knowledge of NMT into MoE (\emph{e.g.}, domain/linguistics-specific knowledge), which are generally unavailable at practical application and neglect the naturally grouped domain/linguistic properties; 2) the expert selection only depends on the localized token representation without considering the context, which fully grasps the state of each token in a global view. To address the above limitations, we propose THOR-MoE via arming the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) firstly predicts the domain/language label and then extracts mixed domain/language representation to allocate task-level experts in a hierarchical manner; 2) injects the context information to enhance the token routing from the pre-selected task-level experts set, which can help each token to be accurately routed to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, the THOR-MoE operates as a plug-and-play module compatible with existing Top-$k$~\cite{shazeer2017} and Top-$p$~\cite{huang-etal-2024-harder} routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-$p$~\cite{huang-etal-2024-harder} routing, the context-aware manner can achieve an average improvement of 0.75 BLEU with less than 22\% activated parameters on multi-domain translation tasks.

artificial intelligence, computational linguistic, natural language, (17 more...)

2505.14173

Country:

Asia (0.94)
Europe (0.93)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Bankula, Ajitesh, Bankula, Praney

Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology

Cross-lingual transfer has become a crucial aspect of multilingual NLP, as it allows for models trained on resource-rich languages to be applied to low-resource languages more effectively. Recently massively multilingual pre-trained language models (e.g., mBERT, XLM-R) demonstrate strong zero-shot transfer capabilities[14] [13]. This paper investigates cross-linguistic transfer through the lens of language families and morphology. Investigating how language family proximity and morphological similarity affect performance across NLP tasks. We further discuss our results and how it relates to findings from recent literature. Overall, we compare multilingual model performance and review how linguistic distance metrics correlate with transfer outcomes. We also look into emerging approaches that integrate typological and morphological information into model pre-training to improve transfer to diverse languages[18] [19].

artificial intelligence, machine learning, natural language, (18 more...)

2505.13908

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)

Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation

Wu, Zhanglin, Wei, Daimeng, Chen, Xiaoyu, Shang, Hengchao, Guo, Jiaxin, Li, Zongyao, Luo, Yuanchang, Yang, Jinlong, Rao, Zhiqiang, Yang, Hao

Large language model (LLM) shows promising performances in a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases, translations using LLMs are comparable to that generated by neural machine translation (NMT) systems. Only in particular scenarios, LLM and NMT models show respective advantages. As a result, integrating NMT and LLM for translation and using LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes translation result while ensuring fast speed and as little LLM usage as possible is thereby required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets and the result shows that we can achieve optimal translation performance with minimal LLM usage, demonstrating effectiveness of our decider.

large language model, natural language, translation, (18 more...)

2505.13554

Country: Asia (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Rahman, Md. Atiqur, Islam, Sabrina, Omi, Mushfiqul Haque

LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark

arXiv.org Artificial IntelligenceMay-20-2025

Evaluating machine translation (MT) for low - resource languages poses a persistent challenge, primarily due to the limited availability of high - quality reference translations. This issue is further exacerbated in languages with multiple dialects, where linguistic diversity and data scarcity hinder robust evaluation. Large Language Models (LLMs) present a promising solution through reference - free evaluation techniques; however, their effectiveness diminishes in the absence of dialect - specific context and tailored guidance. In this work, we propose a comprehensive framework that enhances LLM - based MT evaluation using a dialect guided approach. We extend the ONUBAD dataset by incorporating Sylheti - English sentence pairs, corresponding machine - translations, and Direct Assessment (DA) scores annotated by native speakers. To address the vocabulary gap, we augment the tokenizer vocabulary with dialect - specific terms. We further introduce a regression head to enable scalar score prediction and design a dialect - guided (DG) prompting strategy. Our evaluation across multiple LLMs shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spear-man correlation, along with improvements across other evaluation settings. The dataset and the code are available at https://github.com/180041123 - Atiq/MTEonLowResourceLanguage .

artificial intelligence, large language model, natural language, (14 more...)

2505.12273

Country:

Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceMay-20-2025

KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Koneru, Sai, Züfle, Maike, Nguyen, Thai-Binh, Akti, Seymanur, Niehues, Jan, Waibel, Alexander

The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage to further enhance output quality by using contextual information.

large language model, natural language, translation, (18 more...)

2505.13036

Country:

Asia > Thailand (0.28)
North America > Mexico (0.28)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.24)

Genre: Research Report > New Finding (0.93)

Industry:

Government (0.68)
Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceMay-20-2025

ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning

Wang, Jiaan, Meng, Fandong, Zhou, Jie

In recent years, the emergence of large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, has shown impressive capabilities in complex problems, e.g., mathematics and coding. Some pioneering studies attempt to bring the success of LRMs in neural machine translation (MT). They try to build LRMs with deep reasoning MT ability via reinforcement learning (RL). Despite some progress that has been made, these attempts generally focus on several high-resource languages, e.g., English and Chinese, leaving the performance on other languages unclear. Besides, the reward modeling methods in previous work do not fully unleash the potential of reinforcement learning in MT. In this work, we first design a new reward modeling method that compares the translation results of the policy MT model with a strong LRM (i.e., DeepSeek-R1-671B), and quantifies the comparisons to provide rewards. Experimental results demonstrate the superiority of the reward modeling method. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves the new state-of-the-art performance in literary translation, and outperforms strong LRMs including OpenAI-o1 and DeepSeeK-R1. Furthermore, we extend our method to the multilingual settings with 11 languages. With a carefully designed lightweight reward modeling in RL, we can simply transfer the strong MT ability from a single direction into multiple (i.e., 90) translation directions and achieve impressive multilingual MT performance.

large language model, machine learning, translation, (19 more...)

2505.12996

Country:

Asia (0.68)
North America > Mexico (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.45)