AITopics

2508.18549

Country:

Europe (0.67)
North America > United States (0.46)

Genre: Research Report (0.64)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Liu, Linfeng, Ghosh, Saptarshi, Jiang, Tianyu

Evaluating the Impact of Verbal Multiword Expressions on Machine Translation

arXiv.org Artificial Intelligence Aug-26-2025

Verbal multiword expressions (VMWEs) present significant challenges for natural language processing due to their complex and often non-compositional nature. While machine translation models have seen significant improvement with the advent of language models in recent years, accurately translating these complex linguistic structures remains an open problem. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and sentences containing these language phenomena extracted from machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality. We also propose an LLM-based paraphrasing approach that replaces these expressions with their literal counterparts, demonstrating significant improvement in translation quality for verbal idioms and verb-particle constructions.

machine learning, natural language, translation, (17 more...)

2508.17458

Country:

Europe (1.00)
Asia (0.67)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Sen, Jaydip, Dasgupta, Subhasis, Waghela, Hetvi

Confidence-Modulated Speculative Decoding for Large Language Models

arXiv.org Artificial Intelligence Aug-26-2025

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.

Keywords: Speculative Decoding, Autoregressive Models, Confidence Estimation, Adaptive Inference, Entropy-Based Drafting, Sequence Generation, Large Language Models (LLMs), Information-Theoretic Decoding.

The task of sequence generation lies at the heart of numerous applications in natural language processing, including machine translation, text summarization, dialogue generation, and code synthesis. In the overwhelming majority of these applications, autoregressive (AR) decoding remains the dominant paradigm for generating sequences from a probabilistic language model [1-2]. Autoregressive models, particularly those based on the Transformer architecture, operate by predicting each token conditioned on the entire history of previously generated tokens. This left-to-right decoding strategy, though optimal in terms of likelihood estimation, suffers from a fundamental limitation: the inherently sequential nature of generation prohibits efficient parallelization, severely hindering inference throughput, especially in latency-sensitive deployment scenarios.
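As a rough illustration of the confidence-modulated drafting idea summarized in the abstract above, the sketch below maps the drafter's next-token distribution to a draft length using an entropy term and a top-two margin term. The function names, the equal weighting of the two signals, and the `min_len`/`max_len` bounds are all assumptions made for this example; the listing does not give the paper's actual formulas.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin(probs):
    """Gap between the two largest probabilities (larger means more confident)."""
    top = sorted(probs, reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)

def draft_length(probs, min_len=1, max_len=8):
    """Hypothetical rule: draft more tokens when the drafter is confident.

    Confidence is the average of (1 - normalized entropy) and the margin;
    this 50/50 blend is an assumption, not the paper's definition.
    """
    vocab = len(probs)
    max_entropy = math.log(vocab) if vocab > 1 else 1.0
    entropy_conf = 1.0 - entropy(probs) / max_entropy
    confidence = 0.5 * entropy_conf + 0.5 * margin(probs)
    confidence = max(0.0, min(1.0, confidence))  # guard against rounding drift
    return min_len + round(confidence * (max_len - min_len))

if __name__ == "__main__":
    print(draft_length([0.25, 0.25, 0.25, 0.25]))  # uncertain drafter
    print(draft_length([0.7, 0.1, 0.1, 0.1]))      # moderately confident
    print(draft_length([1.0, 0.0, 0.0, 0.0]))      # fully confident
```

A flat distribution yields the minimum draft length and a one-hot distribution yields the maximum, which is the qualitative behavior the abstract describes.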

computational linguistic, large language model, machine learning, (18 more...)

2508.15371

Country: Asia > India (0.68)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

arXiv.org Artificial Intelligence Aug-26-2025

Preliminary Ranking of WMT25 General Machine Translation Systems

Kocmi, Tom, Avramidis, Eleftherios, Bawden, Rachel, Bojar, Ondřej, Dranch, Konstantin, Dvorkovich, Anton, Dukanov, Sergey, Fedorova, Natalia, Fishel, Mark, Freitag, Markus, Gowda, Thamme, Grundkiewicz, Roman, Haddow, Barry, Karpinska, Marzena, Koehn, Philipp, Lakougna, Howard, Lundin, Jessica, Murray, Kenton, Nagata, Masaaki, Perrella, Stefano, Proietti, Lorenzo, Popel, Martin, Popović, Maja, Riley, Parker, Shmatova, Mariya, Steingrímsson, Steinþór, Yankovskaya, Lisa, Zouhar, Vilém

We present the preliminary rankings of machine translation (MT) systems submitted to the WMT25 General Machine Translation Shared Task, as determined by automatic evaluation metrics. Because these rankings are derived from automatic evaluation, they may exhibit a bias toward systems that employ re-ranking techniques, such as Quality Estimation or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede these results. The purpose of releasing these findings now is to assist task participants with their system description papers, not to provide final findings.

artificial intelligence, machine learning, natural language, (16 more...)

2508.14909

Country: Asia (0.67)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.76)

The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks

Hopton, Zachary, Vamvas, Jannis, Büchler, Andrin, Rutkiewicz, Anna, Cathomas, Rico, Sennrich, Rico

The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.

artificial intelligence, machine translation, natural language, (16 more...)

2508.16371

Country:

North America > United States (0.68)
Europe > Switzerland (0.67)

Genre: Research Report (0.50)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Nagata, Masaaki, Chousa, Katsuki, Yasuda, Norihito

JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus

We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publications of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 BLEU points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.

application, artificial intelligence, natural language, (14 more...)

2508.16303

Country:

Asia > Japan > Honshū (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Industry:

Law > Intellectual Property & Technology Law (1.00)
Government > Regional Government > North America Government > United States Government (0.56)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency

Shen, Zhanming, Chen, Hao, Tang, Yulei, Zhu, Shaolin, Ye, Wentao, Hu, Xiaomeng, Wang, Haobo, Chen, Gang, Zhao, Junbo

Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models (an answer generator and a question generator) are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart's generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct's efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.

large language model, machine learning, natural language, (18 more...)

2508.161

Genre: Research Report (0.82)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Foroutan, Negar, Meister, Clara, Paul, Debjit, Niklaus, Joel, Ahmadi, Sina, Bosselut, Antoine, Sennrich, Rico

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
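The merge rule described in the abstract above (at each step, favor the currently worst-compressed language) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. It assumes the per-language corpora are parallel, so that the language with the most tokens counts as the worst compressed, and it picks that language's most frequent adjacent pair as the merge. The function names and the toy data are hypothetical.

```python
from collections import Counter

def pair_counts(sequences):
    """Count adjacent token pairs over a list of token sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def parity_aware_merge_step(corpora):
    """One merge step over `corpora`, a dict of language -> list of token lists.

    Assumption: corpora are parallel, so total token count is a fair proxy
    for how poorly each language is currently compressed.
    """
    worst = max(corpora, key=lambda lang: sum(len(s) for s in corpora[lang]))
    counts = pair_counts(corpora[worst])
    if not counts:
        return None  # nothing left to merge in the worst-compressed language
    pair = counts.most_common(1)[0][0]
    # The chosen merge joins the shared vocabulary, so apply it everywhere.
    for lang in corpora:
        corpora[lang] = [apply_merge(s, pair) for s in corpora[lang]]
    return worst, pair

if __name__ == "__main__":
    toy = {"en": [list("aaaa")], "xx": [list("ababab")]}
    print(parity_aware_merge_step(toy))
    print(parity_aware_merge_step(toy))
    print(toy)
```

Standard BPE would instead choose the pair with the highest count pooled over all languages; the only change in this sketch is which language's statistics drive the choice at each step.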

computational linguistic, machine learning, natural language, (18 more...)

2508.04796

Country:

Europe (0.67)
North America > United States (0.28)
North America > Mexico (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.46)

Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization

Liu, Yuntian, Zhu, Tao, Liu, Xiaoyang, Chen, Yu, Liu, Zhaoxuan, Guo, Qingfeng, Zhang, Jiashuo, Bao, Kangjie, Luo, Tao

Statement autoformalization, the automated translation of statements from natural language into formal languages, has become a subject of extensive research, yet the development of robust automated evaluation metrics remains limited. Existing evaluation methods often lack semantic understanding, face challenges with high computational costs, and are constrained by the current progress of automated theorem proving. To address these issues, we propose GTED (Generalized Tree Edit Distance), a novel evaluation framework that first standardizes formal statements and converts them into operator trees, then determines the semantic similarity using the eponymous GTED metric. Across the miniF2F and ProofNet benchmarks, GTED consistently ranks as a top-performing metric, achieving the highest accuracy and Kappa on miniF2F and the joint-highest accuracy on ProofNet. This strong overall performance provides the community with a computationally lightweight and more faithful metric for automated evaluation. The code and experimental results are available at https://github.com/XiaoyangLiu-sjtu/GTED.

formal statement, logic & formal reasoning, machine learning, (16 more...)

2507.07399

Country: Europe (0.46)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)

Neural Information Processing Systems Aug-22-2025, 02:26:33 GMT

f63f65b503e22cb970527f23c9ad7db1-AuthorFeedback.pdf

machine translation, new version, transformer tts, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.35)