AITopics

This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test dropped to $0.11\pm0.06$ BLEU points.

machine learning, natural language, translation, (14 more...)

2510.19413

Country:

North America > United States > Pennsylvania (0.14)
Europe > Germany > Saarland (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report (1.00)

Industry: Education > Curriculum > Subject-Specific Education (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)

Hamidullah, Yasser, Yazdani, Shakib, Oguz, Cennet, van Genabith, Josef, España-Bonet, Cristina

SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision

Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.

artificial intelligence, natural language, translation, (16 more...)

2510.19398

Country:

Asia (0.68)
Europe > Spain > Catalonia (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Education > Curriculum > Subject-Specific Education (0.67)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Hamidullah, Yasser, van Genabith, Josef, España-Bonet, Cristina

Sign Language Translation with Sentence Embedding Supervision

State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end2end manner or by involving an intermediate step. Unfortunately, gloss labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.

machine learning, natural language, translation, (12 more...)

doi: 10.18653/v1/2024.acl-short.40

2510.19367

Country:

Europe > Germany (0.68)
North America > United States (0.47)

Genre: Research Report > New Finding (0.46)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges

Huang, Cheng, Tashi, Nyima, Gao, Fan, Liu, Yutong, Li, Jiahao, Tian, Hao, Jiang, Siyang, Tsering, Thupten, Ma-bao, Ban, Duojie, Renzeg, Luosang, Gadeng, Dongrub, Rinchen, Tashi, Dorje, Zhang, Jin, Feng, Xiao, Wang, Hao, Tang, Jie, Tang, Guojie, Wang, Xiangxiang, Zhang, Jia, Lee, Tsengdar, Yu, Yongbin

Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI in the AI domain, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future work on Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.

large language model, machine learning, natural language, (21 more...)

2510.19144

Country:

North America > United States (1.00)
Asia > China > Hong Kong (0.28)

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry:

Social Sector (0.67)
Health & Medicine (0.67)
Information Technology > Security & Privacy (0.45)
Education > Assessment & Standards (0.45)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
(7 more...)

Oni, Mangsura Kabir, Prama, Tabia Tanzin

Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti

WORK Although the findings highlight the effectiveness of fine - tuned transformer models for Bengali - Sylheti translation, several limitations remain. The dataset size (5,002 parallel sentences) restricts the models' capacity to generalize across diverse syntactic structures, stylistic variations, and domain - specific expressions. In addition, orthographic inconsistencies in Sylheti introduce noise, leading to training instability, particularly in models like mBART - 50. Another limitation is the reliance on automatic evaluation metrics such as BLEU and chrF, which may not fully capture the linguistic richness or cultural nuance of Sylheti. Future research should therefore focus on expanding the datas et through community - driven contributions and data augmentation strategies. Incorporating orthographic normalization could improve consistency and reduce variability during training. Hybrid approaches that combine the strengths of pre - trained LLMs with fin e - tuned NMT models may also enhance translation robustness in low - resource settings. Finally, incorporating human evaluation will provide a more comprehensive assessment of translation adequacy, fluency, and cultural alignment.

large language model, machine learning, translation, (19 more...)

2510.18898

Country:

Asia (0.28)
North America > United States > Vermont (0.28)

Genre:

Overview (0.68)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Robustness Assessment and Enhancement of Text Watermarking for Google's SynthID

Han, Xia, Li, Qi, Ni, Jianbing, Zulkernine, Mohammad

Recent advances in LLM watermarking methods such as SynthID-Text by Google DeepMind offer promising solutions for tracing the provenance of AI-generated text. However, our robustness assessment reveals that SynthID-Text is vulnerable to meaning-preserving attacks, such as paraphrasing, copy-paste modifications, and back-translation, which can significantly degrade watermark detectability. To address these limitations, we propose SynGuard, a hybrid framework that combines the semantic alignment strength of Semantic Information Retrieval (SIR) with the probabilistic watermarking mechanism of SynthID-Text. Our approach jointly embeds watermarks at both lexical and semantic levels, enabling robust provenance tracking while preserving the original meaning. Experimental results across multiple attack scenarios show that SynGuard improves watermark recovery by an average of 11.1\% in F1 score compared to SynthID-Text. These findings demonstrate the effectiveness of semantic-aware watermarking in resisting real-world tampering. All code, datasets, and evaluation scripts are publicly available at: https://github.com/githshine/SynGuard.

large language model, machine learning, natural language, (21 more...)

2508.20228

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Ke, Changxin, Zhang, Rui, Wang, Shuo, Ding, Li, Li, Guangli, Wen, Yuanbo, Zhang, Shuoming, Xu, Ruiyuan, Qin, Jin, Guo, Jiaming, Wang, Chenxi, Li, Ling, Guo, Qi, Chen, Yunji

The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose \textbf{QiMeng-MuPa}, a novel \textbf{Mu}tual-Supervised Learning framework for Sequential-to-\textbf{Pa}rallel code translation, to address the functional equivalence issue. QiMeng-MuPa consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/mupa.

large language model, machine learning, natural language, (21 more...)

2506.11153

Country:

Asia (0.46)
Europe (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)

arXiv.org Artificial IntelligenceOct-22-2025

See the Text: From Tokenization to Visual Reading

Xing, Ling, Wang, Alex Jinpeng, Yan, Rui, Qu, Hongyu, Li, Zechao, Tang, Jinhui

People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.

large language model, machine learning, tokenization, (19 more...)

2510.1884

Country: Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Staliūnaitė, Ieva Raminta, Cheng, Julius, Vlachos, Andreas

Uncertainty Quantification for Evaluating Machine Translation Bias

arXiv.org Artificial IntelligenceOct-22-2025

The predictive uncertainty of machine translation (MT) models is typically used as a quality estimation proxy. In this work, we posit that apart from confidently translating when a single correct translation exists, models should also maintain uncertainty when the input is ambiguous. We use uncertainty to measure gender bias in MT systems. When the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and can be susceptible to biases. Prior work measured bias via gender accuracy, however it cannot be applied to ambiguous cases. Using semantic uncertainty, we are able to assess bias when translating both ambiguous and unambiguous source sentences, and find that high translation accuracy does not correlate with exhibiting uncertainty appropriately, and that debiasing affects the two cases differently.

artificial intelligence, computational linguistic, natural language, (12 more...)

2507.18338

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceOct-22-2025

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Shen, Yingli, Lai, Wen, Wang, Shuo, Gao, Ge, Luo, Kangyang, Fraser, Alexander, Sun, Maosong

Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.

computational linguistic, large language model, natural language, (13 more...)

2505.14045

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Energy (0.92)
Education (0.68)
Government (0.67)
Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)