AITopics

2509.02506

Country:

North America > United States > California (0.46)
Europe > United Kingdom > England (0.28)

Genre: Research Report (1.00)

Industry:

Government (0.67)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

arXiv.org Artificial Intelligence, Sep-3-2025

chDzDT: Word-level morphology-aware language model for Algerian social media text

Aries, Abdelkrime

Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.
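As a rough illustration of what character-level encoding of isolated words can look like, here is a minimal sketch. It is not the chDzDT tokenizer: the function names, the `<pad>`/`<unk>` special symbols, the fixed `max_len`, and the toy word list are all assumptions made for this example.

```python
# Hypothetical sketch: encode isolated words as fixed-length character ID sequences.
# None of these names come from the chDzDT paper.

PAD_ID = 0  # assumed padding ID
UNK_ID = 1  # assumed ID for characters not seen when building the vocabulary


def build_char_vocab(words):
    """Assign an integer ID to every distinct character in the word list."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for ch in sorted({ch for word in words for ch in word}):
        vocab[ch] = len(vocab)
    return vocab


def encode_word(word, vocab, max_len):
    """Map one word to character IDs, truncated or padded to max_len."""
    ids = [vocab.get(ch, UNK_ID) for ch in word[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Toy word list mixing Latin and Arabic script (illustrative only).
toy_words = ["salam", "سلام", "merci"]
vocab = build_char_vocab(toy_words)

for word in toy_words:
    print(word, encode_word(word, vocab, max_len=8))
```

Because each word is encoded on its own, nothing in this sketch depends on sentence-level token boundaries or on a standardized spelling; any character outside the vocabulary simply falls back to the unknown ID.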

arabic, large language model, machine learning, (22 more...)

2509.01772

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

arXiv.org Artificial Intelligence, Sep-3-2025

Conditional Generative Adversarial Networks Based Inertial Signal Translation

Kolakowski, Marcin

The paper presents an approach in which inertial signals measured with a wrist-worn sensor (e.g., a smartwatch) are translated into those that would be recorded using a shoe-mounted sensor, enabling the use of state-of-the-art gait analysis methods. In the study, the signals are translated using Conditional Generative Adversarial Networks (GANs). Two different GAN versions are used for experimental verification: traditional ones trained using binary cross-entropy loss and Wasserstein GANs (WGANs). For the generator, two architectures, a convolutional autoencoder and a convolutional U-Net, are tested. The experimental results show that the proposed approach allows for an accurate translation, enabling the use of wrist sensor inertial signals for efficient, everyday gait analysis.

artificial intelligence, machine learning, natural language, (14 more...)

doi: 10.23919/SPSympo63739.2025.11124016

2509.00016

Country:

Europe > Poland (0.17)
Asia > Malaysia (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Niketan, Nripesh, Shrivastva, Vaatsalya

CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

arXiv.org Artificial Intelligence, Sep-1-2025

We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, and Mojo, with Slang support in development. Our system consists of: language-specific lexers/parsers converting source code to ASTs, bidirectional CrossGL translation modules implementing ToCrossGLConverter classes for importing code and CodeGen classes for target generation, and comprehensive backend implementations handling full translation pipelines. We demonstrate effectiveness through comprehensive evaluation across programming domains, achieving successful compilation and execution across all supported backends. The universal IR design enables adding new languages with minimal effort, requiring only language-specific frontend/backend components. Our contributions include: (1) a unified IR capturing semantics of multiple programming paradigms, (2) a modular architecture enabling extensibility, (3) a comprehensive framework supporting GPU compute, graphics programming, and systems languages, and (4) empirical validation demonstrating practical viability of universal code translation. CrossTL represents a significant step toward language-agnostic programming, enabling write-once, deploy-everywhere development.
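The scaling argument for a shared intermediate representation can be made concrete with a small count. The sketch below assumes one directed translator per ordered language pair in the pairwise design, and one importer plus one code generator per language in the IR design. The function names are hypothetical and the count is a simplification, not a description of CrossTL's actual module layout.

```python
# Hypothetical component count: direct pairwise translators vs. a shared IR.

def pairwise_translators(n):
    """One directed translator for every ordered pair of distinct languages."""
    return n * (n - 1)


def ir_components(n):
    """One importer (to the IR) and one code generator (from the IR) per language."""
    return 2 * n


# Languages listed in the abstract as currently supported.
languages = ["CUDA", "HIP", "Metal", "DirectX HLSL",
             "OpenGL GLSL", "Vulkan SPIR-V", "Rust", "Mojo"]
n = len(languages)

print(f"{n} languages, pairwise: {pairwise_translators(n)} translators")
print(f"{n} languages, shared IR: {ir_components(n)} components")
print(f"adding one language, pairwise: +{pairwise_translators(n + 1) - pairwise_translators(n)}")
print(f"adding one language, shared IR: +{ir_components(n + 1) - ir_components(n)}")
```

Under this simple count the pairwise figure grows quadratically with the number of languages, while the shared-IR figure grows linearly, which is why adding a language needs only one new frontend/backend pair.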

artificial intelligence, natural language, programming language, (16 more...)

doi: 10.5281/zenodo.15826975

2508.21256

Genre: Research Report (0.64)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

arXiv.org Artificial Intelligence, Sep-1-2025

Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

Yang, Han, Lan, Jian, Liu, Yihong, Schütze, Hinrich, Seidl, Thomas

Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.

artificial intelligence, machine learning, natural language, (17 more...)

2508.21206

Country:

Europe (0.95)
North America > United States (0.94)
Asia > Middle East > UAE (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

Marie, Benjamin, Fujita, Atsushi

The Uneven Impact of Post-Training Quantization in Machine Translation

Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.
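To give a feel for why lower bit widths lose more information, the following sketch applies plain min-max uniform quantization to a list of synthetic weights and measures the round-trip error. This is a generic textbook scheme chosen for illustration. It is not AWQ, BitsAndBytes, GGUF, or AutoRound, and the synthetic Gaussian weights and function names are assumptions.

```python
import random

# Illustrative only: min-max uniform quantization, not any method from the paper.

def quantize_dequantize(weights, bits):
    """Quantize to 2**bits evenly spaced levels between min and max, then map back."""
    lo, hi = min(weights), max(weights)
    steps = 2 ** bits - 1            # number of intervals between lo and hi
    scale = (hi - lo) / steps
    restored = []
    for w in weights:
        code = round((w - lo) / scale)   # integer code in [0, steps]
        restored.append(lo + code * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


rng = random.Random(0)               # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for bits in (8, 4, 2):
    error = mean_abs_error(weights, quantize_dequantize(weights, bits))
    print(f"{bits}-bit mean absolute error: {error:.4f}")
```

With only four representable levels at 2 bits, every weight is pushed much further from its original value than at 4 bits. This toy measurement says nothing about translation quality itself, which the paper evaluates directly.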

large language model, machine learning, quantization, (16 more...)

2508.20893

Country:

Europe (1.00)
North America > United States (0.69)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Proietti, Lorenzo, Perrella, Stefano, Zouhar, Vilém, Navigli, Roberto, Kocmi, Tom

Estimating Machine Translation Difficulty

Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. In this context, automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. In this work, we address this gap by formalizing the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging benchmarks for machine translation. Our results show that dedicated models outperform both heuristic-based methods and LLM-as-a-judge approaches, with Sentinel-src achieving the best performance. Thus, we release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.

machine learning, natural language, translation, (18 more...)

2508.10175

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Ramos, Miguel Moura, Fernandes, Patrick, Agrawal, Sweta, Martins, André F. T.

Multilingual Contextualization of Large Language Models for Document-Level Machine Translation

Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs. In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks. Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context. This enables models to better capture cross-sentence dependencies while maintaining strong sentence-level translation performance. Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.

computational linguistic, large language model, machine learning, (19 more...)

2504.1214

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark

Taguchi, Chihiro, Mai, Seng, Kurabe, Keita, Sakai, Yusuke, Agyei, Georgina, Eslami, Soudabeh, Chiang, David

Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.

artificial intelligence, benchmark, natural language, (15 more...)

2508.20511

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.69)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Sports (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence, Aug-28-2025

Step-Audio 2 Technical Report

Wu, Boyong, Yan, Chao, Hu, Chen, Yi, Cheng, Feng, Chengli, Tian, Fei, Shen, Feiyu, Yu, Gang, Zhang, Haoyang, Li, Jingbei, Chen, Mingrui, Liu, Peng, You, Wang, Zhang, Xiangyu Tony, Li, Xingyuan, Yang, Xuerui, Deng, Yayue, Huang, Yechang, Li, Yuxin, Zhang, Yuxin, You, Zhao, Li, Brian, Wan, Changyi, Hu, Hanpeng, Zhen, Jiangjie, Chen, Siyu, Yuan, Song, Zhang, Xuelin, Jiang, Yimin, Zhou, Yu, Yang, Yuxiang, Li, Bingxin, Ma, Buyun, Song, Changhe, Pang, Dongqing, Hu, Guoqiang, Sun, Haiyang, An, Kang, Wang, Na, Gao, Shuli, Ji, Wei, Li, Wen, Sun, Wen, Wen, Xuan, Ren, Yong, Ma, Yuankai, Lu, Yufan, Wang, Bin, Li, Bo, Miao, Changxin, Liu, Che, Xu, Chen, Shi, Dapeng, Hu, Dingyuan, Wu, Donghang, Liu, Enle, Huang, Guanzhe, Yan, Gulin, Zhang, Han, Nie, Hao, Jia, Haonan, Zhou, Hongyu, Sun, Jianjian, Wu, Jiaoren, Wu, Jie, Yang, Jie, Yang, Jin, Lin, Junzhe, Li, Kaixiang, Yang, Lei, Shi, Liying, Zhou, Li, Gu, Longlong, Li, Ming, Li, Mingliang, Li, Mingxiao, Wu, Nan, Han, Qi, Tan, Qinyuan, Pang, Shaoliang, Fan, Shengjie, Liu, Siqi, Cao, Tiancheng, Lu, Wanying, He, Wenqing, Xie, Wuxun, Zhao, Xu, Li, Xueqi, Yu, Yanbo, Yang, Yang, Liu, Yi, Lu, Yifan, Wang, Yilei, Ding, Yuanhao, Liang, Yuanwei, Lu, Yuanwei, Luo, Yuchu, Yin, Yuhe, Zhan, Yumeng, Zhang, Yuxiang, Yang, Zidong, Zhang, Zixin, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Chen, Jiansheng, Li, Jing, Zhang, Xiangyu, Zhu, Yibo

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

large language model, machine learning, natural language, (15 more...)

2507.16632

Country: Asia (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)