AITopics

2509.18987

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)
Asia > Middle East > UAE (0.28)

Genre:

Research Report > New Finding (0.47)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Anonto, Riad Ahmed, Zabin, Sardar Md. Saffat, Rahman, M. Saifur

Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

arXiv.org Artificial IntelligenceSep-24-2025

Grounding vision--language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN--BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real--synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real--synthetic centroid gap by 41%.

large language model, machine learning, natural language, (21 more...)

2509.18369

Country:

North America (0.28)
Asia (0.28)
Europe (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Meng, Chutong, Koehn, Philipp

Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

arXiv.org Artificial IntelligenceSep-24-2025

We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.

artificial intelligence, machine learning, natural language, (17 more...)

2509.1836

Country:

Europe (1.00)
Asia > Middle East > UAE (0.46)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.47)

Industry: Materials > Metals & Mining (0.59)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)

Communications of the ACMSep-23-2025, 17:32:46 GMT

Not Every AI Problem Is a Data Problem

Membership in ACM includes a subscription to Communications of the ACM (CACM), the computing industry's most trusted source for staying connected to the world of advanced computing. Why we should be intentional about data scaling. Large language models (LLMs) have revolutionized the AI landscape, demonstrating remarkable capabilities across a wide range of tasks. Each new model seemingly reinforces the notion that modern transformer-based AI can conquer any challenge if armed with sufficient compute and data. However, while scaling has accelerated certain applications, such as robotics, it has yet to show significant impact in others, such as identifying misinformation.

artificial intelligence, data-driven scaling, quality data, (13 more...)

Communications of the ACM

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Santa Clara County > Mountain View (0.05)

Industry:

Information Technology (0.47)
Media > News (0.37)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Zhuang, Wenhao, Sun, Yuan

CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source CUTE Chinese, Uyghur, Tibetan,English dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.

artificial intelligence, large language model, natural language, (15 more...)

2509.16914

Country: Asia > China (0.15)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Villa-Cueva, Emilio, Bolatzhanova, Sholpan, Turmakhan, Diana, Elzeky, Kareem, Ademtew, Henok Biadglign, Aji, Alham Fikri, Araujo, Vladimir, Azime, Israel Abebe, Baek, Jinheon, Belcavello, Frederico, Cristobal, Fermin, Cruz, Jan Christian Blaise, Dabre, Mary, Dabre, Raj, Ehsan, Toqeer, Etori, Naome A, Farooqui, Fauzan, Geng, Jiahui, Ivetta, Guido, Jayakumar, Thanmay, Jeong, Soyeong, Lim, Zheng Wei, Mandal, Aishik, Martinelli, Sofia, Mihaylov, Mihail Minkov, Orel, Daniil, Pramanick, Aniket, Purkayastha, Sukannya, Salazar, Israfel, Song, Haiyue, Torrent, Tiago Timponi, Yadeta, Debela Desalegn, Hamed, Injy, Tonja, Atnafu Lambebo, Solorio, Thamar

Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.

machine learning, natural language, translation, (18 more...)

2505.24456

Country:

Asia (1.00)
Europe (0.93)
Africa (0.68)
South America > Argentina (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Whisper-UT: A Unified Translation Framework for Speech and Text

Xiao, Cihan, Wiesner, Matthew, Chakraborty, Debashish, Kriz, Reno, Cunningham, Keith, Murray, Kenton, Duh, Kevin, Tavarez-Arce, Luis, McNamee, Paul, Khudanpur, Sanjeev

Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a 2-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.

machine learning, natural language, translation, (17 more...)

2509.16375

Country:

Europe (0.46)
Asia (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Rowe, Jacqueline, Klimaszewski, Mateusz, Guillou, Liane, Vallor, Shannon, Birch, Alexandra

EuroGEST: Investigating gender stereotypes in multilingual language models

Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.

large language model, machine learning, natural language, (18 more...)

2506.03867

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Guan, Yiwen, Whitehill, Jacob

Transformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation

Multilingual translation faces challenges of computational redundancy and limited accuracy for low-resource languages, especially in speech translation. To address this, we propose a novel hierarchical Transformer Encoder Tree (TET) combined with non-autoregressive encoder-only models trained with Connectionist Temporal Classification for multilingual translation. By sharing intermediate representations among linguistically similar target languages, TET can improve accuracy on low-resource languages, reduce computational redundancy, and allow generating all target languages in a single forward pass, thus eliminating sequential bottlenecks and improving parallelism. For speech translation, combining TET with a non-autoregressive speech recognition backbone (wav2vec2) shows promising results in terms of translation quality compared to autoregressive systems while being 7-14 times faster.

artificial intelligence, natural language, translation, (17 more...)

2509.1793

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Ahsan, Arafat, Mujadia, Vandan, Mishra, Pruthwik, Bhaskar, Yash, Sharma, Dipti Misra

Crosslingual Optimized Metric for Translation Assessment of Indian Languages

Automatic evaluation of translation remains a challenging task owing to the orthographic, morphological, syntactic and semantic richness and divergence observed across languages. String-based metrics such as BLEU have previously been extensively used for automatic evaluation tasks, but their limitations are now increasingly recognized. Although learned neural metrics have helped mitigate some of the limitations of string-based approaches, they remain constrained by a paucity of gold evaluation data in most languages beyond the usual high-resource pairs. In this present work we address some of these gaps. We create a large human evaluation ratings dataset for 13 Indian languages covering 21 translation directions and then train a neural translation evaluation metric named Cross-lingual Optimized Metric for Translation Assessment of Indian Languages (COMTAIL) on this dataset. The best performing metric variants show significant performance gains over previous state-of-the-art when adjudging translation pairs with at least one Indian language. Furthermore, we conduct a series of ablation studies to highlight the sensitivities of such a metric to changes in domain, translation quality, and language groupings. We release both the COMTAIL dataset and the accompanying metric models.

large language model, machine learning, natural language, (16 more...)

2509.17667

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)