Goto

Collaborating Authors

 Machine Translation


Structured Prediction with Stronger Consistency Guarantees

Neural Information Processing Systems

In most applications, the output labels of learning problems have some structure that is crucial to consider. This includes natural language processing applications, where the output may be a sentence, a sequence of parts-of-speech tags, a parse tree, or a dependency graph.


Structured Prediction with Stronger Consistency Guarantees

Neural Information Processing Systems

In most applications, the output labels of learning problems have some structure that is crucial to consider. This includes natural language processing applications, where the output may be a sentence, a sequence of parts-of-speech tags, a parse tree, or a dependency graph.



HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples

arXiv.org Artificial Intelligence

With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.


The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

arXiv.org Artificial Intelligence

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offers a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.


A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

arXiv.org Artificial Intelligence

Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and being used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.


CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

arXiv.org Artificial Intelligence

Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from ร‰rudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.


Instruction Tuning With Loss Over Instructions

Neural Information Processing Systems

Further analysis substantiates our hypothesis that our improvement can be attributed to reduced overfitting to instruction tuning datasets. It is worth noting that we are not proposing IM as a replacement for the current instruction tuning process. Instead, our work aims to provide practical guidance for instruction tuning LMs, especially in low-resource scenarios.



Can QE-informed (Re)Translation lead to Error Correction?

arXiv.org Artificial Intelligence

The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach - QE-informed Retranslation, and compare it with another within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved a Delta COMET score of 0.0201 and -0.0108, respectively, leading the first approach to achieve the winning position on the subtask leaderboard.