AITopics

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
(4 more...)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
(4 more...)

Neural Information Processing SystemsNov-20-2025, 13:02:17 GMT

Structured Prediction with Stronger Consistency Guarantees

machine learning, natural language, prediction, (21 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
(4 more...)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
(4 more...)

Neural Information Processing SystemsNov-20-2025, 04:17:01 GMT

cdf00c97c0cb2cc35179f03363da6c4f-Paper-Conference.pdf

artificial intelligence, machine learning, natural language, (18 more...)

Country:

North America > United States > Washington > King County > Seattle (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

Chigrupaatii, Rishikant, Kanishka, Ponnada Sai Tulasi, Routhu, Lalit Chandra, Reddy, Martin Patel Sama Supratheek, Gupta, Divyam, Srikar, Dasari, Kuchimanchi, Krishna Teja, Misra, Rajiv, Tripathi, Rohun

HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples

With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.

large language model, machine learning, natural language, (19 more...)

2511.15183

Country:

North America > United States (0.28)
Asia > India (0.25)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Meyer, Francois, Buys, Jan

The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offers a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

computational linguistic, machine learning, natural language, (19 more...)

2511.09197

Country:

Europe (1.00)
Asia (1.00)
Africa (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

Bedrick, Steven, Doğruöz, A. Seza, Nisioi, Sergiu

A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and being used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.

data mining, large language model, machine learning, (22 more...)

2505.03025

Country:

North America > United States (1.00)
Asia (1.00)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)

Genre: Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Providers & Services (0.67)
Health & Medicine > Consumer Health (0.67)
Health & Medicine > Health Care Technology > Medical Record (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
(4 more...)

Valentini, Francisco, Kozlowski, Diego, Larivière, Vincent

CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from Érudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.

computational linguistic, large language model, machine learning, (14 more...)

doi: 10.18653/v1/2025.mrl-main.16

2504.16264

Country:

Europe (1.00)
Asia > Middle East > UAE (0.46)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Neural Information Processing SystemsNov-19-2025, 18:38:58 GMT

Instruction Tuning With Loss Over Instructions

Further analysis substantiates our hypothesis that our improvement can be attributed to reduced overfitting to instruction tuning datasets. It is worth noting that we are not proposing IM as a replacement for the current instruction tuning process. Instead, our work aims to provide practical guidance for instruction tuning LMs, especially in low-resource scenarios.

large language model, machine learning, natural language, (20 more...)

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Middle East > Jordan (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Neural Information Processing SystemsNov-19-2025, 05:24:45 GMT

dc960c46c38bd16e953d97cdeefdbc68-AuthorFeedback.pdf

artificial intelligence, natural language, transformer, (19 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.30)

arXiv.org Artificial IntelligenceNov-19-2025

Can QE-informed (Re)Translation lead to Error Correction?

Padmanabhan, Govardhan

The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach - QE-informed Retranslation, and compare it with another within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved a Delta COMET score of 0.0201 and -0.0108, respectively, leading the first approach to achieve the winning position on the subtask leaderboard.

large language model, machine learning, natural language, (17 more...)

2511.13884

Country:

North America > United States (0.29)
Europe > United Kingdom (0.28)
Asia (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)