MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

Zgreabăn, Mădălina, Deoskar, Tejaswini, Abzianidze, Lasha

Oct-29-2025, arXiv.org Artificial Intelligence

In recent years, many generalization benchmarks have exposed language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words while, crucially, preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization); it evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.
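As a rough illustration of the idea in the abstract, the sketch below replaces one open-class word consistently in the premise and hypothesis of an NLI problem and then scores a model on the resulting variants. It is not the authors' pipeline: the names `NLIProblem`, `make_variants`, `variant_accuracy`, and `toy_predict` are hypothetical, the replacement words are supplied by hand, and the predictor is a deliberately brittle stand-in for a real NLI model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class NLIProblem:
    premise: str
    hypothesis: str
    label: str  # e.g. "entailment", "neutral", "contradiction"


def make_variants(problem: NLIProblem, target: str,
                  replacements: List[str]) -> List[NLIProblem]:
    """Swap every whole-word occurrence of `target` in both sentences.

    The gold label is copied unchanged, on the assumption that the
    replacement preserves the reasoning of the original problem.
    """
    pattern = re.compile(rf"\b{re.escape(target)}\b")
    return [
        NLIProblem(
            premise=pattern.sub(lambda _m, w=word: w, problem.premise),
            hypothesis=pattern.sub(lambda _m, w=word: w, problem.hypothesis),
            label=problem.label,
        )
        for word in replacements
    ]


def variant_accuracy(predict: Callable[[str, str], str],
                     problems: List[NLIProblem]) -> float:
    """Fraction of problems whose predicted label matches the gold label."""
    correct = sum(predict(p.premise, p.hypothesis) == p.label for p in problems)
    return correct / len(problems)


def toy_predict(premise: str, hypothesis: str) -> str:
    # Hypothetical stand-in model: it only "recognises" entailment
    # when the word it was tuned on appears in the premise.
    return "entailment" if "dog" in premise else "neutral"


original = NLIProblem(
    premise="A dog is sleeping on the sofa.",
    hypothesis="A dog is resting.",
    label="entailment",
)
variants = make_variants(original, "dog", ["cat", "child"])

print(variant_accuracy(toy_predict, [original]))  # accuracy on the original
print(variant_accuracy(toy_predict, variants))    # accuracy on the variants
```

A gap between the two printed scores is the kind of drop the generalization test is meant to surface; a real evaluation would swap in an actual NLI model and an automated, quality-controlled source of replacements.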

artificial intelligence, natural language, variant
